Haystack Tutorial (Python): Adding Memory to Agents for Advanced Developers
This tutorial shows how to give a Haystack agent persistent memory in Python, so it can remember prior turns, user preferences, and task context across interactions. You need this when a single request/response loop is not enough and the agent has to behave consistently over multiple calls.
What You'll Need
- Python 3.10+
- haystack-ai
- An OpenAI API key
- Optional: a .env file for local development
- Basic familiarity with Haystack Pipeline, ChatMessage, and tool calling
Install the package:
pip install haystack-ai python-dotenv
Set your API key:
export OPENAI_API_KEY="your-key"
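If you go the optional .env route instead, here is a minimal sketch using python-dotenv, assuming the key sits in a .env file next to your script:
# .env contains a line like: OPENAI_API_KEY=your-key
from dotenv import load_dotenv

load_dotenv()  # loads the variables from .env into os.environ for this process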
Step-by-Step
- Start with a minimal agent setup that can talk to an LLM and call tools. The memory layer will sit between user input and the agent, so keep the base pipeline clean.
from haystack import Pipeline
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

llm = OpenAIChatGenerator(model="gpt-4o-mini")
pipe = Pipeline()
pipe.add_component("llm", llm)

messages = [
    ChatMessage.from_system("You are a concise assistant."),
    ChatMessage.from_user("Remember that I prefer JSON responses."),
]

result = pipe.run({"llm": {"messages": messages}})
# Each reply is a ChatMessage; .text holds the generated text.
print(result["llm"]["replies"][0].text)
- Add a memory store using an in-process list for short-term state, then persist it between turns. In production you would swap this for Redis, Postgres, or another durable store, but the pattern stays the same; a sketch of that swap follows the code below.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ConversationMemory:
    messages: List[ChatMessage] = field(default_factory=list)

    def append_turn(self, user_text: str, assistant_text: str) -> None:
        self.messages.append(ChatMessage.from_user(user_text))
        self.messages.append(ChatMessage.from_assistant(assistant_text))

    def as_messages(self) -> List[ChatMessage]:
        return list(self.messages)

memory = ConversationMemory()
memory.messages.append(ChatMessage.from_system("You are a concise assistant."))
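As a rough sketch of that durable swap, here is a hypothetical Redis-backed variant with the same interface, continuing the same script. The key scheme is illustrative, and it assumes redis-py is installed and that your haystack-ai version supports ChatMessage.to_dict() and ChatMessage.from_dict() for round-tripping:
import json

import redis  # pip install redis

class RedisConversationMemory:
    """Same interface as ConversationMemory, but turns survive process restarts."""

    def __init__(self, session_id: str, client: redis.Redis | None = None) -> None:
        self.key = f"chat_memory:{session_id}"  # illustrative key scheme
        self.client = client or redis.Redis(host="localhost", port=6379)

    def append_turn(self, user_text: str, assistant_text: str) -> None:
        # Serialize each ChatMessage to JSON and append it to a Redis list.
        for msg in (ChatMessage.from_user(user_text), ChatMessage.from_assistant(assistant_text)):
            self.client.rpush(self.key, json.dumps(msg.to_dict()))

    def as_messages(self) -> List[ChatMessage]:
        # Rebuild ChatMessage objects from the stored JSON payloads.
        return [ChatMessage.from_dict(json.loads(item)) for item in self.client.lrange(self.key, 0, -1)]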
- Build a function that injects memory into each new prompt before calling the model. This is the important part: the agent does not “magically” remember anything unless you pass prior context back in.
def chat_with_memory(user_text: str) -> str:
    messages = memory.as_messages() + [ChatMessage.from_user(user_text)]
    result = pipe.run({"llm": {"messages": messages}})
    reply = result["llm"]["replies"][0].text
    memory.append_turn(user_text, reply)
    return reply
print(chat_with_memory("My name is Priya and I work in claims."))
print(chat_with_memory("What team do I work in?"))
- Add summarization so memory does not grow forever. For real agents, you want recent turns plus a compact summary of older context; otherwise token usage gets expensive and retrieval quality drops.
def summarize_memory() -> None:
    # Only compact once the history is long enough to be worth summarizing.
    if len(memory.messages) < 6:
        return
    recent = memory.messages[-4:]
    summary_prompt = [
        ChatMessage.from_system("Summarize the conversation facts relevant for future turns."),
        *memory.messages[:-4],
        ChatMessage.from_user("Produce a short memory summary."),
    ]
    summary_result = pipe.run({"llm": {"messages": summary_prompt}})
    summary_text = summary_result["llm"]["replies"][0].text
    # Replace everything older than the last four messages with a single summary.
    memory.messages = [
        ChatMessage.from_system(f"Conversation summary: {summary_text}"),
        *recent,
    ]
- Wire summarization into your chat loop so every few turns you compact older state automatically. This keeps the same interface while making the memory layer production-friendly.
def chat_with_compaction(user_text: str) -> str:
    summarize_memory()
    messages = memory.as_messages() + [ChatMessage.from_user(user_text)]
    result = pipe.run({"llm": {"messages": messages}})
    reply = result["llm"]["replies"][0].text
    memory.append_turn(user_text, reply)
    return reply
print(chat_with_compaction("I also prefer bullet points over long paragraphs."))
print(chat_with_compaction("What format should you use when answering me?"))
Testing It
Run the script and ask for a fact in one turn, then query it again in a later turn. If memory is wired correctly, the second response should reflect earlier user preferences or identity details without you re-sending them explicitly.
Check that older context still influences replies after several exchanges, but that token growth stays bounded once summarization kicks in. If responses start drifting, inspect whether your stored messages include the system prompt and whether compaction is overwriting important facts.
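A crude but effective boundedness check is to run a handful of turns and watch the stored message count; with the thresholds used above it should plateau instead of growing by two messages per turn:
for i in range(10):
    chat_with_compaction(f"Test turn {i}")
    # Expect the count to settle around one summary message plus the most
    # recent turns, rather than increasing without bound.
    print(i, len(memory.messages))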
For a stronger test, restart only the model client but keep your ConversationMemory object alive. The conversation should still continue from prior state because persistence lives outside the LLM call.
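A minimal sketch of that restart test, reusing the component name and model from earlier:
# Simulate a client restart: rebuild the pipeline, but keep `memory` untouched.
llm = OpenAIChatGenerator(model="gpt-4o-mini")
pipe = Pipeline()
pipe.add_component("llm", llm)

# The reply should still reflect facts from before the "restart".
print(chat_with_memory("Remind me what my name is."))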
Next Steps
- Replace ConversationMemory with a Redis- or Postgres-backed store for multi-session persistence.
- Add retrieval over long-term memories using Haystack document stores and embeddings.
- Split memory into categories: user profile, task state, and conversation history.
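If you pursue that last split, one possible shape for it, with illustrative field names:
@dataclass
class AgentMemory:
    # Stable facts about the user (name, role, format preferences).
    user_profile: dict = field(default_factory=dict)
    # What the agent is currently working on and how far it has gotten.
    task_state: dict = field(default_factory=dict)
    # Recent turns plus the rolling summary, as before.
    conversation: ConversationMemory = field(default_factory=ConversationMemory)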
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.