# How to Fix 'memory not persisting in production' in LlamaIndex (TypeScript)
## What this error actually means
If your LlamaIndex TypeScript app works in local tests but the conversation memory resets in production, the issue is usually not “memory is broken.” It means the memory object is either being recreated on every request, never persisted to a shared store, or lost because your deployment model is stateless.
The most common symptom is this: you call `chatEngine.chat()` or `agent.chat()`, get a sensible reply, then the next request behaves like a brand-new session. In logs, you may also see messages like `No existing chat history found for key ...`, or find that `ChatMemoryBuffer.fromDefaults()` comes back empty after a restart, depending on how you wired persistence.
## The Most Common Cause
The #1 cause is instantiating ChatMemoryBuffer or ContextChatEngine inside a request handler without persisting the backing store. In production, every serverless invocation, container restart, or hot reload gives you a fresh in-memory object.
Here’s the broken pattern:
| Broken | Fixed |
|---|---|
| Memory created per request | Memory created once and backed by persistent storage |
| Uses default in-memory state | Uses SimpleDocumentStore, Redis, Postgres, or another shared backend |
| Works locally in one process | Survives restarts and multiple instances |
```ts
// BROKEN: memory dies after each request
import { OpenAI, ChatMemoryBuffer, ContextChatEngine } from "llamaindex";

export async function POST(req: Request) {
  const { message } = await req.json();
  const llm = new OpenAI({ model: "gpt-4o-mini" });

  // New buffer every request = no persistence
  const memory = ChatMemoryBuffer.fromDefaults({
    tokenLimit: 3000,
  });

  const chatEngine = new ContextChatEngine({
    chatModel: llm,
    memory,
    retriever: myRetriever, // a retriever assumed to be built elsewhere in the module
  });

  const response = await chatEngine.chat(message);
  return Response.json({ answer: response.response });
}
```
```ts
// FIXED: reuse a shared persistent store
import { OpenAI, ChatMemoryBuffer, ContextChatEngine } from "llamaindex";
import { RedisKVStore } from "./redis-kv-store"; // your implementation (sketch below)

const kvStore = new RedisKVStore(process.env.REDIS_URL!);

export async function getMemory(sessionId: string) {
  return ChatMemoryBuffer.fromDefaults({
    tokenLimit: 3000,
    chatStore: kvStore,
    sessionId,
  });
}

export async function POST(req: Request) {
  const { message, sessionId } = await req.json();
  const llm = new OpenAI({ model: "gpt-4o-mini" });

  const memory = await getMemory(sessionId);

  const chatEngine = new ContextChatEngine({
    chatModel: llm,
    memory,
    retriever: myRetriever, // same retriever as before, built elsewhere
  });

  const response = await chatEngine.chat(message);
  return Response.json({ answer: response.response });
}
```
The important part is not the exact storage backend. It’s that the memory must be keyed by session and backed by something your production instance can read again later.
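The `./redis-kv-store` import above is left as "your implementation", so here is a minimal sketch of what it could look like. It assumes `ioredis` as the client and a chat-store shape with `getMessages`/`setMessages`/`addMessage`/`deleteMessages` keyed by session; check the chat store interface exported by your installed `llamaindex` version before wiring it in, since the exact methods `ChatMemoryBuffer` expects can differ between releases.

```ts
// redis-kv-store.ts: a sketch of the "your implementation" module imported above.
// Assumption: each session's history is stored as one JSON array under chat:<sessionId>.
import Redis from "ioredis";
import type { ChatMessage } from "llamaindex";

export class RedisKVStore {
  private redis: Redis;

  constructor(url: string, private prefix = "chat:") {
    this.redis = new Redis(url);
  }

  private key(sessionId: string): string {
    return `${this.prefix}${sessionId}`;
  }

  // Read the full message list for a session (empty array if nothing is stored yet).
  async getMessages(sessionId: string): Promise<ChatMessage[]> {
    const raw = await this.redis.get(this.key(sessionId));
    return raw ? (JSON.parse(raw) as ChatMessage[]) : [];
  }

  // Overwrite the stored message list for a session.
  async setMessages(sessionId: string, messages: ChatMessage[]): Promise<void> {
    await this.redis.set(this.key(sessionId), JSON.stringify(messages));
  }

  // Append a single message.
  async addMessage(sessionId: string, message: ChatMessage): Promise<void> {
    const messages = await this.getMessages(sessionId);
    messages.push(message);
    await this.setMessages(sessionId, messages);
  }

  // Drop the whole conversation.
  async deleteMessages(sessionId: string): Promise<void> {
    await this.redis.del(this.key(sessionId));
  }
}
```

Storing the whole message list as one JSON blob keeps the sketch simple; a Redis list or hash per session works just as well.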
## Other Possible Causes
### 1. You are not passing a stable session identifier
If every request gets a new sessionId, LlamaIndex will happily create a new conversation thread each time.
```ts
// BAD
const sessionId = crypto.randomUUID();

// GOOD
const sessionId = req.headers.get("x-session-id") ?? user.id;
```
Use something stable:

- authenticated user ID
- browser cookie
- signed session token
- conversation ID stored in your DB
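If you have no authenticated user, a cookie is the simplest stable identifier. Here is a minimal sketch for a fetch-style route handler; the cookie name `chat_session` and the helper itself are illustrative, not part of LlamaIndex.

```ts
// Derive a stable session ID from a "chat_session" cookie (name is arbitrary).
function getSessionId(req: Request): { sessionId: string; setCookie?: string } {
  const cookies = req.headers.get("cookie") ?? "";
  const match = cookies.match(/(?:^|;\s*)chat_session=([^;]+)/);
  if (match) return { sessionId: match[1] };

  // First request from this browser: mint an ID once, then ask the client to keep it.
  const sessionId = crypto.randomUUID();
  return {
    sessionId,
    setCookie: `chat_session=${sessionId}; Path=/; HttpOnly; SameSite=Lax`,
  };
}
```

When `setCookie` is present, forward it on the response's `Set-Cookie` header so the browser replays the same ID on every later turn.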
### 2. Your deployment is stateless and you are using local process memory
This happens when you use Node process memory on:
- Vercel serverless functions
- AWS Lambda
- Kubernetes with multiple replicas
- Docker containers that restart often
```ts
// BAD: process-local cache only
const sessions = new Map<string, ChatMemoryBuffer>();
```
That works on your laptop. In production, one instance stores the state and another instance handles the next request.
Fix it with:

- Redis
- Postgres
- DynamoDB
- MongoDB
- any shared durable store
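As one example beyond Redis, here is a rough sketch of a Postgres-backed load/save pair using the `pg` client. The `chat_history` table and the function names are assumptions, and you would still adapt this to whatever chat-store interface your memory class expects.

```ts
// Sketch of a shared durable store in Postgres instead of Redis.
// Assumes: CREATE TABLE chat_history (session_id text PRIMARY KEY, messages jsonb NOT NULL);
import { Pool } from "pg";
import type { ChatMessage } from "llamaindex";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function loadMessages(sessionId: string): Promise<ChatMessage[]> {
  const { rows } = await pool.query(
    "SELECT messages FROM chat_history WHERE session_id = $1",
    [sessionId],
  );
  return rows[0]?.messages ?? [];
}

export async function saveMessages(
  sessionId: string,
  messages: ChatMessage[],
): Promise<void> {
  await pool.query(
    `INSERT INTO chat_history (session_id, messages)
     VALUES ($1, $2::jsonb)
     ON CONFLICT (session_id) DO UPDATE SET messages = EXCLUDED.messages`,
    [sessionId, JSON.stringify(messages)],
  );
}
```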
### 3. You are recreating the agent instead of rehydrating it
Some developers persist messages but rebuild the agent with empty memory every time.
```ts
// BAD
const agent = await createAgent(); // creates fresh memory internally
await agent.chat(input);
```
Instead, load memory first and inject it:
```ts
// GOOD
const memory = await loadMemory(sessionId);
const agent = await createAgent({ memory });
await agent.chat(input);
```
If you use ReActAgent, OpenAIAgent, or similar classes, treat them as stateless wrappers around persisted conversation state.
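To make the "GOOD" snippet concrete, here is a sketch of the rehydrate-then-inject flow. `createAgent` stands in for whatever agent factory your app already uses and is only declared so the sketch type-checks; the memory construction mirrors the `getMemory()` pattern from the fixed handler above.

```ts
// Sketch of rehydrating memory from the shared store before building the agent.
import { ChatMemoryBuffer } from "llamaindex";
import { RedisKVStore } from "./redis-kv-store"; // sketch shown earlier

const kvStore = new RedisKVStore(process.env.REDIS_URL!);

// Placeholder for your real agent factory (ReActAgent, OpenAIAgent, or a custom wrapper).
declare function createAgent(opts: {
  memory: ChatMemoryBuffer;
}): Promise<{ chat(input: string): Promise<unknown> }>;

export async function handleTurn(sessionId: string, input: string) {
  // Rehydrate first: the same session-keyed, store-backed buffer as getMemory() above.
  const memory = ChatMemoryBuffer.fromDefaults({
    tokenLimit: 3000,
    chatStore: kvStore,
    sessionId,
  });

  // Inject it, rather than letting the factory create fresh in-memory state.
  const agent = await createAgent({ memory });
  return agent.chat(input);
}
```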
### 4. Token limits are truncating older messages too aggressively
Sometimes memory is “persisting,” but only the last few turns remain because your window is too small.
```ts
const memory = ChatMemoryBuffer.fromDefaults({
  tokenLimit: 500,
});
```
That can make it look like history vanished after only a couple of exchanges.
Raise the limit carefully:
```ts
const memory = ChatMemoryBuffer.fromDefaults({
  tokenLimit: 4000,
});
```
Or switch to a summary-based approach if long conversations matter more than raw turn-by-turn retention.
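To see why a small window can masquerade as lost memory, here is a rough, framework-free sketch of how a token-limited buffer drops older turns. The 4-characters-per-token estimate is a crude heuristic, not how LlamaIndex counts tokens internally.

```ts
// Framework-free illustration of a token-limited window.
import type { ChatMessage } from "llamaindex";

function approxTokens(message: ChatMessage): number {
  // ~4 characters per token is a rough rule of thumb for English text.
  return Math.ceil(String(message.content).length / 4);
}

function trimToWindow(history: ChatMessage[], tokenLimit: number): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let used = 0;

  // Walk backwards from the newest message and stop once the window is full.
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = approxTokens(history[i]);
    if (used + cost > tokenLimit) break;
    kept.unshift(history[i]);
    used += cost;
  }
  return kept;
}
```

With `tokenLimit: 500`, anything beyond the last couple of exchanges falls outside the window even though every message was persisted correctly.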
## How to Debug It
- Log the session key on every request
  - Print `sessionId`, user ID, and any conversation ID.
  - If it changes between requests, that's your bug.
- Inspect whether messages are actually written to storage
  - After each turn, query Redis/Postgres/your store directly.
  - If nothing is saved, your persistence layer isn't wired correctly.
- Check whether multiple instances are serving different requests
  - Add hostname/pod name to logs.
  - If request A writes state on pod-1 and request B lands on pod-7 with local memory only, history disappears.
- Verify token window behavior
  - Temporarily increase `tokenLimit`.
  - If older turns reappear, you were truncating too early rather than losing persistence.
A good debug log line looks like this:
```ts
console.log({
  sessionId,
  pod: process.env.HOSTNAME,
  tokenLimit: 3000,
});
```
If you see different pods and no shared store, you found the problem.
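A small inspection script makes the storage check concrete. This one assumes the Redis-backed store sketched earlier, with history stored under `chat:<sessionId>`; adapt the key and parsing to however your store actually serializes messages.

```ts
// debug-session.ts: print what is actually stored for one session.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL!);

async function main() {
  const sessionId = process.argv[2];
  const raw = await redis.get(`chat:${sessionId}`);

  console.log({
    sessionId,
    pod: process.env.HOSTNAME, // which instance ran this check
    storedMessages: raw ? (JSON.parse(raw) as unknown[]).length : 0,
  });

  await redis.quit();
}

main();
```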
## Prevention
- Use a persistent backing store from day one. Local `Map()` caches are fine for demos, not for production agents.
- Make session identity explicit. Pass `sessionId` through every layer instead of generating random IDs inside handlers.
- Test restarts and multi-instance routing before shipping. Kill the process mid-conversation and confirm history survives (see the sketch after this list).
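Here is a minimal smoke test for that last point, assuming a deployed chat endpoint that accepts `{ message, sessionId }` and an `x-session-id` header; adjust both to match your API.

```ts
// Restart/multi-instance smoke test. CHAT_URL, the body shape, and the header are assumptions.
const CHAT_URL = process.env.CHAT_URL ?? "http://localhost:3000/api/chat";
const sessionId = "memory-smoke-test";

async function send(message: string) {
  const res = await fetch(CHAT_URL, {
    method: "POST",
    headers: { "content-type": "application/json", "x-session-id": sessionId },
    body: JSON.stringify({ message, sessionId }),
  });
  return res.json();
}

async function main() {
  const turn = process.argv[2];
  if (turn === "1") {
    console.log(await send("My favorite color is teal. Remember that."));
  } else {
    // Run this second turn after restarting the service or forcing a different instance.
    console.log(await send("What is my favorite color?"));
    // If memory really persists, the answer should mention teal.
  }
}

main();
```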
If you’re building anything customer-facing with LlamaIndex TypeScript, assume your app will be restarted and scaled horizontally. Design memory as infrastructure, not as an object living inside a request handler.
## Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist + starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.