# How to Fix 'memory not persisting in production' in LlamaIndex (Python)
## What the Error Means
If memory works locally but resets in production, you usually have one of two problems: the memory object is being recreated on every request, or your deployment is stateless and you never persist chat state anywhere durable.
In LlamaIndex (Python), this often shows up as:

- `ChatMemoryBuffer` starts empty on each request
- `chat_engine.chat(...)` forgets prior turns
- a session works in one worker, then “loses memory” on the next request
## The Most Common Cause
The #1 cause is instantiating memory inside the request handler instead of keeping it tied to a stable session or persistent store.
Here’s the broken pattern:
```python
from fastapi import FastAPI, Request
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.memory import ChatMemoryBuffer

app = FastAPI()
docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)

@app.post("/chat")
async def chat(request: Request):
    payload = await request.json()
    # WRONG: new memory every request
    memory = ChatMemoryBuffer.from_defaults(token_limit=4000)
    chat_engine = index.as_chat_engine(memory=memory)
    response = chat_engine.chat(payload["message"])
    return {"answer": str(response)}
```
And here’s the fixed pattern:
```python
from fastapi import FastAPI, Request
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.memory import ChatMemoryBuffer

app = FastAPI()
docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)

# Better: keep per-session memory outside the request lifecycle
session_memory_store = {}

@app.post("/chat")
async def chat(request: Request):
    payload = await request.json()
    session_id = payload["session_id"]
    if session_id not in session_memory_store:
        session_memory_store[session_id] = ChatMemoryBuffer.from_defaults(
            token_limit=4000
        )
    memory = session_memory_store[session_id]
    chat_engine = index.as_chat_engine(memory=memory)
    response = chat_engine.chat(payload["message"])
    return {"answer": str(response)}
```
The important part is that `ChatMemoryBuffer` is not a global “conversation database.” It is an in-memory buffer. If your app runs behind Gunicorn, Uvicorn workers, Kubernetes replicas, or serverless functions, that buffer disappears whenever the process changes.
## Other Possible Causes

### 1) You are using multiple workers or replicas
If requests from the same user hit different processes, each process has its own memory.
```bash
# Problematic for in-memory chat state
gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker
```
Fix by storing memory in Redis, Postgres, DynamoDB, or another shared backend.
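LlamaIndex ships a Redis-backed chat store for exactly this case, in the separate `llama-index-storage-chat-store-redis` package. Here is a minimal sketch, assuming that package is installed and Redis is reachable at `localhost:6379`; `memory_for` is a hypothetical helper name:

```python
# Sketch: Redis-backed chat memory shared across workers/replicas.
# Assumes llama-index-storage-chat-store-redis is installed and a Redis
# server runs at localhost:6379; memory_for is a hypothetical helper.
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.storage.chat_store.redis import RedisChatStore

chat_store = RedisChatStore(redis_url="redis://localhost:6379", ttl=3600)

def memory_for(session_id: str) -> ChatMemoryBuffer:
    # Every worker that handles this session reads the same Redis key,
    # so the history no longer depends on which process gets the request.
    return ChatMemoryBuffer.from_defaults(
        token_limit=4000,
        chat_store=chat_store,
        chat_store_key=session_id,
    )
```

Because the history lives in Redis rather than process RAM, it no longer matters which worker or replica handles a given request.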
### 2) You are recreating the `StorageContext` without persistence
If you build indexes or stores from scratch every time, your agent can’t recover prior state.
```python
# WRONG: ephemeral storage context
from llama_index.core import StorageContext

storage_context = StorageContext.from_defaults()
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)
```
Persist it:
```python
from llama_index.core import StorageContext, load_index_from_storage

storage_context.persist(persist_dir="./storage")

# later, e.g. on startup
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
```
### 3) You are confusing `memory` with `chat_history`
Some LlamaIndex APIs accept a message history object; others expect a memory implementation. Passing the wrong type can silently degrade behavior or produce errors like:
- `AttributeError: 'list' object has no attribute 'get_all'`
- `TypeError: expected BaseMemory instance`
Bad:
```python
chat_history = []
chat_engine = index.as_chat_engine(chat_history=chat_history)
```
Good:
```python
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=4000)
chat_engine = index.as_chat_engine(memory=memory)
```
### 4) Your app restarts on deploys and you never rehydrate state
A fresh pod means a fresh Python process. If you only keep conversation state in RAM, production deploys will wipe it.
Use a persistent store for conversation state:
```python
# Pseudocode for durable storage
redis.set(f"chat:{session_id}", memory.to_string())

# On startup / per request:
# restore from Redis before calling chat_engine.chat(...)
```
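Concretely, a hedged sketch of that save/restore cycle with `redis-py` might look like this, assuming a LlamaIndex version where `ChatMemoryBuffer` exposes `to_string()` and a matching `from_string()` (the method the pseudocode above uses); `save_memory` and `load_memory` are hypothetical helper names:

```python
# Hedged sketch: manual save/restore of chat memory via redis-py.
# Assumes ChatMemoryBuffer.to_string()/from_string() exist in your
# LlamaIndex version; save_memory/load_memory are hypothetical helpers.
import redis
from llama_index.core.memory import ChatMemoryBuffer

r = redis.Redis()

def save_memory(session_id: str, memory: ChatMemoryBuffer) -> None:
    r.set(f"chat:{session_id}", memory.to_string())

def load_memory(session_id: str) -> ChatMemoryBuffer:
    raw = r.get(f"chat:{session_id}")
    if raw:
        return ChatMemoryBuffer.from_string(raw.decode())
    return ChatMemoryBuffer.from_defaults(token_limit=4000)
```

Call `load_memory(session_id)` before building the chat engine, and `save_memory(session_id, memory)` after each turn.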
## How to Debug It

- **Print the process ID and worker name.** If the same session lands on different PIDs, you have a multi-worker/session-affinity problem.

  ```python
  import os
  print("pid:", os.getpid())
  ```

- **Log whether memory contains prior messages before each turn.** If it’s empty every time, you’re recreating it (a combined helper follows this list).

  ```python
  print(memory.get_all())
  ```

- **Check whether your deployment is stateless.** Look for Kubernetes replicas, autoscaling pods, Lambda-style handlers, or multiple Gunicorn workers. If yes, RAM-backed memory will not persist reliably.

- **Verify you’re using the right LlamaIndex class.** For conversational state, use `ChatMemoryBuffer` or another concrete `BaseMemory`. For persistence across requests/processes, add an external store; don’t assume LlamaIndex keeps it for you.
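A tiny helper that combines the first two checks (`log_turn_state` is a hypothetical name, shown only for illustration):

```python
# Hypothetical debug helper combining the PID and memory checks above.
import os

def log_turn_state(session_id: str, memory) -> None:
    prior = memory.get_all()  # empty on every turn => memory is being recreated
    print(f"pid={os.getpid()} session={session_id} prior_messages={len(prior)}")
```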
## Prevention

- Keep conversation state keyed by `session_id`, not by request.
- Use a shared persistence layer if you run more than one worker or replica.
- Persist indexes and storage contexts explicitly instead of rebuilding them on startup, unless that’s intentional.
- Add a regression test that sends two requests with the same session and asserts the second response sees prior context (see the sketch after this list).
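A hedged sketch of that test, assuming the fixed endpoint above lives in a module named `app` and using FastAPI's `TestClient`. Note that asserting on raw LLM output can be flaky; pin the prompt or mock the LLM if it is:

```python
# Regression test sketch with FastAPI's TestClient. The module name
# "app" and the wording assertion are assumptions, not fixed APIs.
from fastapi.testclient import TestClient

from app import app  # the fixed /chat endpoint shown earlier

client = TestClient(app)

def test_second_request_sees_prior_context():
    session = {"session_id": "regression-test"}
    client.post("/chat", json={**session, "message": "My name is Ada."})
    resp = client.post("/chat", json={**session, "message": "What is my name?"})
    assert "ada" in resp.json()["answer"].lower()
```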
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies

By Cyprian Aarons, AI Consultant at Topiax.