How to Fix 'memory not persisting when scaling' in LlamaIndex (Python)
When “memory not persisting when scaling” shows up in a LlamaIndex app, it usually means your chat state lives only inside one Python process. The app works on a single worker, then fails as soon as you add multiple replicas, restart the pod, or route the next request to a different instance.
In practice, this is almost never a LlamaIndex bug. It’s usually a deployment problem: in-memory chat history, session state, or ChatMemoryBuffer data is not shared across workers.
The Most Common Cause
The #1 cause is storing conversation memory in process-local Python objects. That works locally, but breaks when your app scales horizontally because each worker has its own memory.
Typical symptoms:
- First request works
- Follow-up request loses context
- Logs show new ChatMemoryBuffer instances being created
- In Kubernetes or Gunicorn, behavior changes depending on which pod handles the request
Here’s the broken pattern and the fixed pattern side by side:
```python
# BROKEN: memory is local to one process
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.memory import ChatMemoryBuffer

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

def chat(user_id: str, message: str):
    # New memory every call = no persistence across requests
    memory = ChatMemoryBuffer.from_defaults(token_limit=3000)
    chat_engine = index.as_chat_engine(memory=memory)
    return chat_engine.chat(message)
```
```python
# FIXED: persist memory outside the worker process
import json

import redis
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.llms import ChatMessage
from llama_index.core.memory import ChatMemoryBuffer

# Any shared store works; Redis is used here as the example backend.
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

# Example: load/save from Redis, Postgres, or another shared store
def load_history(user_id: str) -> list[ChatMessage]:
    raw = redis_client.get(f"chat:{user_id}")
    if not raw:
        return []
    return [ChatMessage(role=m["role"], content=m["content"]) for m in json.loads(raw)]

def save_history(user_id: str, messages: list[ChatMessage]) -> None:
    payload = [{"role": m.role.value, "content": m.content} for m in messages]
    redis_client.set(f"chat:{user_id}", json.dumps(payload))

def chat(user_id: str, message: str):
    memory = ChatMemoryBuffer.from_defaults(
        token_limit=3000,
        chat_history=load_history(user_id),
    )
    chat_engine = index.as_chat_engine(memory=memory)
    response = chat_engine.chat(message)
    save_history(user_id, memory.get_all())  # the buffer now includes the new turn
    return response
```
The key point: ChatMemoryBuffer is not magic persistence. It’s an in-process buffer unless you explicitly back it with shared storage.
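If you don't want to hand-roll that serialization, LlamaIndex can also back the buffer with a shared chat store. A minimal sketch, assuming the separate llama-index-storage-chat-store-redis integration package is installed (verify the import path and constructor arguments against your installed version):

```python
# Sketch: back ChatMemoryBuffer with a shared chat store instead of manual save/load.
# Assumes the llama-index-storage-chat-store-redis integration package.
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.storage.chat_store.redis import RedisChatStore

chat_store = RedisChatStore(redis_url="redis://localhost:6379")

def build_memory(user_id: str) -> ChatMemoryBuffer:
    # Every worker reconstructs the same buffer from Redis, keyed per user.
    return ChatMemoryBuffer.from_defaults(
        token_limit=3000,
        chat_store=chat_store,
        chat_store_key=f"chat:{user_id}",
    )
```

With a store-backed buffer, any replica can continue the conversation, since reads and writes go through Redis rather than process RAM.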
Other Possible Causes
1) You’re recreating the engine on every request
If you build index.as_chat_engine() inside your route handler without restoring state, each request starts clean.
```python
# Bad
@app.post("/chat")
def chat(req: ChatRequest):
    engine = index.as_chat_engine()
    return engine.chat(req.message)
```

```python
# Better
engine = index.as_chat_engine()

@app.post("/chat")
def chat(req: ChatRequest):
    return engine.chat(req.message)
```
If you need per-user memory, don’t share one global engine blindly. Share the index, but restore memory per session.
2) Your load balancer is sending requests to different pods
This happens a lot behind Kubernetes ingress or AWS ALB. One request hits pod A, the next hits pod B, and pod B has no idea what pod A stored in RAM.
```yaml
# Example symptom in deployment config
replicas: 3
sessionAffinity: None
```
Fix options:
- Use Redis/Postgres for conversation state
- Enable sticky sessions only as a temporary workaround (see the Service example after this list)
- Don’t rely on pod-local RAM for user memory
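For reference, sticky sessions in Kubernetes are set on the Service, not the Deployment. A rough example (the name, labels, and ports are hypothetical, and this is a stopgap, not a fix):

```yaml
# Temporary workaround: pin each client IP to one pod via the Service.
apiVersion: v1
kind: Service
metadata:
  name: llamaindex-chat        # hypothetical name
spec:
  selector:
    app: llamaindex-chat       # hypothetical pod label
  ports:
    - port: 80
      targetPort: 8000
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600
```

If traffic arrives through a proxy or ALB, the Service may only see the proxy's IP, which is one more reason to treat affinity as temporary.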
3) You’re using StorageContext incorrectly
A common misunderstanding is assuming StorageContext persistence covers chat memory automatically. It does not.
```python
# This persists the index/vector store, not your runtime chat session.
storage_context.persist(persist_dir="./storage")
```
If you want durable conversation state:
- Persist the vector store with StorageContext
- Persist chat history separately in your own store
These are different layers.
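For the index layer, the persist-and-reload pattern looks roughly like this (standard llama_index.core APIs; ./storage is just an example path, and a local directory only helps across pods if it sits on shared storage):

```python
# Persist and reload the index itself (not chat history) across restarts.
from llama_index.core import StorageContext, load_index_from_storage

# At build time: write the index to disk (or a remote docstore/vector store).
index.storage_context.persist(persist_dir="./storage")

# At startup on any worker: reload the same index from that storage.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
```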
4) Your app restarts between requests
Serverless functions and auto-scaled containers can cold start at any time. If your code relies on module-level variables like this:
```python
CHAT_MEMORY = {}
```
you will lose everything on restart.
Use a shared backend instead:
```python
class RedisChatStore:
    def get(self, user_id: str): ...
    def set(self, user_id: str, value): ...
```
Then reconstruct ChatMemoryBuffer from stored history on each request.
How to Debug It
- Check whether memory disappears only after scaling
  - Run locally with one worker.
  - Then run with multiple workers: gunicorn app:app --workers 4
  - If it breaks only now, you have process-local state.
- Log the worker identity
  - Print the PID and hostname on every request (see the middleware sketch after this list): import os, socket; print("pid=", os.getpid(), "host=", socket.gethostname())
  - If follow-up turns hit different workers and context vanishes, that’s your answer.
- Inspect where ChatMemoryBuffer is created
  - Search for ChatMemoryBuffer.from_defaults(...), index.as_chat_engine(...), and chat_engine = ...
  - If these are inside request handlers without restore logic, you’re rebuilding state every call.
- Verify the persistence layer separately
  - Save a test payload to Redis/Postgres.
  - Read it back before constructing LlamaIndex objects.
  - If that fails, the bug is in your storage layer, not LlamaIndex.
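A minimal sketch of the worker-identity logging, assuming a FastAPI app (the middleware is illustrative; a plain print at the top of your route works too):

```python
# Sketch: log which process/host handles each request (assumes FastAPI).
import os
import socket

from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def log_worker_identity(request: Request, call_next):
    # If consecutive turns of one conversation show different pids/hosts,
    # process-local memory cannot survive between them.
    print(f"pid={os.getpid()} host={socket.gethostname()} path={request.url.path}")
    return await call_next(request)
```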
Prevention
- Keep LlamaIndex indexes and retrieval logic separate from session memory.
- Store conversation history in Redis, Postgres, DynamoDB, or another shared system.
- Treat any object created inside a web request as disposable unless it’s backed by persistent storage.
If you want scaling-safe behavior with LlamaIndex Python apps, the rule is simple: keep embeddings and indexes persistent; keep chat history external; rebuild runtime objects from durable state on each request.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.