# How to Fix 'memory not persisting in production' in LlamaIndex (Python)
## What the Error Means
If memory works locally but resets in production, you usually have one of two problems: the memory object is being recreated on every request, or your deployment is stateless and you never persist chat state anywhere durable.
In LlamaIndex (Python), this often shows up as:

- `ChatMemoryBuffer` starts empty on each request
- `chat_engine.chat(...)` forgets prior turns
- a session works in one worker, then “loses memory” on the next request
## The Most Common Cause
The #1 cause is instantiating memory inside the request handler instead of keeping it tied to a stable session or persistent store.
Here’s the broken pattern:
```python
from fastapi import FastAPI, Request
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.memory import ChatMemoryBuffer

app = FastAPI()
docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)

@app.post("/chat")
async def chat(request: Request):
    payload = await request.json()
    # WRONG: new memory every request
    memory = ChatMemoryBuffer.from_defaults(token_limit=4000)
    chat_engine = index.as_chat_engine(memory=memory)
    response = chat_engine.chat(payload["message"])
    return {"answer": str(response)}
```
And here’s the fixed pattern:
```python
from fastapi import FastAPI, Request
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.memory import ChatMemoryBuffer

app = FastAPI()
docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)

# Better: keep per-session memory outside the request lifecycle
session_memory_store = {}

@app.post("/chat")
async def chat(request: Request):
    payload = await request.json()
    session_id = payload["session_id"]
    if session_id not in session_memory_store:
        session_memory_store[session_id] = ChatMemoryBuffer.from_defaults(
            token_limit=4000
        )
    memory = session_memory_store[session_id]
    chat_engine = index.as_chat_engine(memory=memory)
    response = chat_engine.chat(payload["message"])
    return {"answer": str(response)}
```
The important part is that `ChatMemoryBuffer` is not a global “conversation database.” It is an in-memory buffer. If your app runs behind Gunicorn, Uvicorn workers, Kubernetes replicas, or serverless functions, that buffer disappears whenever the process changes.
## Other Possible Causes

### 1) You are using multiple workers or replicas
If requests from the same user hit different processes, each process has its own memory.
```bash
# Problematic for in-memory chat state
gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker
```
Fix by storing memory in Redis, Postgres, DynamoDB, or another shared backend.
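LlamaIndex ships a Redis-backed chat store for exactly this case, in the separate `llama-index-storage-chat-store-redis` package. Here is a minimal sketch, assuming that package is installed and Redis is reachable at `localhost:6379`; `memory_for` is a hypothetical helper name:

```python
# Sketch: Redis-backed chat memory shared across workers/replicas.
# Assumes llama-index-storage-chat-store-redis is installed and a Redis
# server runs at localhost:6379; memory_for is a hypothetical helper.
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.storage.chat_store.redis import RedisChatStore

chat_store = RedisChatStore(redis_url="redis://localhost:6379", ttl=3600)

def memory_for(session_id: str) -> ChatMemoryBuffer:
    # Every worker that handles this session reads the same Redis key,
    # so the history no longer depends on which process gets the request.
    return ChatMemoryBuffer.from_defaults(
        token_limit=4000,
        chat_store=chat_store,
        chat_store_key=session_id,
    )
```

Because the history lives in Redis rather than process RAM, it no longer matters which worker or replica handles a given request.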
### 2) You are recreating the `StorageContext` without persistence
If you build indexes or stores from scratch every time, your agent can’t recover prior state.
```python
# WRONG: ephemeral storage context
from llama_index.core import StorageContext

storage_context = StorageContext.from_defaults()
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)
```
Persist it:
```python
from llama_index.core import StorageContext, load_index_from_storage

storage_context.persist(persist_dir="./storage")

# later, e.g. on startup
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
```
### 3) You are confusing `memory` with `chat_history`
Some LlamaIndex APIs accept a message history object; others expect a memory implementation. Passing the wrong type can silently degrade behavior or produce errors like:
- `AttributeError: 'list' object has no attribute 'get_all'`
- `TypeError: expected BaseMemory instance`
Bad:
```python
chat_history = []
chat_engine = index.as_chat_engine(chat_history=chat_history)
```
Good:
```python
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=4000)
chat_engine = index.as_chat_engine(memory=memory)
```
### 4) Your app restarts on deploys and you never rehydrate state
A fresh pod means a fresh Python process. If you only keep conversation state in RAM, production deploys will wipe it.
Use a persistent store for conversation state:
```python
# Pseudocode for durable storage
redis.set(f"chat:{session_id}", memory.to_string())

# On startup / per request:
# restore from Redis before calling chat_engine.chat(...)
```
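Concretely, a hedged sketch of that save/restore cycle with `redis-py` might look like this, assuming a LlamaIndex version where `ChatMemoryBuffer` exposes `to_string()` and a matching `from_string()` (the method the pseudocode above uses); `save_memory` and `load_memory` are hypothetical helper names:

```python
# Hedged sketch: manual save/restore of chat memory via redis-py.
# Assumes ChatMemoryBuffer.to_string()/from_string() exist in your
# LlamaIndex version; save_memory/load_memory are hypothetical helpers.
import redis
from llama_index.core.memory import ChatMemoryBuffer

r = redis.Redis()

def save_memory(session_id: str, memory: ChatMemoryBuffer) -> None:
    r.set(f"chat:{session_id}", memory.to_string())

def load_memory(session_id: str) -> ChatMemoryBuffer:
    raw = r.get(f"chat:{session_id}")
    if raw:
        return ChatMemoryBuffer.from_string(raw.decode())
    return ChatMemoryBuffer.from_defaults(token_limit=4000)
```

Call `load_memory(session_id)` before building the chat engine, and `save_memory(session_id, memory)` after each turn.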
## How to Debug It

- **Print the process ID and worker name.** If the same session lands on different PIDs, you have a multi-worker/session-affinity problem.

  ```python
  import os
  print("pid:", os.getpid())
  ```

- **Log whether memory contains prior messages before each turn.** If it’s empty every time, you’re recreating it (a combined helper follows this list).

  ```python
  print(memory.get_all())
  ```

- **Check whether your deployment is stateless.** Look for Kubernetes replicas, autoscaling pods, Lambda-style handlers, or multiple Gunicorn workers. If yes, RAM-backed memory will not persist reliably.

- **Verify you’re using the right LlamaIndex class.** For conversational state, use `ChatMemoryBuffer` or another concrete `BaseMemory`. For persistence across requests/processes, add an external store; don’t assume LlamaIndex keeps it for you.
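A tiny helper that combines the first two checks (`log_turn_state` is a hypothetical name, shown only for illustration):

```python
# Hypothetical debug helper combining the PID and memory checks above.
import os

def log_turn_state(session_id: str, memory) -> None:
    prior = memory.get_all()  # empty on every turn => memory is being recreated
    print(f"pid={os.getpid()} session={session_id} prior_messages={len(prior)}")
```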
## Prevention

- Keep conversation state keyed by `session_id`, not by request.
- Use a shared persistence layer if you run more than one worker or replica.
- Persist indexes and storage contexts explicitly instead of rebuilding them on startup, unless that’s intentional.
- Add a regression test that sends two requests with the same session and asserts the second response sees prior context (see the sketch after this list).
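A hedged sketch of that test, assuming the fixed endpoint above lives in a module named `app` and using FastAPI's `TestClient`. Note that asserting on raw LLM output can be flaky; pin the prompt or mock the LLM if it is:

```python
# Regression test sketch with FastAPI's TestClient. The module name
# "app" and the wording assertion are assumptions, not fixed APIs.
from fastapi.testclient import TestClient

from app import app  # the fixed /chat endpoint shown earlier

client = TestClient(app)

def test_second_request_sees_prior_context():
    session = {"session_id": "regression-test"}
    client.post("/chat", json={**session, "message": "My name is Ada."})
    resp = client.post("/chat", json={**session, "message": "What is my name?"})
    assert "ada" in resp.json()["answer"].lower()
```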
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies

By Cyprian Aarons, AI Consultant at Topiax.