How to Fix 'memory not persisting when scaling' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21

When "memory not persisting when scaling" shows up in a LlamaIndex app, it usually means your chat state lives only inside one Python process. The app works with a single worker, then fails as soon as you add replicas, restart a pod, or route the next request to a different instance.

In practice, this is almost never a LlamaIndex bug. It’s usually a deployment problem: in-memory chat history, session state, or ChatMemoryBuffer data is not shared across workers.

The Most Common Cause

The #1 cause is storing conversation memory in process-local Python objects. That works locally, but breaks when your app scales horizontally because each worker has its own memory.

Typical symptoms:

  • First request works
  • Follow-up request loses context
  • Logs show new ChatMemoryBuffer instances being created
  • In Kubernetes or Gunicorn, behavior changes depending on which pod handles the request

Here’s the broken pattern and the fixed pattern side by side:

# BROKEN: memory is local to one process
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.memory import ChatMemoryBuffer

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

def chat(user_id: str, message: str):
    # New memory every call = no persistence across requests
    memory = ChatMemoryBuffer.from_defaults(token_limit=3000)

    chat_engine = index.as_chat_engine(memory=memory)
    return chat_engine.chat(message)

# FIXED: persist memory outside the worker process
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.memory import ChatMemoryBuffer

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

# Example: load/save serialized memory via Redis, Postgres, or another shared store.
# Assumes redis_client = redis.Redis(decode_responses=True), so get() returns str.
def load_memory(user_id: str) -> ChatMemoryBuffer:
    raw = redis_client.get(f"chat:{user_id}")
    if raw:
        return ChatMemoryBuffer.from_string(raw)
    return ChatMemoryBuffer.from_defaults(token_limit=3000)

def save_memory(user_id: str, memory: ChatMemoryBuffer) -> None:
    # to_string()/from_string() round-trip the buffer as JSON
    redis_client.set(f"chat:{user_id}", memory.to_string())

def chat(user_id: str, message: str):
    memory = load_memory(user_id)

    chat_engine = index.as_chat_engine(memory=memory)
    response = chat_engine.chat(message)

    save_memory(user_id, memory)
    return response

The key point: ChatMemoryBuffer is not magic persistence. It’s an in-process buffer unless you explicitly back it with shared storage.

Other Possible Causes

1) You’re recreating the engine on every request

If you build index.as_chat_engine() inside your route handler without restoring state, each request starts clean.

# Bad
@app.post("/chat")
def chat(req: ChatRequest):
    engine = index.as_chat_engine()
    return engine.chat(req.message)

# Better
engine = index.as_chat_engine()

@app.post("/chat")
def chat(req: ChatRequest):
    return engine.chat(req.message)

If you need per-user memory, don’t share one global engine blindly. Share the index, but restore memory per session.
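
A minimal sketch of that split, reusing the hypothetical load_memory/save_memory helpers from the fixed example above and assuming ChatRequest carries a user_id field:

@app.post("/chat")
def chat(req: ChatRequest):
    # The index is shared and read-only across requests;
    # memory is restored per user from the shared store
    memory = load_memory(req.user_id)

    engine = index.as_chat_engine(memory=memory)
    response = engine.chat(req.message)

    save_memory(req.user_id, memory)
    return {"response": str(response)}

The index is expensive to build and safe to share; the memory is cheap to rebuild and must stay per-session. Keeping that asymmetry in mind is most of the fix.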

2) Your load balancer is sending requests to different pods

This happens a lot behind Kubernetes ingress or AWS ALB. One request hits pod A, the next hits pod B, and pod B has no idea what pod A stored in RAM.

# Example symptoms in your Kubernetes config
replicas: 3            # Deployment: multiple pods
sessionAffinity: None  # Service: no sticky routing

Fix options:

  • Use Redis/Postgres for conversation state
  • Enable sticky sessions only as a temporary workaround (see the Service snippet below)
  • Don’t rely on pod-local RAM for user memory
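
If you do reach for the sticky-session stopgap, Kubernetes exposes it on the Service (the chat-app name below is a placeholder):

apiVersion: v1
kind: Service
metadata:
  name: chat-app
spec:
  selector:
    app: chat-app
  ports:
    - port: 80
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800

Traffic from a given client IP keeps hitting the same pod until the timeout expires. This hides the problem rather than fixing it: restarts and scale-downs still lose memory.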

3) You’re using StorageContext incorrectly

A common misunderstanding is assuming StorageContext persistence covers chat memory automatically. It does not.

# This persists the index/vector store, not your runtime chat session.
storage_context.persist(persist_dir="./storage")

If you want durable conversation state:

  • Persist the vector store with StorageContext
  • Persist chat history separately in your own store

These are different layers.
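
To keep the layers straight, here's a minimal sketch: LlamaIndex reloads the persisted index, while chat history comes from your own store (redis_client and user_id are placeholders):

from llama_index.core import StorageContext, load_index_from_storage

# Layer 1: index persistence (LlamaIndex handles this)
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

# Layer 2: conversation persistence (your store handles this)
raw = redis_client.get(f"chat:{user_id}")  # restore before each chat turn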

4) Your app restarts between requests

Serverless functions and auto-scaled containers can cold start at any time. If your code relies on module-level variables like this:

CHAT_MEMORY = {}

you will lose everything on restart.

Use a shared backend instead:

class RedisChatStore:
    def get(self, user_id: str): ...
    def set(self, user_id: str, value): ...

Then reconstruct ChatMemoryBuffer from stored history on each request.
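
A minimal reconstruction sketch, assuming chat turns are stored as a JSON list of role/content dicts and redis_client is configured with decode_responses=True (both placeholders):

import json

from llama_index.core.llms import ChatMessage
from llama_index.core.memory import ChatMemoryBuffer

def memory_for(user_id: str) -> ChatMemoryBuffer:
    # Restore prior turns from the shared store, if any exist
    raw = redis_client.get(f"chat:{user_id}")
    history = [ChatMessage(**m) for m in json.loads(raw)] if raw else []
    return ChatMemoryBuffer.from_defaults(token_limit=3000, chat_history=history)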

How to Debug It

  1. Check whether memory disappears only after scaling

    • Run locally with one worker.
    • Then run with multiple workers:
      gunicorn app:app --workers 4
      
    • If it breaks only now, you have process-local state.
  2. Log the worker identity

    • Print PID and hostname on every request:
      import os, socket
      
      print("pid=", os.getpid(), "host=", socket.gethostname())
      
    • If follow-up turns hit different workers and context vanishes, that’s your answer.
  3. Inspect where ChatMemoryBuffer is created

    • Search for:
      • ChatMemoryBuffer.from_defaults(...)
      • index.as_chat_engine(...)
      • chat_engine = ...
    • If these are inside request handlers without restore logic, you’re rebuilding state every call.
  4. Verify persistence layer separately

    • Save a test payload to Redis/Postgres.
    • Read it back before constructing LlamaIndex objects.
    • If that fails, the bug is in your storage layer, not LlamaIndex (a round-trip check is sketched below).
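
For example, a minimal round-trip check with the same placeholder redis_client:

import json

payload = {"role": "user", "content": "ping"}
redis_client.set("chat:healthcheck", json.dumps(payload))

# If this fails, fix the storage layer before touching LlamaIndex code
assert json.loads(redis_client.get("chat:healthcheck")) == payload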

Prevention

  • Keep LlamaIndex indexes and retrieval logic separate from session memory.
  • Store conversation history in Redis, Postgres, DynamoDB, or another shared system.
  • Treat any object created inside a web request as disposable unless it’s backed by persistent storage.

If you want scaling-safe behavior with LlamaIndex Python apps, the rule is simple: keep embeddings and indexes persistent; keep chat history external; rebuild runtime objects from durable state on each request.


By Cyprian Aarons, AI Consultant at Topiax.