How to Fix 'memory not persisting when scaling' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21

When "memory not persisting when scaling" shows up in a LlamaIndex app, it usually means your chat state lives only inside one Python process. The app works with a single worker, then fails as soon as you add replicas, restart a pod, or route the next request to a different instance.

In practice, this is almost never a LlamaIndex bug. It’s usually a deployment problem: in-memory chat history, session state, or ChatMemoryBuffer data is not shared across workers.

The Most Common Cause

The #1 cause is storing conversation memory in process-local Python objects. That works locally, but breaks when your app scales horizontally because each worker has its own memory.

Typical symptoms:

  • First request works
  • Follow-up request loses context
  • Logs show new ChatMemoryBuffer instances being created
  • In Kubernetes or Gunicorn, behavior changes depending on which pod handles the request

Here’s the broken pattern and the fixed pattern side by side:

# BROKEN: memory is local to one process
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.memory import ChatMemoryBuffer

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

def chat(user_id: str, message: str):
    # New memory every call = no persistence across requests
    memory = ChatMemoryBuffer.from_defaults(token_limit=3000)

    chat_engine = index.as_chat_engine(memory=memory)
    return chat_engine.chat(message)

# FIXED: persist memory outside the worker process
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.memory import ChatMemoryBuffer

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

# Example: load/save serialized memory via Redis, Postgres, or another shared store.
# Assumes redis_client = redis.Redis(decode_responses=True), so get() returns str.
def load_memory(user_id: str) -> ChatMemoryBuffer:
    raw = redis_client.get(f"chat:{user_id}")
    if raw:
        return ChatMemoryBuffer.from_string(raw)
    return ChatMemoryBuffer.from_defaults(token_limit=3000)

def save_memory(user_id: str, memory: ChatMemoryBuffer) -> None:
    # to_string()/from_string() round-trip the buffer as JSON
    redis_client.set(f"chat:{user_id}", memory.to_string())

def chat(user_id: str, message: str):
    memory = load_memory(user_id)

    chat_engine = index.as_chat_engine(memory=memory)
    response = chat_engine.chat(message)

    save_memory(user_id, memory)
    return response

The key point: ChatMemoryBuffer is not magic persistence. It’s an in-process buffer unless you explicitly back it with shared storage.

Other Possible Causes

1) You’re recreating the engine on every request

If you build index.as_chat_engine() inside your route handler without restoring state, each request starts clean.

# Bad
@app.post("/chat")
def chat(req: ChatRequest):
    engine = index.as_chat_engine()
    return engine.chat(req.message)

# Better
engine = index.as_chat_engine()

@app.post("/chat")
def chat(req: ChatRequest):
    return engine.chat(req.message)

If you need per-user memory, don’t share one global engine blindly. Share the index, but restore memory per session.
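
A minimal sketch of that split, reusing the hypothetical load_memory/save_memory helpers from the fixed example above and assuming ChatRequest carries a user_id field:

@app.post("/chat")
def chat(req: ChatRequest):
    # The index is shared and read-only across requests;
    # memory is restored per user from the shared store
    memory = load_memory(req.user_id)

    engine = index.as_chat_engine(memory=memory)
    response = engine.chat(req.message)

    save_memory(req.user_id, memory)
    return {"response": str(response)}

The index is expensive to build and safe to share; the memory is cheap to rebuild and must stay per-session. Keeping that asymmetry in mind is most of the fix.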

2) Your load balancer is sending requests to different pods

This happens a lot behind Kubernetes ingress or AWS ALB. One request hits pod A, the next hits pod B, and pod B has no idea what pod A stored in RAM.

# Example symptoms in your Kubernetes config
replicas: 3            # Deployment: multiple pods
sessionAffinity: None  # Service: no sticky routing

Fix options:

  • Use Redis/Postgres for conversation state
  • Enable sticky sessions only as a temporary workaround (see the Service snippet below)
  • Don’t rely on pod-local RAM for user memory
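
If you do reach for the sticky-session stopgap, Kubernetes exposes it on the Service (the chat-app name below is a placeholder):

apiVersion: v1
kind: Service
metadata:
  name: chat-app
spec:
  selector:
    app: chat-app
  ports:
    - port: 80
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800

Traffic from a given client IP keeps hitting the same pod until the timeout expires. This hides the problem rather than fixing it: restarts and scale-downs still lose memory.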

3) You’re using StorageContext incorrectly

A common misunderstanding is assuming StorageContext persistence covers chat memory automatically. It does not.

# This persists the index/vector store, not your runtime chat session.
storage_context.persist(persist_dir="./storage")

If you want durable conversation state:

  • Persist the vector store with StorageContext
  • Persist chat history separately in your own store

These are different layers.
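
To keep the layers straight, here's a minimal sketch: LlamaIndex reloads the persisted index, while chat history comes from your own store (redis_client and user_id are placeholders):

from llama_index.core import StorageContext, load_index_from_storage

# Layer 1: index persistence (LlamaIndex handles this)
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

# Layer 2: conversation persistence (your store handles this)
raw = redis_client.get(f"chat:{user_id}")  # restore before each chat turn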

4) Your app restarts between requests

Serverless functions and auto-scaled containers can cold start at any time. If your code relies on module-level variables like this:

CHAT_MEMORY = {}

you will lose everything on restart.

Use a shared backend instead:

class RedisChatStore:
    def get(self, user_id: str): ...
    def set(self, user_id: str, value): ...

Then reconstruct ChatMemoryBuffer from stored history on each request.
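
A minimal reconstruction sketch, assuming chat turns are stored as a JSON list of role/content dicts and redis_client is configured with decode_responses=True (both placeholders):

import json

from llama_index.core.llms import ChatMessage
from llama_index.core.memory import ChatMemoryBuffer

def memory_for(user_id: str) -> ChatMemoryBuffer:
    # Restore prior turns from the shared store, if any exist
    raw = redis_client.get(f"chat:{user_id}")
    history = [ChatMessage(**m) for m in json.loads(raw)] if raw else []
    return ChatMemoryBuffer.from_defaults(token_limit=3000, chat_history=history)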

How to Debug It

  1. Check whether memory disappears only after scaling

    • Run locally with one worker.
    • Then run with multiple workers:
      gunicorn app:app --workers 4
      
    • If it breaks only now, you have process-local state.
  2. Log the worker identity

    • Print PID and hostname on every request:
      import os, socket
      
      print("pid=", os.getpid(), "host=", socket.gethostname())
      
    • If follow-up turns hit different workers and context vanishes, that’s your answer.
  3. Inspect where ChatMemoryBuffer is created

    • Search for:
      • ChatMemoryBuffer.from_defaults(...)
      • index.as_chat_engine(...)
      • chat_engine = ...
    • If these are inside request handlers without restore logic, you’re rebuilding state every call.
  4. Verify persistence layer separately

    • Save a test payload to Redis/Postgres.
    • Read it back before constructing LlamaIndex objects.
    • If that fails, the bug is in your storage layer, not LlamaIndex (a round-trip check is sketched below).
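
For example, a minimal round-trip check with the same placeholder redis_client:

import json

payload = {"role": "user", "content": "ping"}
redis_client.set("chat:healthcheck", json.dumps(payload))

# If this fails, fix the storage layer before touching LlamaIndex code
assert json.loads(redis_client.get("chat:healthcheck")) == payload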

Prevention

  • Keep LlamaIndex indexes and retrieval logic separate from session memory.
  • Store conversation history in Redis, Postgres, DynamoDB, or another shared system.
  • Treat any object created inside a web request as disposable unless it’s backed by persistent storage.

If you want scaling-safe behavior with LlamaIndex Python apps, the rule is simple: keep embeddings and indexes persistent; keep chat history external; rebuild runtime objects from durable state on each request.


By Cyprian Aarons, AI Consultant at Topiax.