How to Fix 'memory not persisting' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: memory-not-persisting, llamaindex, python

If you’re seeing memory not persisting in LlamaIndex, it usually means your chat history, vector memory, or session state is being recreated on every request instead of being loaded from the same storage backend. In practice, this shows up when you restart the process, switch between API calls, or instantiate a new ChatMemoryBuffer, Memory, or StorageContext each time.

The fix is usually not in LlamaIndex itself. It’s almost always in how you wire persistence, session IDs, and storage together.

The Most Common Cause

The #1 cause is creating memory in-process and assuming it will survive across requests.

A common pattern is to build a fresh ChatMemoryBuffer or agent on every call. That works for a single conversation turn, but the next request starts with empty memory because nothing was persisted to disk, Redis, SQLite, or another store.

Broken pattern                 | Fixed pattern
Creates memory every request   | Reuses persisted storage
No persist_dir / no backend    | Loads from the same backend
Memory dies on process restart | Memory survives restarts
# BROKEN: memory exists only in RAM for this request
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.chat_engine import CondensePlusContextChatEngine

def chat_once(query: str):
    # A brand-new, empty memory is created on every call
    memory = ChatMemoryBuffer.from_defaults(token_limit=4000)
    chat_engine = CondensePlusContextChatEngine.from_defaults(
        retriever=retriever,  # assumes a retriever built elsewhere
        memory=memory,
    )
    return chat_engine.chat(query)

# FIXED: persist storage and the chat history, and reuse both across requests
import os

from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.storage.chat_store import SimpleChatStore
from llama_index.core.chat_engine import CondensePlusContextChatEngine

PERSIST_DIR = "./storage"
CHAT_STORE_PATH = "./storage/chat_store.json"

# Load the index once from persisted storage
storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
index = load_index_from_storage(storage_context)

# Load the chat store from disk if it exists, otherwise start fresh
if os.path.exists(CHAT_STORE_PATH):
    chat_store = SimpleChatStore.from_persist_path(persist_path=CHAT_STORE_PATH)
else:
    chat_store = SimpleChatStore()

# Key the memory by a stable session/user identifier
memory = ChatMemoryBuffer.from_defaults(
    token_limit=4000,
    chat_store=chat_store,
    chat_store_key="user-123",  # stable per user/session
)

chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever=index.as_retriever(),
    memory=memory,
)

response = chat_engine.chat("What did I ask earlier?")
print(response)

# Write the conversation back to disk so it survives restarts
chat_store.persist(persist_path=CHAT_STORE_PATH)

If you want persistence across app restarts, you need both:

  • a persisted index or docstore
  • a durable conversation store keyed by user/session

If you only persist the index but recreate memory every time, LlamaIndex will still behave like it forgot the conversation.

Other Possible Causes

1) You’re using the wrong session key

If your app serves multiple users, each user needs a stable session ID. If that key changes between requests, you’ll get a fresh memory object every time.

# BROKEN: random session key per request
import uuid

session_id = str(uuid.uuid4())
memory_store.get(session_id)  # always misses: the key never repeats

# FIXED: stable key from auth/user context
session_id = request.user.id  # or tenant_id + user_id
memory_store.get(session_id)
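
In LlamaIndex terms, the stable key usually becomes the chat_store_key the memory is loaded under. A minimal sketch, assuming a SimpleChatStore already persisted at ./storage/chat_store.json and a session_id supplied by your auth layer:

from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.storage.chat_store import SimpleChatStore

chat_store = SimpleChatStore.from_persist_path(persist_path="./storage/chat_store.json")

# The same stable key always maps to the same conversation bucket
memory = ChatMemoryBuffer.from_defaults(
    token_limit=4000,
    chat_store=chat_store,
    chat_store_key=session_id,
)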

2) You persist the index but not the chat store

A lot of people call index.storage_context.persist() and assume that includes conversation history. It doesn't: that call writes index artifacts (docstore, vector store, index metadata), not the messages your chat engine or agent accumulates.

# BROKEN: only persisting index artifacts
index.storage_context.persist(persist_dir="./storage")

Fix by explicitly persisting the memory backend you use:

  • Redis-backed message store
  • SQLite-backed store
  • custom docstore/message store

Example with a persistent message store:

from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.storage.chat_store import SimpleChatStore

chat_store = SimpleChatStore()
memory = ChatMemoryBuffer.from_defaults(
    token_limit=4000,
    chat_store=chat_store,
    chat_store_key="user-123",
)

# ... run your chat turns ...

# SimpleChatStore can write itself to disk; call this after each turn
chat_store.persist(persist_path="./storage/chat_store.json")
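
If the history has to be shared across processes or machines, swap in a Redis-backed store. A sketch assuming the llama-index-storage-chat-store-redis package is installed and Redis is running locally:

from llama_index.core.memory import ChatMemoryBuffer
from llama_index.storage.chat_store.redis import RedisChatStore

# Messages are written to Redis as the conversation proceeds,
# so there is no separate persist step to forget
chat_store = RedisChatStore(redis_url="redis://localhost:6379", ttl=3600)
memory = ChatMemoryBuffer.from_defaults(
    token_limit=4000,
    chat_store=chat_store,
    chat_store_key="user-123",
)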

3) You are recreating the app worker/process

If you run multiple Uvicorn/Gunicorn workers or deploy on serverless infrastructure, in-memory state disappears between invocations.

uvicorn app:app --workers 4

Each worker has its own RAM. If one request hits worker A and the next hits worker B, your “memory” won’t be there.

Use an external store:

  • Redis for chat/session state
  • Postgres for durable audit/history
  • SQLite only for local development
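
As a concrete example of the Redis option, here is a minimal FastAPI sketch; the endpoint shape and names are illustrative, and it assumes the Redis chat store package from the previous section:

from fastapi import FastAPI
from llama_index.core.llms import ChatMessage
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.storage.chat_store.redis import RedisChatStore

app = FastAPI()

# Shared by every worker: the state lives in Redis, not in worker RAM
chat_store = RedisChatStore(redis_url="redis://localhost:6379")

@app.post("/chat")
def chat(user_id: str, message: str) -> dict:
    # Rebuilding the memory object per request is fine;
    # the history behind it is shared and durable
    memory = ChatMemoryBuffer.from_defaults(
        chat_store=chat_store,
        chat_store_key=user_id,
    )
    memory.put(ChatMessage(role="user", content=message))
    return {"history_len": len(memory.get_all())}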

4) You are mixing old and new LlamaIndex APIs

LlamaIndex has had API changes around memory and agents. Code written against older classes like ServiceContext or older chat engine patterns can fail silently or behave differently after upgrade.

Watch for mismatched imports like:

  • ServiceContext
  • legacy GPTSimpleVectorIndex
  • old agent constructors

Prefer current patterns from llama_index.core and verify versions:

import llama_index
print(llama_index.__version__)

If you upgraded recently and persistence broke right after that, check release notes before chasing phantom bugs.

How to Debug It

  1. Print the actual object identity

    • Confirm whether you are creating a new memory instance on every request.
    • If id(memory) changes between calls, persistence is not your problem yet; object lifetime is.
  2. Inspect where state is stored

    • Check whether you are using RAM-only objects like ChatMemoryBuffer() without a backing store.
    • Look for missing persist_dir, Redis config, or database connection strings.
  3. Verify session keys

    • Log the user/session identifier used to load memory.
    • If it changes between requests, you are writing to one bucket and reading from another.
  4. Restart the process and test again

    • If memory disappears after restart, it was never persisted.
    • If it survives restart but not across users/requests, your keying strategy is wrong.

A simple debug print helps:

print("session_id:", session_id)
print("memory type:", type(memory))
print("storage dir:", PERSIST_DIR)

If you see session_id changing unexpectedly or no persistent backend configured at all, that’s your answer.

Prevention

  • Use an external persistence layer for any conversation state that must survive restarts.
  • Key memory by stable identifiers: user_id, conversation_id, or both.
  • Add a startup check that loads persisted storage and fails fast if files/config are missing.
  • Write one integration test (sketched below) that:
    • sends message one,
    • restarts the process,
    • sends message two,
    • asserts prior context is still available.
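
A hedged version of that test in pytest style; the "restart" is simulated by constructing fresh objects and reloading the store from disk, and the path and key names are illustrative:

from llama_index.core.llms import ChatMessage
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.storage.chat_store import SimpleChatStore

def test_memory_survives_restart(tmp_path):
    path = str(tmp_path / "chat_store.json")

    # Message one: write and persist
    store = SimpleChatStore()
    memory = ChatMemoryBuffer.from_defaults(chat_store=store, chat_store_key="user-1")
    memory.put(ChatMessage(role="user", content="my name is Ada"))
    store.persist(persist_path=path)

    # Simulated restart: fresh objects, state reloaded from disk
    store2 = SimpleChatStore.from_persist_path(persist_path=path)
    memory2 = ChatMemoryBuffer.from_defaults(chat_store=store2, chat_store_key="user-1")

    # Message two would go here; prior context must still be available
    assert any("Ada" in (m.content or "") for m in memory2.get_all())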

The main rule is simple: if it matters after the request ends, don’t keep it only in Python memory.

