How to Fix 'OOM error during inference when scaling' in LlamaIndex (Python)
When you see an out-of-memory (OOM) error during inference in a LlamaIndex app, it usually means your process ran out of memory while the model was generating or embedding at higher concurrency. In practice, this shows up when you move from one request at a time to multiple parallel queries, larger documents, or bigger context windows.
The failure is often not in LlamaIndex itself. It’s usually the way the index, retriever, or LLM is being used under load.
The Most Common Cause
The #1 cause is loading too much into memory per request, then multiplying that by concurrency.
In LlamaIndex, this usually happens when people:
- build the index from a huge document set in-process
- use a large `Settings.chunk_size`
- run multiple `.query()` or `.achat()` calls in parallel
- keep `Response` objects, source nodes, and full chat history around longer than needed
Here’s the broken pattern:
| Broken | Fixed |
|---|---|
| Build index + query in the same hot path | Build once, persist it, reuse it |
| Large chunks and high top-k | Smaller chunks and tighter retrieval |
| Parallel requests without limits | Bound concurrency |
```python
# BROKEN: reloading and rebuilding on every request
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

def answer_question(question: str):
    docs = SimpleDirectoryReader("./docs").load_data()
    index = VectorStoreIndex.from_documents(docs)  # expensive and memory-heavy
    query_engine = index.as_query_engine(similarity_top_k=20)
    response = query_engine.query(question)
    return str(response)
```
```python
# FIXED: build once, persist, and reuse
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

def build_index_once():
    docs = SimpleDirectoryReader("./docs").load_data()
    index = VectorStoreIndex.from_documents(docs)
    index.storage_context.persist(persist_dir="./storage")

def get_query_engine():
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)
    return index.as_query_engine(similarity_top_k=5)

query_engine = get_query_engine()

def answer_question(question: str):
    response = query_engine.query(question)
    return str(response)
```
If you’re using an API server like FastAPI or Flask, rebuilding inside each request handler is the fastest way to trigger OOM under load.
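One lightweight way to guarantee build-once behavior in any server is to memoize the engine factory with `functools.lru_cache`. A minimal, self-contained sketch, where the hypothetical `_build_engine` stands in for `load_index_from_storage(...).as_query_engine()`:

```python
from functools import lru_cache

build_count = 0  # counts how many times the expensive setup actually runs

def _build_engine():
    # Stand-in for load_index_from_storage(...).as_query_engine()
    global build_count
    build_count += 1
    return {"name": "query_engine"}

@lru_cache(maxsize=1)
def get_query_engine():
    """First call builds the engine; every later call reuses the cached one."""
    return _build_engine()

for _ in range(100):  # simulate 100 request handlers asking for the engine
    engine = get_query_engine()

print(build_count)  # the expensive build ran exactly once
```

The same effect can be had with a module-level global, but the cached factory keeps the expensive work out of your request handlers without changing their call sites.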
Other Possible Causes
1. Too many concurrent inference calls
If you fan out requests without a cap, memory spikes fast. This is common with async workers or background jobs.
```python
# risky: no concurrency control
results = await asyncio.gather(*[query_engine.aquery(q) for q in questions])
```
Use a semaphore:
```python
import asyncio

sem = asyncio.Semaphore(2)

async def limited_query(q):
    async with sem:
        return await query_engine.aquery(q)

results = await asyncio.gather(*[limited_query(q) for q in questions])
```
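To convince yourself the cap actually holds, you can count in-flight tasks. A self-contained sketch where `fake_query` is a stand-in for `query_engine.aquery`:

```python
import asyncio

in_flight = 0
peak = 0  # highest number of simultaneous queries observed

async def fake_query(sem: asyncio.Semaphore, q: str) -> str:
    global in_flight, peak
    async with sem:
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for query_engine.aquery(q)
        in_flight -= 1
        return f"answer:{q}"

async def main() -> list:
    sem = asyncio.Semaphore(2)  # at most 2 queries run at once
    return await asyncio.gather(*(fake_query(sem, str(i)) for i in range(10)))

results = asyncio.run(main())
print(peak)  # 2: bounded by the semaphore, even with 10 tasks queued
```

Tune the semaphore value empirically: the right cap is the largest concurrency your memory budget survives, not a universal constant.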
2. Embedding large batches at once
Bulk ingestion can OOM before inference even starts. SentenceSplitter, embedding models, and vector stores all consume RAM during ingestion.
```python
# risky: huge batch ingestion
nodes = splitter.get_nodes_from_documents(docs)  # massive docs => huge node list
index = VectorStoreIndex(nodes)
```
Prefer smaller batches, and grow one index incrementally instead of rebuilding and re-persisting per batch:

```python
index = VectorStoreIndex([])  # start empty, insert batch by batch
for batch in batched(docs, 50):
    index.insert_nodes(splitter.get_nodes_from_documents(batch))
index.storage_context.persist(persist_dir="./storage")
```
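The `batched` helper is `itertools.batched` on Python 3.12+; on older versions you can define the same thing in a few lines:

```python
from itertools import islice

def batched(iterable, n):
    """Yield successive lists of at most n items (like itertools.batched, 3.12+)."""
    it = iter(iterable)
    while batch := list(islice(it, n)):
        yield batch

docs = list(range(105))  # stand-in for loaded documents
sizes = [len(b) for b in batched(docs, 50)]
print(sizes)  # [50, 50, 5]
```

Because each batch is materialized lazily, peak memory tracks the batch size rather than the full corpus.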
3. Returning too much context to the LLM
A high similarity_top_k plus large chunks means the prompt balloons. That increases token usage and memory pressure during generation.
```python
query_engine = index.as_query_engine(
    similarity_top_k=20,
    response_mode="compact",
)
```
Tighten retrieval:
```python
query_engine = index.as_query_engine(
    similarity_top_k=4,
    response_mode="compact",
)
```
If you need better recall, use reranking instead of just increasing top_k.
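The retrieve-wide-then-rerank idea can be sketched in plain Python. The overlap-based `score` below is a hypothetical stand-in for a real reranker (e.g. a cross-encoder postprocessor), not LlamaIndex's API:

```python
def rerank(question: str, chunks: list[str], keep: int = 4) -> list[str]:
    """Keep only the chunks most relevant to the question."""
    q_words = set(question.lower().split())

    def score(chunk: str) -> int:
        # Stand-in relevance score: word overlap with the question.
        # A real pipeline would call a cross-encoder model here.
        return len(q_words & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)[:keep]

# Retrieve wide (top 20), then pass only the best few chunks to the LLM.
chunks = [f"notes on unrelated topic {i}" for i in range(19)]
chunks.append("debugging memory errors during inference")
top = rerank("how do I debug memory errors?", chunks, keep=4)
print(top[0])
```

The point is the shape: recall comes from the wide retrieval, while the prompt (and memory) cost is set by the small `keep`, not by `similarity_top_k`.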
4. Holding onto full response objects
Some apps store `Response` objects, `source_nodes`, or entire chat histories in memory for logging or downstream processing.
```python
response = query_engine.query(question)
cache.append(response)  # keeps references alive
```
Extract only what you need:
```python
response = query_engine.query(question)
cache.append({
    "answer": str(response),
    "sources": [n.node_id for n in response.source_nodes[:3]],
})
```
How to Debug It
1. Confirm whether it's ingestion or inference.
   - If memory spikes during `from_documents()`, your problem is indexing.
   - If it spikes during `.query()` or `.aquery()`, it's inference-time prompt/context growth.
2. Check your retrieval settings.
   - Inspect `similarity_top_k`, chunk size, and response mode. Start with:

   ```python
   query_engine = index.as_query_engine(similarity_top_k=3)
   ```

3. Measure process memory around each stage.
   - Add logging before and after document loading, indexing, retrieval, and generation.
   - Watch RSS with `psutil`:

   ```python
   import os, psutil
   print(psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024, "MB")
   ```

4. Reduce concurrency to 1.
   - If the error disappears at single-threaded execution, you have a scaling problem.
   - Then add back concurrency gradually until you find the limit.
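For numbers finer-grained than RSS, the stdlib `tracemalloc` module can bracket a single stage. A sketch with a stand-in ingestion step:

```python
import tracemalloc

def load_documents():
    # Stand-in for a memory-hungry stage like SimpleDirectoryReader.load_data()
    return [str(i) * 100 for i in range(10_000)]

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
docs = load_documents()
after, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

grew = after - before  # bytes allocated by this stage and still held
print(f"stage held {grew / 1024:.0f} KiB, peak {peak / 1024:.0f} KiB")
```

Bracketing each stage this way tells you not just that memory grew, but which step holds on to it after returning.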
Prevention
- Persist indexes to disk and load them at runtime instead of rebuilding them per request.
- Keep chunk sizes and `similarity_top_k` small unless you have hard evidence they need to be larger.
- Put a hard cap on concurrent queries in your API worker pool or async pipeline.
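For synchronous servers, the hard cap can live in a bounded thread pool rather than a semaphore. A sketch where `handle_query` stands in for a handler that calls `query_engine.query`:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

in_flight = 0
peak = 0
lock = threading.Lock()

def handle_query(q: str) -> str:
    global in_flight, peak
    with lock:
        in_flight += 1
        peak = max(peak, in_flight)
    time.sleep(0.01)  # stands in for query_engine.query(q)
    with lock:
        in_flight -= 1
    return f"answer:{q}"

# max_workers is the hard cap on simultaneous queries
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(handle_query, [str(i) for i in range(10)]))

print(peak)  # bounded by max_workers
```

In WSGI deployments the equivalent knob is the worker/thread count you hand to gunicorn or uWSGI; either way, the limit is explicit rather than whatever traffic happens to arrive.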
If you’re running LlamaIndex in production on Python, treat memory as part of your API contract. The fix is usually not “more RAM” — it’s less work per request and less work happening at once.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.