How to Fix 'OOM error during inference when scaling' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21

When you hit an OOM (out-of-memory) error during inference while scaling a LlamaIndex app, it usually means your process ran out of memory while the model was generating or embedding at higher concurrency. In practice, this shows up when you move from one request at a time to multiple parallel queries, larger documents, or bigger context windows.

The failure is often not in LlamaIndex itself. It’s usually the way the index, retriever, or LLM is being used under load.

The Most Common Cause

The #1 cause is loading too much into memory per request, then multiplying that by concurrency.

In LlamaIndex, this usually happens when people:

  • build the index from a huge document set in-process
  • use a large Settings.chunk_size
  • run multiple .query() or .achat() calls in parallel
  • keep Response objects, source nodes, and full chat history around longer than needed

Broken | Fixed
Build index + query in the same hot path | Build once, persist it, reuse it
Large chunks and high top-k | Smaller chunks and tighter retrieval
Parallel requests without limits | Bound concurrency

Here’s the broken pattern:

# BROKEN: reloading and rebuilding on every request
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

def answer_question(question: str):
    docs = SimpleDirectoryReader("./docs").load_data()
    index = VectorStoreIndex.from_documents(docs)  # expensive and memory-heavy
    query_engine = index.as_query_engine(similarity_top_k=20)
    response = query_engine.query(question)
    return str(response)

And the fixed pattern:

# FIXED: build once, persist, and reuse
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

def build_index_once():
    docs = SimpleDirectoryReader("./docs").load_data()
    index = VectorStoreIndex.from_documents(docs)
    index.storage_context.persist(persist_dir="./storage")

def get_query_engine():
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)
    return index.as_query_engine(similarity_top_k=5)

query_engine = get_query_engine()

def answer_question(question: str):
    response = query_engine.query(question)
    return str(response)

If you’re using an API server like FastAPI or Flask, rebuilding inside each request handler is the fastest way to trigger OOM under load.
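
For example, with FastAPI the pattern looks like this. This is a minimal sketch that assumes the get_query_engine() helper above and a persisted index in ./storage; the /ask route is just illustrative:

# Load the engine once at startup, not inside the handler
from fastapi import FastAPI

app = FastAPI()
query_engine = get_query_engine()  # reused by every request

@app.post("/ask")
def ask(question: str):
    # The handler only runs the query; no loading or index building here
    return {"answer": str(query_engine.query(question))}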

Other Possible Causes

1. Too many concurrent inference calls

If you fan out requests without a cap, memory spikes fast. This is common with async workers or background jobs.

# risky: no concurrency control
results = await asyncio.gather(*[query_engine.aquery(q) for q in questions])

Use a semaphore:

import asyncio

sem = asyncio.Semaphore(2)  # at most 2 queries in flight at once

async def limited_query(q):
    async with sem:
        return await query_engine.aquery(q)

results = await asyncio.gather(*[limited_query(q) for q in questions])

2. Embedding large batches at once

Bulk ingestion can OOM before inference even starts. SentenceSplitter, embedding models, and vector stores all consume RAM during ingestion.

# risky: huge batch ingestion
nodes = splitter.get_nodes_from_documents(docs)  # massive docs => huge node list
index = VectorStoreIndex(nodes)

Prefer smaller batches:

index = VectorStoreIndex(nodes=[])  # start empty, then insert in batches
for batch in batched(docs, 50):  # itertools.batched (Python 3.12+) or the helper below
    index.insert_nodes(splitter.get_nodes_from_documents(list(batch)))
index.storage_context.persist(persist_dir="./storage")
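
itertools.batched only exists on Python 3.12+. On older versions, a small stand-in (the name batched here is just illustrative) does the same job:

from itertools import islice

def batched(iterable, n):
    # Yield successive lists of at most n items
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk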

3. Returning too much context to the LLM

A high similarity_top_k plus large chunks means the prompt balloons. That increases token usage and memory pressure during generation.

query_engine = index.as_query_engine(
    similarity_top_k=20,
    response_mode="compact",
)

Tighten retrieval:

query_engine = index.as_query_engine(
    similarity_top_k=4,
    response_mode="compact",
)

If you need better recall, use reranking instead of just increasing top_k.
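
One way to do that is to retrieve a wider candidate set and let a reranker trim it before generation. This is a sketch assuming llama-index-core's SentenceTransformerRerank postprocessor (it needs the sentence-transformers package installed); the model name and numbers are just examples:

from llama_index.core.postprocessor import SentenceTransformerRerank

# Retrieve 12 candidates, but only pass the 4 best to the LLM
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=4,
)
query_engine = index.as_query_engine(
    similarity_top_k=12,
    node_postprocessors=[reranker],
    response_mode="compact",
)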

4. Holding onto full response objects

Some apps store Response, source_nodes, or entire chat histories in memory for logging or downstream processing.

response = query_engine.query(question)
cache.append(response)  # keeps references alive

Extract only what you need:

response = query_engine.query(question)
cache.append({
    "answer": str(response),
    "sources": [n.node_id for n in response.source_nodes[:3]],
})

How to Debug It

  1. Confirm whether it’s ingestion or inference

    • If memory spikes during from_documents(), your problem is indexing.
    • If it spikes during .query() or .aquery(), it’s inference-time prompt/context growth.
  2. Check your retrieval settings

    • Inspect similarity_top_k, chunk size, and response mode.
    • Start with:
      query_engine = index.as_query_engine(similarity_top_k=3)
      
  3. Measure process memory around each stage

    • Add logging before and after document loading, indexing, retrieval, and generation (a small helper for this is sketched after this list).
    • Watch RSS with psutil:
      import os, psutil
      print(psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024, "MB")
      
  4. Reduce concurrency to 1

    • If the error disappears at single-threaded execution, you have a scaling problem.
    • Then add back concurrency gradually until you find the limit.
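
For step 3, a small context manager keeps the logging out of your business logic. This is a sketch; the log_rss name is just illustrative and it assumes psutil is installed:

import os
from contextlib import contextmanager

import psutil

@contextmanager
def log_rss(stage: str):
    # Print resident memory before and after a stage to see where growth happens
    proc = psutil.Process(os.getpid())
    before = proc.memory_info().rss / 1024 / 1024
    yield
    after = proc.memory_info().rss / 1024 / 1024
    print(f"{stage}: {before:.0f} MB -> {after:.0f} MB")

with log_rss("retrieval + generation"):
    response = query_engine.query(question)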

Prevention

  • Persist indexes to disk and load them at runtime instead of rebuilding them per request.
  • Keep chunk sizes and similarity_top_k small unless you have hard evidence they need to be larger.
  • Put a hard cap on concurrent queries in your API worker pool or async pipeline.

If you’re running LlamaIndex in production on Python, treat memory as part of your API contract. The fix is usually not “more RAM” — it’s less work per request and less work happening at once.



By Cyprian Aarons, AI Consultant at Topiax.
