How to Fix 'OOM error during inference in production' in LlamaIndex (Python)

By Cyprian Aarons
Updated 2026-04-21

When you see an OOM error during inference in production with LlamaIndex, it means your process ran out of memory while building embeddings, loading a model, or generating a response. In practice, this usually shows up under real traffic when a query pulls in too much context, a batch is too large, or the model is loaded inefficiently.

The failure often appears as a Python MemoryError, a CUDA out-of-memory exception, or a process kill from the OS. With LlamaIndex, the stack trace usually points at RetrieverQueryEngine, ResponseSynthesizer, OpenAIEmbedding, HuggingFaceEmbedding, or your local LLM wrapper.

The Most Common Cause

The #1 cause is feeding too much context into the LLM at once.

In LlamaIndex, this usually happens when:

  • you retrieve too many nodes
  • your chunk size is too large
  • you use a synthesis mode that stuffs everything into one prompt
  • you keep long conversation history in memory

Here’s the broken pattern:

Broken                                              Fixed
Retrieve 20+ nodes and stuff them into one prompt   Limit retrieval and use compact synthesis
Large chunks like 4096+ tokens                      Smaller chunks like 512-1024 tokens
Default “stuff” behavior for long docs              Use compact or tree_summarize

# BROKEN: too much context stuffed into one inference call
from llama_index.core import VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine

index = VectorStoreIndex.from_documents(docs)

query_engine = RetrieverQueryEngine.from_args(
    index.as_retriever(similarity_top_k=20),  # too many nodes
    response_mode="compact"  # still can blow up if each chunk is huge
)

response = query_engine.query(
    "Summarize all policy exceptions and edge cases."
)
print(response)

# FIXED: reduce retrieved context and control synthesis
from llama_index.core import VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine

index = VectorStoreIndex.from_documents(docs)

retriever = index.as_retriever(similarity_top_k=5)

query_engine = RetrieverQueryEngine.from_args(
    retriever,
    response_mode="compact"
)

response = query_engine.query(
    "Summarize the main policy exceptions."
)
print(response)

If you are using document ingestion, also fix chunking at the source:

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=768, chunk_overlap=100)
nodes = splitter.get_nodes_from_documents(docs)

If your chunks are huge, every retrieval multiplies memory usage during prompt assembly and inference.
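
As a rough back-of-envelope check (the numbers below are illustrative, not measured), multiply your top-k by your chunk size to estimate how many tokens each query pulls into the prompt:

# Illustrative estimate: retrieved context grows with similarity_top_k * chunk_size
broken_context = 20 * 4096   # ~82k tokens of retrieved text, before the question and system prompt
fixed_context = 5 * 768      # ~3.8k tokens for the same query

print(broken_context, fixed_context)  # 81920 vs 3840, roughly a 20x difference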

Other Possible Causes

1. Loading a large local model on the wrong device

If you run a Hugging Face model on CPU with no quantization, or place it on GPU without enough VRAM, inference will fail fast.

# risky on small instances
from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    device_map="cuda"  # full-precision weights on one GPU; an 8B model needs roughly 16 GB of VRAM
)

Use smaller models, quantization, or CPU fallback:

llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",                    # let accelerate spread layers across available devices
    model_kwargs={"load_in_4bit": True}   # 4-bit quantization via bitsandbytes
)
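
On newer transformers versions, the same idea is usually expressed through a BitsAndBytesConfig passed in model_kwargs. A minimal sketch, assuming bitsandbytes is installed and a CUDA GPU is available:

from transformers import BitsAndBytesConfig
from llama_index.llms.huggingface import HuggingFaceLLM

# Quantize to 4 bits at load time instead of the legacy load_in_4bit kwarg
llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
    model_kwargs={"quantization_config": BitsAndBytesConfig(load_in_4bit=True)},
)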

2. Embedding too many documents in one batch

A common ingestion-time OOM happens inside OpenAIEmbedding or HuggingFaceEmbedding when batching is too aggressive.

# can spike memory during indexing
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")

Reduce batch size if your embedding backend supports it:

embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",  # smaller model than bge-large
    embed_batch_size=8                    # embed fewer texts per batch
)

If you’re indexing millions of nodes, do not build everything in one process.
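
Even within a single process, you can keep peak memory bounded by building the index empty and inserting nodes in batches, so only one batch is embedded at a time. A minimal sketch; the batch size and the `nodes` variable are assumptions to adapt to your pipeline:

from llama_index.core import VectorStoreIndex

# Assumption: `nodes` is the full list of parsed nodes from your splitter
index = VectorStoreIndex(nodes=[])  # start with an empty index

batch_size = 1000  # illustrative; tune to your memory budget
for i in range(0, len(nodes), batch_size):
    index.insert_nodes(nodes[i : i + batch_size])  # embeds one batch at a time

For truly huge corpora, a common approach is to split these batches across separate ingestion jobs that all write into the same persistent vector store.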

3. Querying with long chat history

If you pass every prior message into the prompt buffer, token count and memory grow until inference breaks.

# bad: unbounded chat history accumulation
chat_history.append(user_msg)       # every turn gets appended...
chat_history.append(assistant_msg)
# ...and the full, untrimmed history is sent with the next message
response = chat_engine.chat(next_user_msg, chat_history=chat_history)

Trim history before sending it to LlamaIndex:

chat_history = chat_history[-6:]  # keep last few turns only
response = chat_engine.chat(user_msg, chat_history=chat_history)
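
If you use LlamaIndex's chat engines, you can also cap history by tokens instead of turns with a ChatMemoryBuffer. A minimal sketch, assuming an existing vector index and a token budget of 3000 (pick a limit that fits your model and memory headroom):

from llama_index.core.memory import ChatMemoryBuffer

# token_limit is an assumed budget; older turns are dropped once it is exceeded
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)

chat_engine = index.as_chat_engine(chat_mode="context", memory=memory)
response = chat_engine.chat(user_msg)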

4. Response mode that expands tokens aggressively

Some synthesis modes are more expensive than others. refine issues a separate LLM call per retrieved node, and combined with large chunks it can multiply token and memory usage across the whole retrieval set.

query_engine = index.as_query_engine(response_mode="refine")

Try a cheaper mode first:

query_engine = index.as_query_engine(response_mode="compact")

For very long documents, tree_summarize is often safer than stuffing everything into one pass.
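
Switching the mode is a one-line change:

query_engine = index.as_query_engine(response_mode="tree_summarize")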

How to Debug It

  1. Check where the crash happens

    • If it fails during indexing, look at embedding batch size and chunking.
    • If it fails during querying, look at retrieval size and response mode.
    • If it fails after loading the model, inspect VRAM/RAM usage.
  2. Print token and node counts

    • Log how many nodes are being retrieved.
    • Log prompt length before calling the LLM (see the sketch after this list).
    • If you see dozens of nodes or multi-thousand-token prompts, that’s your problem.
  3. Run with smaller limits

    • Set similarity_top_k=3
    • Reduce chunk size to 512–768 tokens
    • Switch to response_mode="compact"
    • If the error disappears, you’ve confirmed context explosion.
  4. Watch actual memory usage

    • Use htop, free -m, Docker limits, or GPU tools like nvidia-smi
    • If memory climbs steadily during ingestion, it’s batching or document volume
    • If it spikes only on query time, it’s prompt assembly or model size
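
Here is a minimal sketch of that kind of logging, run before you hit the full query engine. The query string is an example, and the 4-characters-per-token ratio is a rough heuristic, not an exact count:

retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve("Summarize the main policy exceptions.")

# Log how many nodes came back and roughly how many tokens they will add to the prompt
total_chars = sum(len(n.get_content()) for n in nodes)
print(f"retrieved nodes: {len(nodes)}")
print(f"approx context tokens: {total_chars // 4}")  # rough 4-chars-per-token heuristic

# If you run a local model on GPU, check VRAM at the same point (assumes PyTorch + CUDA)
import torch
if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")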

Prevention

  • Keep retrieval bounded:

    • Start with similarity_top_k=3 to 5
    • Only increase if evaluation proves you need more context
  • Control chunking early:

    • Use smaller chunks for production RAG workloads
    • Avoid giant source chunks that turn every query into a memory event
  • Pick models that fit your deployment target:

    • Match model size to available RAM/VRAM
    • Use quantized local models when running on constrained infrastructure
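
Taken together, these guardrails can be set once at startup through LlamaIndex's global Settings. The exact values below are starting-point assumptions, not universal recommendations:

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Conservative defaults for a memory-constrained deployment (illustrative values)
Settings.chunk_size = 768
Settings.chunk_overlap = 100
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    embed_batch_size=8,
)

query_engine = index.as_query_engine(similarity_top_k=5, response_mode="compact")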

If you want a quick rule: most LlamaIndex OOMs are not “LLM bugs.” They’re context management bugs. Cut the prompt size first, then tune batching and model footprint.


By Cyprian Aarons, AI Consultant at Topiax.