How to Fix 'OOM error during inference in production' in LlamaIndex (Python)
When you see OOM error during inference in production with LlamaIndex, it means your process ran out of memory while building embeddings, loading a model, or generating a response. In practice, this usually shows up under real traffic when a query pulls too much context, a batch is too large, or the model is loaded in an inefficient way.
The failure often appears as a Python MemoryError, a CUDA out-of-memory exception, or a process kill from the OS. With LlamaIndex, the stack trace usually points at RetrieverQueryEngine, ResponseSynthesizer, OpenAIEmbedding, HuggingFaceEmbedding, or your local LLM wrapper.
The Most Common Cause
The #1 cause is feeding too much context into the LLM at once.
In LlamaIndex, this usually happens when:
- you retrieve too many nodes
- your chunk size is too large
- you use a synthesis mode that stuffs everything into one prompt
- you keep long conversation history in memory
Here’s the broken pattern:
| Broken | Fixed |
|---|---|
| Retrieve 20+ nodes and stuff them into one prompt | Limit retrieval and use compact synthesis |
| Large chunks like 4096+ tokens | Smaller chunks like 512-1024 tokens |
| Default “stuff” behavior for long docs | Use compact or tree_summarize |
```python
# BROKEN: too much context stuffed into one inference call
from llama_index.core import VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine

index = VectorStoreIndex.from_documents(docs)  # docs: your loaded Document objects
query_engine = RetrieverQueryEngine.from_args(
    index.as_retriever(similarity_top_k=20),  # too many nodes
    response_mode="compact",  # can still blow up if each chunk is huge
)
response = query_engine.query(
    "Summarize all policy exceptions and edge cases."
)
print(response)
```
```python
# FIXED: reduce retrieved context and control synthesis
from llama_index.core import VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine

index = VectorStoreIndex.from_documents(docs)
retriever = index.as_retriever(similarity_top_k=5)  # bounded retrieval
query_engine = RetrieverQueryEngine.from_args(
    retriever,
    response_mode="compact",
)
response = query_engine.query(
    "Summarize the main policy exceptions."
)
print(response)
```
If you are using document ingestion, also fix chunking at the source:
```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=768, chunk_overlap=100)
nodes = splitter.get_nodes_from_documents(docs)
```
If your chunks are huge, every retrieval multiplies memory usage during prompt assembly and inference.
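A rough token-budget calculation makes that multiplication concrete. This is a back-of-the-envelope sketch: `estimated_prompt_tokens` and the 500-token overhead figure are illustrative assumptions, not LlamaIndex APIs.

```python
# Rough prompt-size estimate: retrieved context grows as top_k * chunk_size.
def estimated_prompt_tokens(top_k: int, chunk_size: int, overhead: int = 500) -> int:
    """Approximate tokens sent to the LLM: retrieved chunks plus an
    assumed system-prompt/question overhead (the 500 is a rough guess)."""
    return top_k * chunk_size + overhead

# 20 chunks of 4096 tokens vs 5 chunks of 768 tokens
print(estimated_prompt_tokens(20, 4096))  # 82420 tokens: exceeds most context windows
print(estimated_prompt_tokens(5, 768))    # 4340 tokens: comfortably bounded
```

The same query goes from roughly 82k tokens of context to under 5k just by tightening `similarity_top_k` and chunk size.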
Other Possible Causes
1. Loading a large local model on the wrong device
If you run a Hugging Face model on CPU with no quantization, or place it on GPU without enough VRAM, inference will fail fast.
```python
# risky on small instances
from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    device_map="cuda",
)
```
Use smaller models, quantization, or CPU fallback:
```python
llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
    model_kwargs={"load_in_4bit": True},
)
```
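A quick way to sanity-check model fit is to estimate weight memory as parameter count times bytes per parameter. This sketch ignores activations, the KV cache, and framework overhead, so treat the numbers as floors, not totals:

```python
def approx_weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Rough weight memory: parameters * bytes per parameter, in GiB.
    Ignores activations, KV cache, and framework overhead."""
    return params_billions * 1e9 * bytes_per_param / 1024**3

print(round(approx_weight_gb(8, 2), 1))    # fp16: ~14.9 GiB, too big for a 12 GB GPU
print(round(approx_weight_gb(8, 0.5), 1))  # 4-bit: ~3.7 GiB, fits on modest hardware
```

If the weights alone exceed your VRAM, no retrieval tuning will save you; pick a smaller or quantized model first.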
2. Embedding too many documents in one batch
A common ingestion-time OOM happens inside OpenAIEmbedding or HuggingFaceEmbedding when batching is too aggressive.
```python
# can spike memory during indexing
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
```
Reduce batch size if your embedding backend supports it:
```python
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    embed_batch_size=8,
)
```
If you’re indexing millions of nodes, do not build everything in one process.
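One way to keep ingestion bounded is to insert nodes in fixed-size batches rather than building the whole index in one call. The batching helper below is plain Python; the commented-out loop shows how it would plug into LlamaIndex's `insert_nodes`:

```python
# Batched ingestion sketch: index nodes in small slices so peak memory
# depends on the batch size, not the corpus size.
def batched(items, batch_size):
    """Yield successive fixed-size slices of a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# With a real LlamaIndex index this would look like:
#   for batch in batched(nodes, 1000):
#       index.insert_nodes(batch)  # memory stays bounded per batch
print([len(b) for b in batched(list(range(2500)), 1000)])  # [1000, 1000, 500]
```

For truly large corpora, pair this with an external vector store so embeddings are persisted as you go instead of held in RAM.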
3. Querying with long chat history
If you pass every prior message into the prompt buffer, token count and memory grow until inference breaks.
```python
# bad: unbounded chat history accumulation
chat_history.append(user_msg)
chat_history.append(assistant_msg)
response = chat_engine.chat(user_msg, chat_history=chat_history)  # full history every turn
```
Trim history before sending it to LlamaIndex:
```python
chat_history = chat_history[-6:]  # keep last few turns only
response = chat_engine.chat(user_msg, chat_history=chat_history)
```
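A fixed turn count is crude if message lengths vary a lot. A token-aware trim is safer; the sketch below uses character counts as a stand-in for tokens (`trim_history` is an illustrative helper, not a LlamaIndex API, and you would swap in a real tokenizer if one is available):

```python
def trim_history(messages, max_chars=4000):
    """Keep the most recent messages whose combined length fits the budget.
    Character counts stand in for tokens; substitute a tokenizer for accuracy."""
    kept, total = [], 0
    for msg in reversed(messages):          # walk newest to oldest
        if total + len(msg) > max_chars:
            break                           # budget exhausted, drop older turns
        kept.append(msg)
        total += len(msg)
    return list(reversed(kept))             # restore chronological order

history = ["a" * 3000, "b" * 2000, "c" * 1500]
print([len(m) for m in trim_history(history)])  # [2000, 1500]: oldest message dropped
```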
4. Response mode that expands tokens aggressively
Some synthesis modes are more expensive than others. `refine` issues a sequential LLM call per retrieved node, so it and other large-context prompt patterns can multiply memory usage across retrieved nodes.
```python
query_engine = index.as_query_engine(response_mode="refine")
```
Try a cheaper mode first:
```python
query_engine = index.as_query_engine(response_mode="compact")
```
For very long documents, tree_summarize is often safer than stuffing everything into one pass.
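The cost difference can be sketched with a simplified call-count model. This is an illustration of the idea, not the exact LlamaIndex algorithm: roughly, `refine` makes one LLM call per retrieved node, while `compact` first packs node texts into as few context windows as fit and then calls once per packed window.

```python
import math

def refine_calls(num_nodes: int) -> int:
    """Simplified model: one sequential LLM call per retrieved node."""
    return num_nodes

def compact_calls(num_nodes: int, chunk_tokens: int, window_tokens: int) -> int:
    """Simplified model: pack chunks into context windows, one call per window."""
    per_window = max(1, window_tokens // chunk_tokens)
    return math.ceil(num_nodes / per_window)

print(refine_calls(20))              # 20 sequential LLM calls over 20 nodes
print(compact_calls(20, 768, 8000))  # 2 calls: ~10 chunks packed per 8k window
```

Fewer, fuller calls means less repeated prompt assembly in memory and usually lower latency as well.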
How to Debug It
- Check where the crash happens:
  - If it fails during indexing, look at embedding batch size and chunking.
  - If it fails during querying, look at retrieval size and response mode.
  - If it fails right after loading the model, inspect VRAM/RAM usage.
- Print token and node counts:
  - Log how many nodes are being retrieved.
  - Log prompt length before calling the LLM.
  - If you see dozens of nodes or multi-thousand-token prompts, that’s your problem.
- Run with smaller limits:
  - Set `similarity_top_k=3`.
  - Reduce chunk size to 512–768 tokens.
  - Switch to `response_mode="compact"`.
  - If the error disappears, you’ve confirmed context explosion.
- Watch actual memory usage:
  - Use `htop`, `free -m`, Docker limits, or GPU tools like `nvidia-smi`.
  - If memory climbs steadily during ingestion, it’s batching or document volume.
  - If it spikes only at query time, it’s prompt assembly or model size.
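The node-count and prompt-length checks can be wrapped in a small helper. `context_report` and the 50,000-character budget are illustrative assumptions, not LlamaIndex APIs; with real objects you would feed it `n.get_content()` for each retrieved node:

```python
def context_report(node_texts, char_budget=50_000):
    """Summarize retrieved context size against a budget (chars proxy tokens)."""
    total = sum(len(t) for t in node_texts)
    return {"nodes": len(node_texts), "chars": total, "over_budget": total > char_budget}

# With real LlamaIndex objects this would be:
#   nodes = retriever.retrieve(query)
#   report = context_report([n.get_content() for n in nodes])
#   if report["over_budget"]: lower similarity_top_k or chunk_size
print(context_report(["x" * 4000] * 5))  # {'nodes': 5, 'chars': 20000, 'over_budget': False}
```

Logging this one line before every query makes context explosions visible long before the process is killed.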
Prevention
- Keep retrieval bounded:
  - Start with `similarity_top_k=3` to `5`.
  - Only increase it if evaluation proves you need more context.
- Control chunking early:
  - Use smaller chunks for production RAG workloads.
  - Avoid giant source chunks that turn every query into a memory event.
- Pick models that fit your deployment target:
  - Match model size to available RAM/VRAM.
  - Use quantized local models when running on constrained infrastructure.
If you want a quick rule: most LlamaIndex OOMs are not “LLM bugs.” They’re context management bugs. Cut the prompt size first, then tune batching and model footprint.
Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit, a PDF checklist plus starter code
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit