How to Fix 'OOM error during inference' in LlamaIndex (Python)
If you’re seeing OOM error during inference in LlamaIndex, your process is running out of memory while the model is generating a response or embedding text. In practice, this usually shows up when you feed too much context into a local LLM, load a model that’s too large for your GPU/CPU, or let retrieval return far more chunks than the model can handle.
The fix is usually not “increase RAM” first. It’s almost always about reducing prompt size, controlling retrieval, or using a smaller model/runtime configuration.
The Most Common Cause
The #1 cause is stuffing too much text into the LLM context window during query_engine.query() or chat_engine.chat(). In LlamaIndex, this usually happens when similarity_top_k is too high, no SimilarityPostprocessor filters out weak matches, or a refine-style synthesizer (such as CompactAndRefine) runs over long documents.
Here’s the broken pattern versus the fixed one:
| Broken | Fixed |
|---|---|
| Retrieves too many nodes and sends them all to inference | Limits retrieval and trims context before synthesis |
| Uses default settings blindly | Sets explicit chunking and top-k limits |
# BROKEN
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

# Retrieves 20 chunks and stuffs every one of them into a single prompt
query_engine = index.as_query_engine(similarity_top_k=20)
response = query_engine.query("Summarize the policy exclusions and claim limits.")
print(response)
# FIXED
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.postprocessor import SimilarityPostprocessor

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

query_engine = index.as_query_engine(
    similarity_top_k=4,  # retrieve only a handful of chunks
    node_postprocessors=[
        # drop weakly related chunks before they reach the prompt
        SimilarityPostprocessor(similarity_cutoff=0.75)
    ],
)
response = query_engine.query("Summarize the policy exclusions and claim limits.")
print(response)
If you’re using a local model through llama_index.llms.ollama.Ollama, llama_index.llms.huggingface.HuggingFaceLLM, or another backend, the same issue applies: the more tokens you send in, the larger the memory spike during inference.
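With the Ollama backend, for example, you can cap the context window so the prompt budget stays small from the start. A minimal sketch, assuming the llama-index-llms-ollama integration and a locally pulled llama3.1 model (the model tag and the 4096-token window are assumptions, not fixed recommendations):
# Sketch: keep the local model's context budget small.
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

Settings.llm = Ollama(
    model="llama3.1",        # assumed local model tag
    request_timeout=120.0,   # local generation can be slow
    context_window=4096,     # cap the prompt budget
)
Settings.context_window = 4096  # make the rest of the pipeline respect the same cap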
Other Possible Causes
1) Your chunk size is too large
Large chunks inflate the embedding workload and produce oversized prompt payloads at query time.
# Too large
from llama_index.core import Settings
Settings.chunk_size = 4096
Settings.chunk_overlap = 512
Use smaller chunks for retrieval-heavy workloads:
from llama_index.core import Settings
Settings.chunk_size = 512
Settings.chunk_overlap = 64
2) You loaded a model that does not fit your hardware
This is common with local inference backends. An 8B model is roughly 32 GB of weights in float32 and about 16 GB in float16, so a 7B+ model in full precision can blow up GPU memory fast.
# Example: too aggressive for limited VRAM
from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    device_map="cuda",  # pins the whole model to one GPU, full-precision weights
)
Safer configuration:
llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",                        # let accelerate place layers across devices
    model_kwargs={"torch_dtype": "float16"},  # half-precision weights: ~16 GB instead of ~32 GB
)
If your stack supports quantization, use it.
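For Hugging Face backends, one option is 4-bit quantization through transformers' BitsAndBytesConfig, passed in via model_kwargs. A minimal sketch, assuming bitsandbytes is installed and a CUDA GPU is available; the exact settings are a starting point, not a tuned configuration:
# Sketch: load the model with 4-bit quantized weights.
import torch
from transformers import BitsAndBytesConfig
from llama_index.llms.huggingface import HuggingFaceLLM

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_compute_dtype=torch.float16,  # compute in half precision
)

llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
    model_kwargs={"quantization_config": quant_config},
)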
3) You are returning too many source nodes in the response
This doesn’t just affect display. Some response modes keep extra context around during synthesis.
query_engine = index.as_query_engine(
    similarity_top_k=10,
    response_mode="tree_summarize",
    verbose=True,
)
Try a smaller top-k and a simpler mode:
query_engine = index.as_query_engine(
    similarity_top_k=3,
    response_mode="compact",
)
4) You are indexing huge documents without preprocessing
A single PDF or contract dump can create thousands of nodes if you don’t split it properly.
from llama_index.core.node_parser import SentenceSplitter
parser = SentenceSplitter(chunk_size=256, chunk_overlap=32)
nodes = parser.get_nodes_from_documents(docs)
index = VectorStoreIndex(nodes)
Also remove boilerplate like headers, footers, repeated disclaimers, and OCR noise before indexing.
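A minimal cleaning sketch; the regex below is a placeholder for whatever headers, footers, and disclaimers your own documents actually repeat:
# Sketch: strip repeated boilerplate lines before chunking and indexing.
import re
from llama_index.core import Document, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Placeholder patterns -- replace with the boilerplate your documents contain.
BOILERPLATE = re.compile(r"(?im)^(page \d+ of \d+|confidential.*|all rights reserved.*)$")

docs = SimpleDirectoryReader("data").load_data()
cleaned = [
    Document(text=BOILERPLATE.sub("", d.text), metadata=d.metadata)
    for d in docs
]

parser = SentenceSplitter(chunk_size=256, chunk_overlap=32)
nodes = parser.get_nodes_from_documents(cleaned)
index = VectorStoreIndex(nodes)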
How to Debug It
- Check where the failure happens:
  - If it crashes during index.as_query_engine().query(...), it's likely prompt/context size.
  - If it crashes while loading the model, it's model memory.
  - If it crashes during embedding/indexing, it's document size or batch size.
- Print retrieved node counts:
  print(len(query_engine.retrieve("your question")))
  If this number is high, reduce similarity_top_k and add a cutoff postprocessor.
- Inspect token usage (see the sketch after this list):
  - Log prompt length before calling the LLM.
  - In local setups, watch GPU memory with nvidia-smi.
  - In CPU setups, watch RSS with htop or ps.
- Reduce one variable at a time:
  - Cut similarity_top_k from 10 to 3.
  - Reduce chunk_size from 1024 to 256.
  - Switch to a smaller model.
  - Disable fancy response modes like tree_summarize.
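For the token-usage step, LlamaIndex ships a TokenCountingHandler that records prompt and completion tokens per LLM call. A minimal sketch, assuming tiktoken is installed and reusing the index built earlier; the cl100k_base encoding is only an approximation for non-OpenAI models:
# Sketch: measure how many tokens each query actually sends and receives.
import tiktoken
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

token_counter = TokenCountingHandler(
    tokenizer=tiktoken.get_encoding("cl100k_base").encode
)
Settings.callback_manager = CallbackManager([token_counter])

# Build the query engine *after* attaching the callback manager.
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("Summarize the policy exclusions and claim limits.")

print("prompt tokens:", token_counter.prompt_llm_token_count)
print("completion tokens:", token_counter.completion_llm_token_count)
print("embedding tokens:", token_counter.total_embedding_token_count)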
A real error message often looks like this:
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 7.79 GiB total capacity; 6.91 GiB already allocated)
Or on CPU-bound inference:
MemoryError: OOM error during inference in llm.predict()
When you see either one, stop guessing and isolate the stage that consumes memory.
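If the stage that fails is embedding/indexing, a smaller embedding batch size is often enough. A minimal sketch, assuming the llama-index-embeddings-huggingface integration; the model name is just an example:
# Sketch: embed fewer texts per forward pass to keep peak memory down.
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",  # example model, pick one that fits your hardware
    embed_batch_size=4,                   # lower this if indexing runs out of memory
)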
Prevention
- Keep retrieval tight:
  - Start with similarity_top_k=3 or 4.
  - Add SimilarityPostprocessor(similarity_cutoff=...).
- Use sane ingestion defaults:
  - Chunk at 256–512 tokens for RAG workloads.
  - Strip boilerplate before indexing.
- Match model size to hardware:
  - Use quantized or smaller models on local machines.
  - Don't run an 8B+ model in full precision on weak VRAM.
If you build RAG systems in production, treat memory as part of your API contract. The moment your retrieval layer stops respecting token budgets, inference will fail long before your app logic does.
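One way to make that contract concrete is to trim retrieved context to a budget before synthesis. A rough sketch, not a built-in LlamaIndex feature: the 4-characters-per-token estimate and the 3000-token budget are assumptions you should tune for your model:
# Sketch: keep only as many retrieved nodes as fit a rough token budget.
MAX_CONTEXT_TOKENS = 3000  # assumed budget, tune per model

retriever = index.as_retriever(similarity_top_k=8)
nodes = retriever.retrieve("Summarize the policy exclusions and claim limits.")

kept, used = [], 0
for node in nodes:
    est_tokens = len(node.get_content()) // 4  # crude ~4 chars/token estimate
    if used + est_tokens > MAX_CONTEXT_TOKENS:
        break
    kept.append(node)
    used += est_tokens

print(f"Keeping {len(kept)}/{len(nodes)} nodes, ~{used} tokens of context")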
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.