How to Fix 'OOM error during inference' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21

If you’re seeing an OOM (out-of-memory) error during inference in LlamaIndex, your process is running out of memory while the model is generating a response or embedding text. In practice, this usually shows up when you feed too much context into a local LLM, load a model that’s too large for your GPU/CPU, or let retrieval return far more chunks than the model can handle.

The fix is usually not “increase RAM” first. It’s almost always about reducing prompt size, controlling retrieval, or using a smaller model/runtime configuration.

The Most Common Cause

The #1 cause is stuffing too much text into the LLM context window during query_engine.query() or chat_engine.chat(). With LlamaIndex, this often happens when SimilarityPostprocessor is missing, similarity_top_k is too high, or you’re using a large CompactAndRefine-style synthesis path on long documents.

Here’s the broken pattern versus the fixed one:

Broken: retrieves too many nodes and sends them all to inference; uses default settings blindly.
Fixed: limits retrieval and trims context before synthesis; sets explicit chunking and top-k limits.
# BROKEN
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

query_engine = index.as_query_engine(similarity_top_k=20)
response = query_engine.query("Summarize the policy exclusions and claim limits.")
print(response)
# FIXED
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.postprocessor import SimilarityPostprocessor

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

query_engine = index.as_query_engine(
    similarity_top_k=4,
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.75)
    ],
)

response = query_engine.query("Summarize the policy exclusions and claim limits.")
print(response)

If you’re using a local model through llama_index.llms.ollama.Ollama, llama_index.llms.huggingface.HuggingFaceLLM, or another backend, the same issue applies: too many tokens in means memory spikes during inference.
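
A cheap pre-flight check helps here. This is a rough sketch, not a LlamaIndex API: the ~4-characters-per-token ratio, the 4096-token window, and the `fits_context` helper are all assumptions for illustration — use your model’s actual tokenizer and context size for real counts.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: roughly 4 characters per token for English prose.
    return max(1, len(text) // 4)


def fits_context(chunks: list[str], question: str,
                 context_window: int = 4096, reserve: int = 512) -> bool:
    # Reserve headroom for the system prompt and the generated answer.
    prompt = question + "\n\n" + "\n\n".join(chunks)
    return estimate_tokens(prompt) <= context_window - reserve
```

If `fits_context` returns False before you ever call the LLM, you know the OOM is coming from prompt size, not from model loading.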

Other Possible Causes

1) Your chunk size is too large

Large chunks create huge embeddings and oversized prompt payloads.

# Too large
from llama_index.core import Settings
Settings.chunk_size = 4096
Settings.chunk_overlap = 512

Use smaller chunks for retrieval-heavy workloads:

from llama_index.core import Settings
Settings.chunk_size = 512
Settings.chunk_overlap = 64
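
To see why this matters, here is a character-based simplification of chunking (LlamaIndex’s splitters work on tokens and sentence boundaries, so treat this as a sketch): smaller chunks produce more, lighter nodes, and each retrieved node contributes less to the final prompt.

```python
def split_into_chunks(text: str, chunk_size: int = 512,
                      overlap: int = 64) -> list[str]:
    # Advance by (chunk_size - overlap) so consecutive chunks share context.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With `similarity_top_k=4`, four 512-character chunks cost a quarter of the prompt budget that four 4096-character chunks would.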

2) You loaded a model that does not fit your hardware

This is common with local inference backends. A 7B+ model in full precision can blow up GPU memory fast.

# Example: too aggressive for limited VRAM
from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    device_map="cuda",
)

Safer configuration:

llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",  # let accelerate spread layers across available devices
    model_kwargs={"torch_dtype": "float16"},  # halves weight memory vs full precision
)

If your stack supports quantization, use it.
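
The back-of-envelope arithmetic makes the stakes concrete. The helper below is just that arithmetic, not a library call, and it counts weights only — activations, the KV cache, and framework overhead come on top:

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    # Weight memory only: parameters x bits, converted from bits to gigabytes.
    return params_billion * bits_per_param / 8
```

An 8B model is ~32 GB of weights in fp32, ~16 GB in fp16, and ~4 GB with 4-bit quantization — which is why an 8 GB GPU cannot hold an unquantized 8B model even before the KV cache grows.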

3) You are returning too many source nodes in the response

This doesn’t just affect display. Some response modes keep extra context around during synthesis.

query_engine = index.as_query_engine(
    similarity_top_k=10,
    response_mode="tree_summarize",
    verbose=True,
)

Try a smaller top-k and a simpler mode:

query_engine = index.as_query_engine(
    similarity_top_k=3,
    response_mode="compact",
)
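
To build intuition for the difference: compact-style synthesis greedily packs retrieved chunks into as few LLM calls as possible, while tree_summarize builds a recursive tree of intermediate summaries it has to keep around. This is a deliberately simplified sketch of the packing idea, not LlamaIndex’s actual synthesizer code:

```python
def compact_pack(node_texts: list[str], max_chars: int = 2000) -> list[str]:
    # Greedily pack retrieved chunks into as few prompts (LLM calls) as possible.
    prompts: list[str] = []
    current = ""
    for text in node_texts:
        if current and len(current) + len(text) > max_chars:
            prompts.append(current)
            current = ""
        current += text
    if current:
        prompts.append(current)
    return prompts
```

Fewer, fuller calls means less simultaneous state in memory than a tree of intermediate summaries.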

4) You are indexing huge documents without preprocessing

A single PDF or contract dump can create thousands of nodes if you don’t split it properly.

from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(chunk_size=256, chunk_overlap=32)
nodes = parser.get_nodes_from_documents(docs)
index = VectorStoreIndex(nodes)

Also remove boilerplate like headers, footers, repeated disclaimers, and OCR noise before indexing.
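
One simple heuristic for that cleanup, assuming you have page-level text (the function name and threshold are illustrative, not a LlamaIndex utility): drop any line that repeats across many pages, since those are almost always headers, footers, or disclaimers.

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_repeats: int = 3) -> list[str]:
    # Count how many pages each distinct line appears on.
    counts = Counter(line for page in pages for line in set(page.splitlines()))
    # Keep only lines that appear on fewer than min_repeats pages.
    return [
        "\n".join(l for l in page.splitlines() if counts[l] < min_repeats)
        for page in pages
    ]
```

Run this before chunking so the boilerplate never becomes nodes in the first place.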

How to Debug It

  1. Check where the failure happens

    • If it crashes during index.as_query_engine().query(...), it’s likely prompt/context size.
    • If it crashes while loading the model, it’s model memory.
    • If it crashes during embedding/indexing, it’s document size or batch size.
  2. Print retrieved node counts

    retriever = index.as_retriever(similarity_top_k=10)
    print(len(retriever.retrieve("your question")))
    

    If this number is high, reduce similarity_top_k and add a cutoff postprocessor.

  3. Inspect token usage

    • Log prompt length before calling the LLM.
    • In local setups, watch GPU memory with nvidia-smi.
    • In CPU setups, watch RSS with htop or ps.
  4. Reduce one variable at a time

    • Cut similarity_top_k from 10 to 3.
    • Reduce chunk_size from 1024 to 256.
    • Switch to a smaller model.
    • Disable fancy response modes like tree_summarize.
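
For CPU setups, you can bracket each stage from inside the process with the standard-library resource module instead of watching htop. This is a Unix-only sketch; the function name is ours, and ru_maxrss units differ by platform:

```python
import resource
import sys


def peak_rss_mb() -> float:
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / (1024 * 1024) if sys.platform == "darwin" else rss / 1024


# Log peak memory around each stage (load model, build index, run query)
# to find which one spikes:
before = peak_rss_mb()
data = [0] * 1_000_000  # stand-in for the stage under test
after = peak_rss_mb()
print(f"peak RSS went from {before:.1f} MiB to {after:.1f} MiB")
```

Note that ru_maxrss is a high-water mark: it only grows, so compare readings taken before and after each stage in order.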

A real error message often looks like this:

RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 7.79 GiB total capacity; 6.91 GiB already allocated)

Or on CPU-bound inference:

MemoryError: OOM error during inference in llm.predict()

When you see either one, stop guessing and isolate the stage that consumes memory.

Prevention

  • Keep retrieval tight:
    • Start with similarity_top_k=3 or 4
    • Add SimilarityPostprocessor(similarity_cutoff=...)
  • Use sane ingestion defaults:
    • Chunk at 256–512 tokens for RAG workloads
    • Strip boilerplate before indexing
  • Match model size to hardware:
    • Use quantized or smaller models on local machines
    • Don’t run an 8B+ model in full precision on weak VRAM

If you build RAG systems in production, treat memory as part of your API contract. The moment your retrieval layer stops respecting token budgets, inference will fail long before your app logic does.
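
One way to make that contract explicit is a trimming step between retrieval and synthesis. The helper below is a hypothetical sketch (its name, the ~4 chars/token estimate, and the default budget are all assumptions): it keeps the best-ranked chunks until the budget is spent, so the prompt can never exceed what you promised downstream.

```python
def enforce_token_budget(ranked_chunks: list[str],
                         budget_tokens: int = 3000) -> list[str]:
    # Keep the best-ranked chunks until the estimated token budget is spent.
    kept: list[str] = []
    used = 0
    for text in ranked_chunks:  # assumed sorted best-first by the retriever
        cost = max(1, len(text) // 4)  # rough ~4 characters/token estimate
        if used + cost > budget_tokens:
            break
        kept.append(text)
        used += cost
    return kept
```

Because the list is assumed best-first, trimming from the tail drops the least relevant context first.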



By Cyprian Aarons, AI Consultant at Topiax.
