How to Fix 'OOM error during inference during development' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21

An OOM error during inference during development usually means your Python process ran out of memory while LlamaIndex was loading a model, embedding a large batch, or keeping too much text in RAM. In practice, this shows up when you run local inference with a big LLM, chunk too much data at once, or accidentally keep multiple copies of your index and embeddings in memory.

The stack trace often ends in something like:

  • RuntimeError: CUDA out of memory
  • MemoryError
  • torch.OutOfMemoryError
  • ValueError: Failed to load model
  • A traceback that points into llama_index.core.indices.vector_store.VectorStoreIndex

The Most Common Cause

The #1 cause is building the index from too much text at once and using a heavyweight embedding/LLM setup in the same process.

A common bad pattern is loading all documents, chunking them aggressively, then calling .from_documents() with default settings that create large batches of embeddings.

Broken pattern                                   | Fixed pattern
Loads everything into memory at once             | Streams or batches documents
Uses default embed batch sizes                   | Uses smaller batch sizes
Keeps a large local model alive during indexing  | Separates indexing from inference
No memory controls                               | Explicit chunking and model limits

# BROKEN
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("./data").load_data()

# This can blow up memory if docs are large
index = VectorStoreIndex.from_documents(docs)

query_engine = index.as_query_engine()
response = query_engine.query("Summarize the policy changes")
print(response)

# FIXED
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    embed_batch_size=8,
)

docs = SimpleDirectoryReader("./data").load_data()

# If your corpus is large, process fewer docs per run
index = VectorStoreIndex.from_documents(docs[:20])

query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("Summarize the policy changes")
print(response)

If you are using a local model through llama_index.llms.ollama, llama_index.llms.huggingface, or a custom LLM, the same issue applies. The model plus embeddings plus document chunks can exceed RAM or VRAM before you even get to inference.
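One way to keep the peak lower is to finish indexing before the local LLM is ever loaded. Below is a minimal sketch of that split, assuming the llama-index-llms-ollama and llama-index-embeddings-huggingface integration packages are installed and an Ollama server is running locally; the model names are just examples.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# 1) Index with only a small embedding model in memory
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    embed_batch_size=8,
)
docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)
del docs  # release the raw documents before the LLM is loaded

# 2) Only now attach the local LLM for inference
Settings.llm = Ollama(model="llama3", request_timeout=120.0)
query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("Summarize the policy changes"))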

Other Possible Causes

1) Your chunk size is too large

Large chunks create fewer nodes, but each node becomes expensive to embed and store. That hurts both indexing and retrieval.

from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(chunk_size=2048, chunk_overlap=200)  # risky for dev machines

Use smaller chunks:

from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
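On its own the splitter does nothing; it has to be passed into the ingestion step. A minimal sketch, assuming the same ./data directory as the earlier examples:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
docs = SimpleDirectoryReader("./data").load_data()

# Pass the splitter as a transformation so every document is chunked
# with these limits before it is embedded
index = VectorStoreIndex.from_documents(docs, transformations=[parser])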

2) Your local LLM is too big for your machine

If you are running a local Hugging Face model or quantized Llama variant, the model itself may consume most available memory. Then inference fails as soon as prompt context grows.

from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-2-13b-chat-hf"
)

Try a smaller model:

from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"
)
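Whichever model you pick, you can also cap the context window and generation length and load the weights in half precision. A sketch; the exact numbers are illustrative and depend on your hardware:

import torch
from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    context_window=2048,  # cap how much prompt context the model accepts
    max_new_tokens=256,   # cap generation length
    device_map="auto",    # let transformers place layers on the available device
    model_kwargs={"torch_dtype": torch.float16},  # half precision roughly halves weight memory
)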

3) You are retrieving too many nodes per query

A high similarity_top_k can flood the prompt with context. That increases token count and can trigger OOM during response generation.

query_engine = index.as_query_engine(similarity_top_k=20)

Reduce it:

query_engine = index.as_query_engine(similarity_top_k=3)
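If a small top-k loses too much recall, another option is to keep a moderate top-k but drop low-scoring nodes before they reach the prompt. A sketch using the built-in similarity cutoff postprocessor; the cutoff value is a starting point, not a rule:

from llama_index.core.postprocessor import SimilarityPostprocessor

# 'index' is the VectorStoreIndex built earlier
query_engine = index.as_query_engine(
    similarity_top_k=5,
    # Discard retrieved nodes scoring below the cutoff so weak matches
    # never inflate the prompt
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.75)],
)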

4) You are keeping multiple indexes or readers alive in the same process

This happens in notebooks and long-running dev scripts. If you rebuild indexes repeatedly without clearing references, memory usage climbs until the process dies.

indexes = []
for path in paths:
    docs = SimpleDirectoryReader(path).load_data()
    indexes.append(VectorStoreIndex.from_documents(docs))

Fix it by processing one dataset at a time:

for path in paths:
    docs = SimpleDirectoryReader(path).load_data()
    index = VectorStoreIndex.from_documents(docs)
    del docs
    del index
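In practice you usually want to keep what you built, so persist each index to disk before releasing it. A sketch; the paths list and storage directory naming are illustrative:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

paths = ["./data/reports", "./data/policies"]  # example dataset folders

for i, path in enumerate(paths):
    docs = SimpleDirectoryReader(path).load_data()
    index = VectorStoreIndex.from_documents(docs)

    # Write the index to disk so nothing has to stay resident in RAM
    index.storage_context.persist(persist_dir=f"./storage/dataset_{i}")

    del docs
    del index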

How to Debug It

  1. Check whether the failure happens at indexing or querying

    • If it crashes on VectorStoreIndex.from_documents(), the problem is embeddings/chunking.
    • If it crashes on query_engine.query(...), the problem is prompt size or LLM memory.
  2. Print your effective settings

    • Log chunk size, overlap, top-k, embed batch size, and model name (see the sketch after this list).
    • Most OOM bugs are configuration bugs.
  3. Run with smaller inputs

    • Try one document.
    • Then five documents.
    • Then one query with similarity_top_k=1.
    • The first step that fails tells you where memory spikes.
  4. Watch process memory directly

    • Use htop, Task Manager, Activity Monitor, or nvidia-smi.
    • If RAM climbs during indexing, it’s embeddings or parsing.
    • If VRAM spikes during generation, it’s the LLM context window or model size.
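
For step 2, a quick way to dump the settings that matter most is to read them off the global Settings object. A sketch; attribute availability depends on which embed model and LLM you configured:

from llama_index.core import Settings

# The knobs that most often cause OOM in a dev setup
print("embed model:", getattr(Settings.embed_model, "model_name", Settings.embed_model))
print("embed batch size:", getattr(Settings.embed_model, "embed_batch_size", "unknown"))
print("chunk size:", Settings.chunk_size)
print("chunk overlap:", Settings.chunk_overlap)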

Prevention

  • Keep embedding batches small:
    • Start with embed_batch_size=8 or lower on dev machines.
  • Use conservative chunking and retrieval defaults:
    • chunk_size=512
    • chunk_overlap=50
    • similarity_top_k=3
  • Separate pipelines:
    • Build indexes offline.
    • Run inference in a separate process if possible.
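
A minimal global configuration that applies the defaults above; a sketch assuming the OpenAI embedding integration used earlier:

from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.chunk_size = 512
Settings.chunk_overlap = 50
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    embed_batch_size=8,
)

# similarity_top_k is set per query engine:
# query_engine = index.as_query_engine(similarity_top_k=3)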

If you are seeing this error inside LlamaIndex specifically, treat it as a capacity problem first, not an abstract framework bug. In almost every case I’ve seen, the fix is to reduce batch size, reduce context size, or stop loading everything into memory at once.

