How to Fix 'OOM error during inference' in LangChain (Python)
When you see an 'OOM error during inference' in a LangChain Python app, it usually means the model process ran out of memory while generating a response. In practice, this shows up with local LLMs, large prompts, long chat histories, or too many concurrent requests on the same machine.
The stack trace often points at the model backend rather than LangChain itself. LangChain is usually the layer that assembled the prompt or triggered parallel execution that pushed memory over the edge.
The Most Common Cause
The #1 cause is sending too much context into the model: oversized chat history, long retrieved documents, or both. With ChatOpenAI, Ollama, HuggingFacePipeline, or LlamaCpp, the actual failure often happens after LangChain builds a huge prompt and hands it off to inference.
Here’s the broken pattern I see most often:
| Broken | Fixed |
|---|---|
| Passing full conversation history and all retrieved docs | Trimming history and limiting retrieved chunks |
```python
# Broken: unbounded context growth
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{question}\n\nConversation:\n{history}\n\nDocs:\n{docs}"),
])

chain = prompt | llm

result = chain.invoke({
    "question": "Summarize the claim",
    "history": "\n".join(huge_chat_history),   # grows forever
    "docs": "\n\n".join(all_retrieved_docs),   # too many chunks
})
```
```python
# Fixed: cap history and retrieved context
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import trim_messages

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{question}\n\nConversation:\n{history}\n\nDocs:\n{docs}"),
])

def trim_docs(docs, max_docs=4):
    # Keep only the top-ranked chunks instead of everything the retriever returned
    return docs[:max_docs]

# huge_chat_history must be a list of chat messages (BaseMessage) for trim_messages
trimmed_history = trim_messages(
    huge_chat_history,
    max_tokens=2000,     # hard token budget for history
    strategy="last",     # keep the most recent turns
    token_counter=llm,
)

result = (prompt | llm).invoke({
    "question": "Summarize the claim",
    "history": "\n".join(m.content for m in trimmed_history),
    "docs": "\n\n".join(trim_docs(all_retrieved_docs)),
})
```
If you’re using retrieval, don’t dump every chunk into the prompt. Use fewer chunks, smaller chunk sizes, and a hard token budget.
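As a hedged sketch of that idea (the `vectorstore` object, the `k` value, and the `truncate_to_budget` helper are illustrative, not LangChain APIs beyond `as_retriever`), you can enforce both limits at retrieval time:

```python
# Sketch: cap the number of chunks and the total token budget for retrieved context
from tiktoken import get_encoding

enc = get_encoding("cl100k_base")
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # fewer chunks

def truncate_to_budget(docs, max_tokens=3000):
    # Hypothetical helper: stop adding chunks once the token budget is spent
    kept, used = [], 0
    for doc in docs:
        tokens = len(enc.encode(doc.page_content))
        if used + tokens > max_tokens:
            break
        kept.append(doc.page_content)
        used += tokens
    return kept

docs_text = "\n\n".join(truncate_to_budget(retriever.invoke("Summarize the claim")))
```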
Other Possible Causes
1) Model too large for available VRAM/RAM
This is common with local inference backends like LlamaCpp or Hugging Face pipelines. The error may look like:
- `RuntimeError: CUDA out of memory`
- `torch.cuda.OutOfMemoryError`
- `llama.cpp: failed to allocate buffer`
```python
# Problematic: model doesn't fit on your GPU
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="/models/large-model.gguf",
    n_gpu_layers=-1,   # offload every layer to the GPU
)
```

```python
# Better: use a quantized model and fewer GPU layers
llm = LlamaCpp(
    model_path="/models/7b-q4.gguf",
    n_gpu_layers=20,   # keep the remaining layers on CPU RAM
)
```
2) Batch size or concurrency is too high
LangChain can fan out calls through parallel map/retrieval chains. If multiple generations happen at once, memory spikes fast.
```python
# Problematic: too much parallelism
results = chain.batch(inputs, config={"max_concurrency": 16})
```

```python
# Better: reduce concurrency
results = chain.batch(inputs, config={"max_concurrency": 2})
```
If you’re calling an embedding model in parallel before generation, the same issue applies there too.
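For embeddings, a hedged sketch (assuming `texts` is a list of strings and `embeddings` is any LangChain embeddings object exposing the standard `embed_documents` method): embed in small slices instead of one giant call.

```python
# Sketch: embed documents in small batches to avoid one large memory spike
def embed_in_batches(embeddings, texts, batch_size=32):
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embeddings.embed_documents(texts[i:i + batch_size]))
    return vectors
```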
3) Prompt inflation from recursive chains
Agents and retrieval chains can accidentally re-inject their own outputs back into context. That creates runaway prompt growth.
```python
# Problematic: feeding prior answer back into docs/history repeatedly
state["context"] += f"\nPrevious answer: {answer}"
```
Fix it by storing only compact state:
state["context"] = {
"summary": summary,
"top_facts": top_facts[:5],
}
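If you need to carry some memory of earlier turns, one hedged option (the summarization prompt is illustrative) is to fold the previous answer into a short summary rather than appending it verbatim:

```python
# Sketch: replace the running context with a short summary instead of growing it
summary = llm.invoke(
    f"Summarize the key facts so far in under 100 words:\n{answer}"
).content

state["context"] = {
    "summary": summary,
    "top_facts": top_facts[:5],
}
```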
4) Inference backend buffer settings are too aggressive
Some backends reserve extra memory for KV cache or preallocation. With local models, this can trigger OOM even when the prompt looks reasonable.
```python
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="/models/7b-q4.gguf",
    n_ctx=8192,   # too large for your box
)
```
Try lowering the context window:
```python
llm = LlamaCpp(
    model_path="/models/7b-q4.gguf",
    n_ctx=2048,
)
```
How to Debug It
- Check whether the failure happens before or during generation. If it fails while formatting prompts or retrieving docs, your context assembly is too large. If it fails inside `model.generate()` or the backend logs show allocator errors, it's raw inference memory. (A combined sketch of these checks follows after this list.)
- Log prompt size before calling the model. Measure tokens, not just characters:

  ```python
  from tiktoken import get_encoding

  enc = get_encoding("cl100k_base")
  token_count = len(enc.encode(prompt_text))
  print("prompt_tokens =", token_count)
  ```

- Turn off parallelism and retry with one request. Set `max_concurrency=1`, remove `.batch()`, and test a single input. If the OOM disappears, concurrency was the trigger.
- Reduce variables one by one. Test in this order:
  - no chat history
  - no retrieved docs
  - smaller model
  - lower context window / batch size

  The first change that fixes it tells you where the memory spike lives.
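To make the first and third checks concrete, here's a hedged sketch that reuses `prompt`, `llm`, and the trimmed inputs from the "Fixed" example above: it builds the prompt value on its own, logs its token count, then runs a single request with no batching.

```python
# Sketch: separate prompt assembly from inference to see which step runs out of memory
from tiktoken import get_encoding

enc = get_encoding("cl100k_base")

# Build the prompt on its own; a failure here points at context assembly
prompt_value = prompt.invoke({
    "question": "Summarize the claim",
    "history": "\n".join(m.content for m in trimmed_history),
    "docs": "\n\n".join(trim_docs(all_retrieved_docs)),
})
print("prompt_tokens =", len(enc.encode(prompt_value.to_string())))

# One request, no .batch(), no parallelism; a failure here is raw inference memory
result = llm.invoke(prompt_value)
```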
Prevention
- Put hard limits on context:
  - cap retrieved chunks
  - trim chat history
  - summarize older turns instead of appending them forever
- Match model size to hardware:
  - use quantized local models on small machines
  - lower `n_ctx`, batch size, and concurrency before production traffic hits
- Add guardrails in code:
  - log token counts per request
  - reject oversized prompts early with a clear error message instead of letting inference crash (see the sketch below)
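A minimal guardrail sketch, assuming the same tiktoken-based counting used earlier (the budget value and the helper name are illustrative):

```python
# Sketch: reject oversized prompts before they ever reach the model
from tiktoken import get_encoding

enc = get_encoding("cl100k_base")
MAX_PROMPT_TOKENS = 6000  # illustrative budget; tune to your model and hardware

def check_prompt_budget(prompt_text: str) -> None:
    tokens = len(enc.encode(prompt_text))
    print("prompt_tokens =", tokens)
    if tokens > MAX_PROMPT_TOKENS:
        raise ValueError(
            f"Prompt is {tokens} tokens, over the {MAX_PROMPT_TOKENS}-token budget; "
            "trim history or retrieved docs before calling the model."
        )
```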
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit