How to Fix 'OOM error during inference' in LangChain (Python)
When you see an 'OOM error during inference' in a LangChain Python app, it usually means the model process ran out of memory while generating a response. In practice, this shows up with local LLMs, large prompts, long chat histories, or too many concurrent requests on the same machine.
The stack trace often points at the model backend rather than LangChain itself. LangChain is usually the layer that assembled the prompt or triggered parallel execution that pushed memory over the edge.
The Most Common Cause
The #1 cause is sending too much context into the model: oversized chat history, long retrieved documents, or both. With ChatOpenAI, Ollama, HuggingFacePipeline, or LlamaCpp, the actual failure often happens after LangChain builds a huge prompt and hands it off to inference.
Here’s the broken pattern I see most often:
| Broken | Fixed |
|---|---|
| Passing full conversation history and all retrieved docs | Trimming history and limiting retrieved chunks |
```python
# Broken: unbounded context growth
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{question}\n\nConversation:\n{history}\n\nDocs:\n{docs}"),
])

chain = prompt | llm

result = chain.invoke({
    "question": "Summarize the claim",
    "history": "\n".join(huge_chat_history),   # grows forever
    "docs": "\n\n".join(all_retrieved_docs),   # too many chunks
})
```
```python
# Fixed: cap history and retrieved context
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import trim_messages

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{question}\n\nConversation:\n{history}\n\nDocs:\n{docs}"),
])

def trim_docs(docs, max_docs=4):
    # Keep only the top-ranked chunks instead of everything the retriever returned
    return docs[:max_docs]

# huge_chat_history must be a list of chat messages (BaseMessage) for trim_messages
trimmed_history = trim_messages(
    huge_chat_history,
    max_tokens=2000,     # hard token budget for history
    strategy="last",     # keep the most recent turns
    token_counter=llm,
)

result = (prompt | llm).invoke({
    "question": "Summarize the claim",
    "history": "\n".join(m.content for m in trimmed_history),
    "docs": "\n\n".join(trim_docs(all_retrieved_docs)),
})
```
If you’re using retrieval, don’t dump every chunk into the prompt. Use fewer chunks, smaller chunk sizes, and a hard token budget.
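As a hedged sketch of that idea (the `vectorstore` object, the `k` value, and the `truncate_to_budget` helper are illustrative, not LangChain APIs beyond `as_retriever`), you can enforce both limits at retrieval time:

```python
# Sketch: cap the number of chunks and the total token budget for retrieved context
from tiktoken import get_encoding

enc = get_encoding("cl100k_base")
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # fewer chunks

def truncate_to_budget(docs, max_tokens=3000):
    # Hypothetical helper: stop adding chunks once the token budget is spent
    kept, used = [], 0
    for doc in docs:
        tokens = len(enc.encode(doc.page_content))
        if used + tokens > max_tokens:
            break
        kept.append(doc.page_content)
        used += tokens
    return kept

docs_text = "\n\n".join(truncate_to_budget(retriever.invoke("Summarize the claim")))
```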
Other Possible Causes
1) Model too large for available VRAM/RAM
This is common with local inference backends like LlamaCpp or Hugging Face pipelines. The error may look like:
- `RuntimeError: CUDA out of memory`
- `torch.cuda.OutOfMemoryError`
- `llama.cpp: failed to allocate buffer`
```python
# Problematic: model doesn't fit on your GPU
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="/models/large-model.gguf",
    n_gpu_layers=-1,   # offload every layer to the GPU
)
```

```python
# Better: use a quantized model and fewer GPU layers
llm = LlamaCpp(
    model_path="/models/7b-q4.gguf",
    n_gpu_layers=20,   # keep the remaining layers on CPU RAM
)
```
2) Batch size or concurrency is too high
LangChain can fan out calls through parallel map/retrieval chains. If multiple generations happen at once, memory spikes fast.
```python
# Problematic: too much parallelism
results = chain.batch(inputs, config={"max_concurrency": 16})
```

```python
# Better: reduce concurrency
results = chain.batch(inputs, config={"max_concurrency": 2})
```
If you’re calling an embedding model in parallel before generation, the same issue applies there too.
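For embeddings, a hedged sketch (assuming `texts` is a list of strings and `embeddings` is any LangChain embeddings object exposing the standard `embed_documents` method): embed in small slices instead of one giant call.

```python
# Sketch: embed documents in small batches to avoid one large memory spike
def embed_in_batches(embeddings, texts, batch_size=32):
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embeddings.embed_documents(texts[i:i + batch_size]))
    return vectors
```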
3) Prompt inflation from recursive chains
Agents and retrieval chains can accidentally re-inject their own outputs back into context. That creates runaway prompt growth.
```python
# Problematic: feeding prior answer back into docs/history repeatedly
state["context"] += f"\nPrevious answer: {answer}"
```
Fix it by storing only compact state:
state["context"] = {
"summary": summary,
"top_facts": top_facts[:5],
}
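If you need to carry some memory of earlier turns, one hedged option (the summarization prompt is illustrative) is to fold the previous answer into a short summary rather than appending it verbatim:

```python
# Sketch: replace the running context with a short summary instead of growing it
summary = llm.invoke(
    f"Summarize the key facts so far in under 100 words:\n{answer}"
).content

state["context"] = {
    "summary": summary,
    "top_facts": top_facts[:5],
}
```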
4) Inference backend buffer settings are too aggressive
Some backends reserve extra memory for KV cache or preallocation. With local models, this can trigger OOM even when the prompt looks reasonable.
```python
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="/models/7b-q4.gguf",
    n_ctx=8192,   # too large for your box
)
```
Try lowering the context window:
```python
llm = LlamaCpp(
    model_path="/models/7b-q4.gguf",
    n_ctx=2048,
)
```
How to Debug It
- Check whether the failure happens before or during generation. If it fails while formatting prompts or retrieving docs, your context assembly is too large. If it fails inside `model.generate()` or the backend logs show allocator errors, it's raw inference memory. (A combined sketch of these checks follows after this list.)
- Log prompt size before calling the model. Measure tokens, not just characters:

  ```python
  from tiktoken import get_encoding

  enc = get_encoding("cl100k_base")
  token_count = len(enc.encode(prompt_text))
  print("prompt_tokens =", token_count)
  ```

- Turn off parallelism and retry with one request. Set `max_concurrency=1`, remove `.batch()`, and test a single input. If the OOM disappears, concurrency was the trigger.
- Reduce variables one by one. Test in this order:
  - no chat history
  - no retrieved docs
  - smaller model
  - lower context window / batch size

  The first change that fixes it tells you where the memory spike lives.
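To make the first and third checks concrete, here's a hedged sketch that reuses `prompt`, `llm`, and the trimmed inputs from the "Fixed" example above: it builds the prompt value on its own, logs its token count, then runs a single request with no batching.

```python
# Sketch: separate prompt assembly from inference to see which step runs out of memory
from tiktoken import get_encoding

enc = get_encoding("cl100k_base")

# Build the prompt on its own; a failure here points at context assembly
prompt_value = prompt.invoke({
    "question": "Summarize the claim",
    "history": "\n".join(m.content for m in trimmed_history),
    "docs": "\n\n".join(trim_docs(all_retrieved_docs)),
})
print("prompt_tokens =", len(enc.encode(prompt_value.to_string())))

# One request, no .batch(), no parallelism; a failure here is raw inference memory
result = llm.invoke(prompt_value)
```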
Prevention
- Put hard limits on context:
  - cap retrieved chunks
  - trim chat history
  - summarize older turns instead of appending them forever
- Match model size to hardware:
  - use quantized local models on small machines
  - lower `n_ctx`, batch size, and concurrency before production traffic hits
- Add guardrails in code:
  - log token counts per request
  - reject oversized prompts early with a clear error message instead of letting inference crash (see the sketch below)
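A minimal guardrail sketch, assuming the same tiktoken-based counting used earlier (the budget value and the helper name are illustrative):

```python
# Sketch: reject oversized prompts before they ever reach the model
from tiktoken import get_encoding

enc = get_encoding("cl100k_base")
MAX_PROMPT_TOKENS = 6000  # illustrative budget; tune to your model and hardware

def check_prompt_budget(prompt_text: str) -> None:
    tokens = len(enc.encode(prompt_text))
    print("prompt_tokens =", tokens)
    if tokens > MAX_PROMPT_TOKENS:
        raise ValueError(
            f"Prompt is {tokens} tokens, over the {MAX_PROMPT_TOKENS}-token budget; "
            "trim history or retrieved docs before calling the model."
        )
```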
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit