How to Fix 'OOM error during inference in production' in LangChain (Python)
An OOM error during inference means your Python process ran out of memory while LangChain was building prompts, loading documents, or calling the model. In production, this usually shows up under real traffic: large context windows, batch jobs, too many concurrent requests, or chains that keep accumulating state.
The fix is almost never “buy a bigger machine” first. It’s usually one of a few bad patterns in how you load data, assemble prompts, or manage concurrency.
The Most Common Cause — unbounded context growth
The #1 cause I see is stuffing too much text into the prompt before calling the LLM. In LangChain, this often happens with RetrievalQA, StuffDocumentsChain, or a custom chain that concatenates every retrieved chunk into one giant input.
A typical failure looks like this:
| Broken pattern | Fixed pattern |
|---|---|
| Load too many docs and stuff them all into one prompt | Limit retrieved docs and summarize/compress before generation |
| No token limit on the chain | Enforce chunking and max context size |
# BROKEN: stuffs every retrieved document into the prompt
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import FAISS

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# vectorstore: an existing FAISS index built during ingestion
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # dangerous for large retrieval sets
    retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
)
result = qa.invoke({"query": "Summarize all policy exceptions"})
# FIXED: reduce context before generation
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",  # or "refine" depending on use case
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)
result = qa.invoke({"query": "Summarize all policy exceptions"})
If you are using newer LCEL-style pipelines, the same rule applies: don’t pass raw document dumps directly into ChatPromptTemplate. Summarize first, then answer.
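Here is a minimal LCEL-style sketch of that idea: cap how much retrieved text reaches the prompt before the model sees it. The format_docs helper, the MAX_CONTEXT_CHARS cap, and the prompt wording are assumptions for illustration, not LangChain APIs; vectorstore is the same index as in the examples above.
# SKETCH: bound retrieved context in an LCEL pipeline (helper name and cap are assumptions)
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

MAX_CONTEXT_CHARS = 8_000  # hypothetical hard cap on retrieved context

def format_docs(docs):
    # join retrieved chunks, then truncate so the prompt stays bounded
    return "\n\n".join(d.page_content for d in docs)[:MAX_CONTEXT_CHARS]

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
answer = chain.invoke("Summarize all policy exceptions")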
Other Possible Causes
1) Loading a huge model in-process
If you are running local inference with transformers inside the same worker as LangChain, the model itself can blow memory before your chain even starts.
# BAD: loads a large model without memory controls
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "meta-llama/Llama-3.1-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
Use quantization, smaller models, or an external inference endpoint.
# BETTER: use a smaller model or quantized loading
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    low_cpu_mem_usage=True,
)
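If the model has to stay in-process, 4-bit quantization is one way to shrink it further. A minimal sketch, assuming the bitsandbytes package is installed and a CUDA GPU is available:
# SKETCH: 4-bit quantized loading (assumes bitsandbytes and a CUDA GPU)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quant_config,
)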
2) Too much concurrency
LangChain itself won’t save you if you fire 100 requests at once from an async worker. Each request may allocate prompt buffers, embeddings payloads, and response objects at the same time.
# BAD: unconstrained fan-out
results = await asyncio.gather(*[
    chain.ainvoke({"input": q}) for q in questions
])
Throttle it.
# BETTER: limit concurrency with a semaphore
import asyncio

sem = asyncio.Semaphore(5)

async def bounded_invoke(q):
    async with sem:
        return await chain.ainvoke({"input": q})

results = await asyncio.gather(*[bounded_invoke(q) for q in questions])
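LangChain's runnables also accept a max_concurrency setting in the run config, which bounds in-flight calls without a hand-rolled semaphore. A minimal sketch, reusing the chain and questions from above:
# SKETCH: built-in concurrency cap via the runnable config
results = await chain.abatch(
    [{"input": q} for q in questions],
    config={"max_concurrency": 5},  # at most 5 invocations run at once
)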
3) Oversized intermediate objects in memory
A common mistake is keeping full PDFs, raw OCR text, embeddings arrays, and chain outputs in Python lists longer than needed. That becomes fatal when your app handles multiple users.
# BAD: keeps everything around
docs = loader.load()
texts = [doc.page_content for doc in docs]
embeddings = embedding_model.embed_documents(texts)
Drop references early and process in chunks.
# BETTER: chunk and stream through the pipeline
for doc in loader.lazy_load():
    chunks = splitter.split_text(doc.page_content)
    for chunk in chunks:
        vectorstore.add_texts([chunk])
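The loader, splitter, and vectorstore above are assumed to already exist. A minimal sketch of one way to set them up (PyPDFLoader needs the pypdf package; the file name is hypothetical), adding each document's chunks in a single batched call:
# SKETCH: assumed setup for the streaming ingestion loop (file name is hypothetical)
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("claims_policies.pdf")
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

for doc in loader.lazy_load():
    chunks = splitter.split_text(doc.page_content)
    # one add_texts call per document keeps memory bounded without per-chunk overhead
    vectorstore.add_texts(chunks)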
4) Returning huge outputs from tools or agents
If your tool returns a massive JSON blob and LangChain injects it back into the next step, memory usage spikes fast. This often happens with agent loops or verbose tool results.
# BAD: tool returns full payload to agent loop
def fetch_claims():
    return claims_api.get_all_claims()
Return only what the model needs.
# BETTER: trim tool output before returning
def fetch_claims():
    claims = claims_api.get_all_claims()
    return [
        {"id": c["id"], "status": c["status"], "amount": c["amount"]}
        for c in claims[:20]
    ]
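When the function runs inside an agent, the same trimming applies at the point where you register it as a tool. A minimal sketch using the @tool decorator from langchain_core; the claims_api client and field names are carried over from the example above as assumptions:
# SKETCH: trimmed output registered as an agent tool
from langchain_core.tools import tool

@tool
def fetch_claims() -> list[dict]:
    """Return a compact summary of the 20 most recent claims."""
    claims = claims_api.get_all_claims()  # assumed external client
    return [
        {"id": c["id"], "status": c["status"], "amount": c["amount"]}
        for c in claims[:20]
    ]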
How to Debug It
- Check where memory spikes
  - Run the service with process metrics.
  - Watch RSS before and after `invoke()`, not just CPU (see the sketch after this list).
  - If memory jumps during prompt assembly, it's usually context bloat.
- Print token estimates
  - Count input tokens before calling the model (also shown in the sketch after this list).
  - If you're near the context window, reduce retrieved docs or summarize them first.
  - With OpenAI-style chat models, large prompts often fail before generation starts.
- Disable concurrency temporarily
  - Set request concurrency to 1.
  - If OOM disappears, you have a fan-out problem.
  - Reintroduce parallelism with a semaphore or queue limits.
- Isolate each stage
  - Test loader, splitter, retriever, prompt builder, and LLM call separately.
  - The failure may be in document ingestion rather than inference.
  - A good split is:
    - load documents only
    - build embeddings only
    - run retrieval only
    - invoke LLM only
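The first two checks are easy to script. A minimal sketch, assuming psutil and a recent tiktoken are installed, a chain built as in the examples above, and a hypothetical build_prompt helper standing in for however you assemble the final prompt:
# SKETCH: watch RSS around an invoke and estimate prompt tokens (psutil/tiktoken assumed installed)
import psutil
import tiktoken

def rss_mb():
    # resident set size of the current process, in megabytes
    return psutil.Process().memory_info().rss / 1e6

prompt_text = build_prompt(question)  # hypothetical helper for final prompt assembly
enc = tiktoken.encoding_for_model("gpt-4o-mini")
print("input tokens:", len(enc.encode(prompt_text)))

before = rss_mb()
result = chain.invoke({"input": question})
print(f"RSS before={before:.0f} MB, after={rss_mb():.0f} MB")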
Prevention
- Use smaller retrieval sets by default:
  - `search_kwargs={"k": 3}` is a sane starting point.
  - Increase only when you have measured token cost and latency.
- Prefer map-reduce or refine chains for large corpora:
  - Avoid `chain_type="stuff"` unless documents are already small.
  - Summarize before final answer generation.
- Put hard limits around request size and concurrency:
  - Cap input characters per request.
  - Cap concurrent invocations per worker.
  - Reject oversized uploads early instead of trying to process them downstream (see the request-guard sketch after this list).
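A request guard at the edge is cheap insurance. A minimal sketch; MAX_INPUT_CHARS and MAX_IN_FLIGHT are hypothetical limits to tune for your workload, and chain is the runnable from the examples above:
# SKETCH: hard limits on request size and per-worker concurrency (limits are hypothetical)
import asyncio

MAX_INPUT_CHARS = 20_000  # cap on raw input size per request
MAX_IN_FLIGHT = 5         # cap on concurrent invocations per worker

_slots = asyncio.Semaphore(MAX_IN_FLIGHT)

async def guarded_invoke(user_input: str):
    if len(user_input) > MAX_INPUT_CHARS:
        raise ValueError("input too large; reject before it reaches the chain")
    async with _slots:
        return await chain.ainvoke({"input": user_input})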
If you're seeing `CUDA out of memory`, `OOMKilled`, or plain Python process termination during LangChain inference, check prompt size and parallelism first. In production systems I've seen those two account for most failures long before model choice becomes the issue.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.