# How to Fix 'OOM error during inference when scaling' in LangChain (Python)
When you see an 'OOM error during inference when scaling' in a LangChain Python app, it usually means your process ran out of memory while the model was being called, not while the chain was being built. In practice, this shows up when traffic increases, prompts get larger, or you accidentally keep too many model objects and intermediate outputs alive.
The failure often appears with local models, GPU-backed inference, or batch-heavy chains. You may also see related runtime errors like `CUDA out of memory`, `MemoryError`, or a worker getting killed by the OS after LangChain starts fanning out requests.
## The Most Common Cause
The #1 cause is uncontrolled concurrency. A chain that works for one request can blow up under load when multiple invocations hit the same model instance at once, especially with large context windows or local LLMs.
The broken pattern is usually a `RunnableParallel`, a `batch()` call, an async fan-out, or a web endpoint that lets too many requests through at once.
| Broken pattern | Fixed pattern |
|---|---|
| Fire off many generations at once | Limit concurrency and batch size |
| Reuse a heavy local model without backpressure | Serialize or throttle inference |
| Keep full chat history in every request | Trim memory before calling the LLM |
```python
# Broken: unbounded parallel inference
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.llms import Ollama

llm = Ollama(model="llama3")
prompt = ChatPromptTemplate.from_template("Summarize this text:\n{text}")
chain = prompt | llm

texts = [{"text": t} for t in huge_text_list]

# This can trigger:
#   RuntimeError: CUDA out of memory
# or the process gets killed under load
results = chain.batch(texts)
```
```python
# Fixed: cap concurrency and reduce payload size
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.llms import Ollama

llm = Ollama(model="llama3")
prompt = ChatPromptTemplate.from_template("Summarize this text:\n{text}")
chain = prompt | llm

texts = [{"text": t[:4000]} for t in huge_text_list]  # trim input early

results = chain.batch(
    texts,
    config={"max_concurrency": 2},  # backpressure
)
```
If you are using FastAPI, Celery, or any worker pool, the same rule applies. Two workers with a 7B model may be fine; twenty concurrent requests on the same box will not be.
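If the gate has to live at the web layer, a small semaphore is often enough. Here is a minimal sketch, assuming the `chain` built above; `MAX_INFLIGHT` and the `/summarize` route are illustrative, not a LangChain or FastAPI convention:

```python
# Sketch: throttle inference at the FastAPI endpoint itself
import asyncio
from fastapi import FastAPI

app = FastAPI()
MAX_INFLIGHT = 2  # illustrative; tune per model size and hardware
inference_slots = asyncio.Semaphore(MAX_INFLIGHT)

@app.post("/summarize")
async def summarize(payload: dict):
    # Excess requests wait here instead of piling onto the model at once
    async with inference_slots:
        return await chain.ainvoke({"text": payload["text"][:4000]})
```

The same idea applies to Celery: cap worker concurrency instead of letting the queue flood a single box.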
## Other Possible Causes
### 1) Prompt growth from chat history
LangChain memory can quietly accumulate token load across turns. If you keep appending full conversation history into every call, each request gets more expensive than the last.
```python
# Problematic: unbounded conversation growth
history = "\n".join(messages)

response = chain.invoke({
    "question": user_input,
    "history": history,
})
```
```python
# Better: trim history before passing it to the chain
from langchain_core.messages import trim_messages

# `messages` is a list of LangChain message objects here;
# token_counter is required -- passing the model uses its tokenizer
trimmed_messages = trim_messages(
    messages,
    max_tokens=2000,
    strategy="last",
    token_counter=llm,
)

response = chain.invoke({
    "question": user_input,
    "history": trimmed_messages,
})
```
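A note on `token_counter`: passing the model object makes LangChain count with the model's tokenizer (falling back to an approximation for some local models), while passing `len` counts each message as one token, so `max_tokens` effectively becomes a message count.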
### 2) Loading the model inside the request path
If you instantiate the model on every request, memory fragments fast and peak usage spikes. This is common with local embeddings or LLM wrappers inside Flask/FastAPI endpoints.
```python
# Bad: reloads the model on every request
@app.post("/infer")
def infer(payload: dict):
    llm = HuggingFacePipeline.from_model_id(...)
    return llm.invoke(payload["text"])
```
```python
# Better: create once at startup and reuse
llm = HuggingFacePipeline.from_model_id(...)

@app.post("/infer")
def infer(payload: dict):
    return llm.invoke(payload["text"])
```
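If you prefer an explicit startup hook over module-level globals, FastAPI's lifespan works too. A minimal sketch, assuming an illustrative `model_id` and `task` (substitute your own):

```python
# Sketch: load the model once per worker via FastAPI's lifespan hook
from contextlib import asynccontextmanager

from fastapi import FastAPI
from langchain_community.llms import HuggingFacePipeline

models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # model_id/task are illustrative placeholders
    models["llm"] = HuggingFacePipeline.from_model_id(
        model_id="gpt2", task="text-generation"
    )
    yield
    models.clear()  # drop the reference on shutdown

app = FastAPI(lifespan=lifespan)

@app.post("/infer")
def infer(payload: dict):
    return models["llm"].invoke(payload["text"])
```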
### 3) Returning giant intermediate outputs from chains
Some chains return documents, tool traces, or verbose reasoning artifacts that stay in memory longer than needed. If you are using `return_intermediate_steps=True`, check whether you actually need that data.
```python
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    return_intermediate_steps=True,  # expensive if unused
)
```
If you only need the final answer:
```python
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    return_intermediate_steps=False,
)
```
### 4) Oversized embeddings or vector store ingestion batches
OOM can happen before inference if your pipeline embeds too many documents at once. Large ingestion batches create big transient arrays and spike RAM.
```python
# Risky: huge embedding batch
vectors = embeddings.embed_documents([doc.page_content for doc in docs])
```

```python
# Safer: chunk ingestion manually and accumulate results
vectors = []
for i in range(0, len(docs), 32):
    batch = docs[i:i + 32]
    vectors.extend(embeddings.embed_documents([doc.page_content for doc in batch]))
```
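Many embedding integrations also expose their own batching knob (for example, `chunk_size` on `OpenAIEmbeddings`), which is worth setting alongside manual chunking.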
## How to Debug It

1. Check whether it is GPU OOM or system RAM OOM.
   - GPU errors often look like `RuntimeError: CUDA out of memory` or `torch.cuda.OutOfMemoryError`.
   - System RAM issues often look like `MemoryError` or a worker killed with exit code `137`.
2. Measure prompt size and output size (see the sketch after this list).
   - Log token counts before invocation.
   - If using chat history, print message length per request.
   - Look for requests that are much larger than normal.
3. Reduce concurrency to 1.
   - Set `max_concurrency=1`.
   - If the error disappears, your issue is load amplification, not a single bad prompt.
4. Disable extras one by one.
   - Turn off intermediate steps, verbose tracing, tool calls, and memory persistence, then rerun the same input.
   - The component that reintroduces the crash is usually your culprit.
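For step 2, a quick way to log prompt size is `get_num_tokens`, which LangChain model wrappers expose (it may fall back to an approximate tokenizer for local models). A minimal sketch, assuming the `prompt` and `llm` from the earlier examples:

```python
# Sketch: log prompt size before invoking the chain
some_input = "example input text"  # stand-in for a real request payload
rendered = prompt.format(text=some_input)
print(f"prompt: {len(rendered)} chars, ~{llm.get_num_tokens(rendered)} tokens")
```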
## Prevention

- Keep one model instance per process and initialize it at startup.
- Put hard limits on prompt size, chat history length, and batch concurrency (a guardrail sketch follows this list).
- Test with production-like load early; small local tests hide memory pressure.
- Prefer smaller or quantized models when running inference on limited hardware.
- In LangChain chains and agents, only return what downstream code actually needs.
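As a concrete guardrail, clamping inputs before they ever reach the chain is cheap. A minimal sketch using `chain`, `user_input`, and `messages` from the earlier examples; the limits and the `clamp_inputs` helper are illustrative, not a LangChain API:

```python
# Hypothetical guardrail: enforce hard input limits before any model call
MAX_INPUT_CHARS = 4000       # illustrative limit
MAX_HISTORY_MESSAGES = 20    # illustrative limit

def clamp_inputs(question: str, messages: list) -> dict:
    return {
        "question": question[:MAX_INPUT_CHARS],
        "history": messages[-MAX_HISTORY_MESSAGES:],  # keep only recent turns
    }

response = chain.invoke(clamp_inputs(user_input, messages))
```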
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies

By Cyprian Aarons, AI Consultant at Topiax.