# How to Fix 'OOM error during inference when scaling' in LangChain (Python)
When you see an 'OOM error during inference when scaling' in a LangChain Python app, it usually means your process ran out of memory while the model was being called, not while the chain was being built. In practice, this shows up when traffic increases, prompts get larger, or you accidentally keep too many model objects and intermediate outputs alive.
The failure often appears with local models, GPU-backed inference, or batch-heavy chains. You may also see related runtime errors like `CUDA out of memory`, `MemoryError`, or a worker getting killed by the OS after LangChain starts fanning out requests.
## The Most Common Cause
The #1 cause is uncontrolled concurrency. A chain that works for one request can blow up under load when multiple invocations hit the same model instance at once, especially with large context windows or local LLMs.
The broken pattern is usually a `RunnableParallel`, a `batch()` call, an async fan-out, or a web endpoint that lets too many requests through at once.
| Broken pattern | Fixed pattern |
|---|---|
| Fire off many generations at once | Limit concurrency and batch size |
| Reuse a heavy local model without backpressure | Serialize or throttle inference |
| Keep full chat history in every request | Trim memory before calling the LLM |
```python
# Broken: unbounded parallel inference
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.llms import Ollama

llm = Ollama(model="llama3")
prompt = ChatPromptTemplate.from_template("Summarize this text:\n{text}")
chain = prompt | llm

texts = [{"text": t} for t in huge_text_list]

# This can trigger:
#   RuntimeError: CUDA out of memory
# or the process gets killed under load
results = chain.batch(texts)
```
```python
# Fixed: cap concurrency and reduce payload size
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.llms import Ollama

llm = Ollama(model="llama3")
prompt = ChatPromptTemplate.from_template("Summarize this text:\n{text}")
chain = prompt | llm

texts = [{"text": t[:4000]} for t in huge_text_list]  # trim input early

results = chain.batch(
    texts,
    config={"max_concurrency": 2},  # backpressure
)
```
If you are using FastAPI, Celery, or any worker pool, the same rule applies. Two workers with a 7B model may be fine; twenty concurrent requests on the same box will not be.
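If the gate has to live at the web layer, a small semaphore is often enough. Here is a minimal sketch, assuming the `chain` built above; `MAX_INFLIGHT` and the `/summarize` route are illustrative, not a LangChain or FastAPI convention:

```python
# Sketch: throttle inference at the FastAPI endpoint itself
import asyncio
from fastapi import FastAPI

app = FastAPI()
MAX_INFLIGHT = 2  # illustrative; tune per model size and hardware
inference_slots = asyncio.Semaphore(MAX_INFLIGHT)

@app.post("/summarize")
async def summarize(payload: dict):
    # Excess requests wait here instead of piling onto the model at once
    async with inference_slots:
        return await chain.ainvoke({"text": payload["text"][:4000]})
```

The same idea applies to Celery: cap worker concurrency instead of letting the queue flood a single box.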
## Other Possible Causes
### 1) Prompt growth from chat history
LangChain memory can quietly accumulate token load across turns. If you keep appending full conversation history into every call, each request gets more expensive than the last.
```python
# Problematic: unbounded conversation growth
history = "\n".join(messages)

response = chain.invoke({
    "question": user_input,
    "history": history,
})
```
```python
# Better: trim history before passing it to the chain
from langchain_core.messages import trim_messages

# `messages` is a list of LangChain message objects here;
# token_counter is required -- passing the model uses its tokenizer
trimmed_messages = trim_messages(
    messages,
    max_tokens=2000,
    strategy="last",
    token_counter=llm,
)

response = chain.invoke({
    "question": user_input,
    "history": trimmed_messages,
})
```
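A note on `token_counter`: passing the model object makes LangChain count with the model's tokenizer (falling back to an approximation for some local models), while passing `len` counts each message as one token, so `max_tokens` effectively becomes a message count.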
### 2) Loading the model inside the request path
If you instantiate the model on every request, memory fragments fast and peak usage spikes. This is common with local embeddings or LLM wrappers inside Flask/FastAPI endpoints.
```python
# Bad: reloads the model on every request
@app.post("/infer")
def infer(payload: dict):
    llm = HuggingFacePipeline.from_model_id(...)
    return llm.invoke(payload["text"])
```
```python
# Better: create once at startup and reuse
llm = HuggingFacePipeline.from_model_id(...)

@app.post("/infer")
def infer(payload: dict):
    return llm.invoke(payload["text"])
```
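If you prefer an explicit startup hook over module-level globals, FastAPI's lifespan works too. A minimal sketch, assuming an illustrative `model_id` and `task` (substitute your own):

```python
# Sketch: load the model once per worker via FastAPI's lifespan hook
from contextlib import asynccontextmanager

from fastapi import FastAPI
from langchain_community.llms import HuggingFacePipeline

models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # model_id/task are illustrative placeholders
    models["llm"] = HuggingFacePipeline.from_model_id(
        model_id="gpt2", task="text-generation"
    )
    yield
    models.clear()  # drop the reference on shutdown

app = FastAPI(lifespan=lifespan)

@app.post("/infer")
def infer(payload: dict):
    return models["llm"].invoke(payload["text"])
```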
### 3) Returning giant intermediate outputs from chains
Some chains return documents, tool traces, or verbose reasoning artifacts that stay in memory longer than needed. If you are using `return_intermediate_steps=True`, check whether you actually need that data.
```python
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    return_intermediate_steps=True,  # expensive if unused
)
```
If you only need the final answer:
```python
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    return_intermediate_steps=False,
)
```
### 4) Oversized embeddings or vector store ingestion batches
OOM can happen before inference if your pipeline embeds too many documents at once. Large ingestion batches create big transient arrays and spike RAM.
```python
# Risky: huge embedding batch
vectors = embeddings.embed_documents([doc.page_content for doc in docs])
```

```python
# Safer: chunk ingestion manually and accumulate results
vectors = []
for i in range(0, len(docs), 32):
    batch = docs[i:i + 32]
    vectors.extend(embeddings.embed_documents([doc.page_content for doc in batch]))
```
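Many embedding integrations also expose their own batching knob (for example, `chunk_size` on `OpenAIEmbeddings`), which is worth setting alongside manual chunking.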
## How to Debug It

1. Check whether it is GPU OOM or system RAM OOM.
   - GPU errors often look like `RuntimeError: CUDA out of memory` or `torch.cuda.OutOfMemoryError`.
   - System RAM issues often look like `MemoryError` or a worker killed with exit code `137`.
2. Measure prompt size and output size (see the sketch after this list).
   - Log token counts before invocation.
   - If using chat history, print message length per request.
   - Look for requests that are much larger than normal.
3. Reduce concurrency to 1.
   - Set `max_concurrency=1`.
   - If the error disappears, your issue is load amplification, not a single bad prompt.
4. Disable extras one by one.
   - Turn off intermediate steps, verbose tracing, tool calls, and memory persistence, then rerun the same input.
   - The component that reintroduces the crash is usually your culprit.
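For step 2, a quick way to log prompt size is `get_num_tokens`, which LangChain model wrappers expose (it may fall back to an approximate tokenizer for local models). A minimal sketch, assuming the `prompt` and `llm` from the earlier examples:

```python
# Sketch: log prompt size before invoking the chain
some_input = "example input text"  # stand-in for a real request payload
rendered = prompt.format(text=some_input)
print(f"prompt: {len(rendered)} chars, ~{llm.get_num_tokens(rendered)} tokens")
```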
## Prevention

- Keep one model instance per process and initialize it at startup.
- Put hard limits on prompt size, chat history length, and batch concurrency (a guardrail sketch follows this list).
- Test with production-like load early; small local tests hide memory pressure.
- Prefer smaller or quantized models when running inference on limited hardware.
- In LangChain chains and agents, only return what downstream code actually needs.
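As a concrete guardrail, clamping inputs before they ever reach the chain is cheap. A minimal sketch using `chain`, `user_input`, and `messages` from the earlier examples; the limits and the `clamp_inputs` helper are illustrative, not a LangChain API:

```python
# Hypothetical guardrail: enforce hard input limits before any model call
MAX_INPUT_CHARS = 4000       # illustrative limit
MAX_HISTORY_MESSAGES = 20    # illustrative limit

def clamp_inputs(question: str, messages: list) -> dict:
    return {
        "question": question[:MAX_INPUT_CHARS],
        "history": messages[-MAX_HISTORY_MESSAGES:],  # keep only recent turns
    }

response = chain.invoke(clamp_inputs(user_input, messages))
```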
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies

By Cyprian Aarons, AI Consultant at Topiax.