# How to Fix 'OOM error during inference in production' in LangGraph (Python)
When you hit an OOM (out-of-memory) error during inference in production with LangGraph, it usually means your process is exhausting RAM or GPU memory while the graph is running. In practice, this shows up under load, with long conversation state, large tool outputs, or when you accidentally keep multiple model contexts alive at once.
The fix is rarely “buy more memory.” It’s usually a graph/state design problem, a model invocation pattern problem, or both.
## The Most Common Cause
The #1 cause is unbounded state growth inside the graph. In LangGraph, every node can append to state, and if you keep full message history, raw tool payloads, retrieved documents, and intermediate LLM outputs in one MessagesState, memory usage climbs fast.
A common mistake is to pass the entire growing conversation back into the model on every node execution.
| Broken pattern | Fixed pattern |
|---|---|
| Rebuilds prompts from full history every turn | Trims/summarizes state before inference |
| Stores raw tool output in messages | Stores only compact summaries/IDs |
| Keeps all intermediate nodes in memory | Persists only what downstream nodes need |
```python
# BROKEN
from langgraph.graph import StateGraph, MessagesState
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def agent_node(state: MessagesState):
    # state["messages"] keeps growing forever, and the full
    # transcript is fed into the model on every call
    response = llm.invoke(state["messages"])
    return {"messages": state["messages"] + [response]}

graph = StateGraph(MessagesState)
graph.add_node("agent", agent_node)
```
```python
# FIXED
from langgraph.graph import StateGraph, MessagesState
from langchain_openai import ChatOpenAI
from langchain_core.messages import trim_messages  # not langgraph.prebuilt

llm = ChatOpenAI(model="gpt-4o-mini")

def agent_node(state: MessagesState):
    # Bound the context before every inference call
    trimmed = trim_messages(
        state["messages"],
        max_tokens=4000,
        strategy="last",
        token_counter=llm.get_num_tokens_from_messages,
    )
    response = llm.invoke(trimmed)
    # MessagesState appends for us; return only the new message
    return {"messages": [response]}

graph = StateGraph(MessagesState)
graph.add_node("agent", agent_node)
```
The key change is that the model sees a bounded context window. You still keep the stateful workflow benefits of LangGraph, but you stop feeding the entire transcript into every inference call.
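If trimming alone drops too much context, a common alternative is a summarizer step that folds older turns into a running summary. Here is a minimal pure-Python sketch of the bookkeeping only; `compact_history` and its `summarize` callable are illustrative stand-ins, not LangGraph APIs (in a real node, `summarize` would be an LLM call).

```python
def compact_history(messages, keep_last=4, summarize=lambda msgs: "summary"):
    """Fold everything but the newest turns into one summary message.

    `messages` is a list of (role, text) tuples; `summarize` stands in
    for an LLM summarization call over the older turns.
    """
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = summarize(older)
    # The model now sees one summary message plus the recent turns
    return [("system", f"Conversation so far: {summary}")] + recent

history = [("user", f"turn {i}") for i in range(10)]
compacted = compact_history(history)
```

The memory win is the same as trimming: the prompt stays bounded no matter how long the conversation runs, but the summary preserves a compressed trace of the dropped turns.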
## Other Possible Causes

### 1. Tool outputs are too large
If a tool returns a huge JSON blob, an HTML page, a PDF text dump, or a stringified dataframe straight into messages, you can blow up memory instantly.
```python
# BAD: storing the raw payload in graph state
return {"messages": [tool_result]}
```
Use compact storage instead:
```python
# GOOD: store a compact summary and externalize the raw artifact
return {
    "messages": [f"Tool completed. result_id={result_id}"],
    "artifacts": {"result_id": result_id},
}
```
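The artifact store itself can be anything with put/get semantics: object storage, a database, or a cache. The sketch below uses an in-memory dict purely for illustration; `store_artifact` and `tool_node_update` are hypothetical helper names, not LangGraph or LangChain APIs.

```python
import uuid

ARTIFACTS = {}  # stand-in for S3, a blob store, or a database

def store_artifact(payload: str) -> str:
    """Persist the raw payload outside graph state; return a small handle."""
    result_id = str(uuid.uuid4())
    ARTIFACTS[result_id] = payload
    return result_id

def tool_node_update(raw_output: str) -> dict:
    """Build the state update a tool node would return."""
    result_id = store_artifact(raw_output)
    # Only this tiny reference travels through (and is checkpointed with) state
    return {
        "messages": [f"Tool completed. result_id={result_id}"],
        "artifacts": {"result_id": result_id},
    }
```

Downstream nodes that genuinely need the raw payload can fetch it by ID; everything else pays only for a 36-character handle.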
### 2. You are running too many model workers or parallel branches
If your graph fans out with Send or parallel edges and each branch calls a large model concurrently, peak memory spikes fast. This gets worse if each worker loads its own weights or client buffers.
```python
# Example: parallel fan-out can multiply peak memory usage
for item in items:
    graph.add_edge("router", f"worker_{item}")
```
Cap concurrency at the application layer and avoid unnecessary fan-out for large models:

```python
# Pass max_concurrency in the run config to bound parallel node execution
config = {"max_concurrency": 2}
app.invoke(inputs, config=config)
```
### 3. Your checkpointing/persistence layer is retaining too much
A bad checkpointer setup can keep every intermediate state snapshot forever. That’s useful for debugging; it’s not useful when snapshots contain giant message arrays.
```python
# Be careful with long-lived checkpoints containing huge states
app = graph.compile(checkpointer=checkpointer)
```
Fix this by checkpointing smaller states and pruning old runs. If you need auditability, persist references instead of full payloads.
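One way to keep snapshots small is a dedicated prune step that slims the state before the checkpointer sees it. The sketch below is plain Python to show the idea; `prune_state` is a hypothetical helper (in an actual graph you would implement this as a node, e.g. by emitting `RemoveMessage` updates for stale messages).

```python
def prune_state(state: dict, keep_messages: int = 20) -> dict:
    """Return a slimmed copy of state suitable for checkpointing."""
    slim = dict(state)
    # Keep only the newest messages
    slim["messages"] = state["messages"][-keep_messages:]
    # Persist references to artifacts, never the payloads themselves
    slim.pop("raw_payloads", None)
    return slim

state = {"messages": list(range(100)), "raw_payloads": {"a": "x" * 10_000}}
checkpoint = prune_state(state)
```

The checkpoint then grows with the retention window, not with the lifetime of the thread.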
### 4. The model itself is too large for your runtime
Sometimes the problem is not LangGraph at all. If you run a large local model through vLLM, Transformers, or llama.cpp inside the same service as your graph orchestration, Python memory plus model weights plus tokenizer buffers can exceed available RAM/GPU.
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "big-model",
    torch_dtype="auto",
    device_map="auto",
)
```
If this runs inside the same container as LangGraph orchestration, split them into separate services or use a smaller model.
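A common way to do the split is to serve the model behind an OpenAI-compatible endpoint (vLLM ships one) and keep the LangGraph service as a thin client. A configuration sketch; the model name and URL are placeholders for your own deployment.

```python
from langchain_openai import ChatOpenAI

# The graph service holds only an HTTP client; the weights live in the
# separate inference service (e.g. vLLM's OpenAI-compatible server).
llm = ChatOpenAI(
    model="big-model",                    # placeholder model name
    base_url="http://inference:8000/v1",  # placeholder service URL
    api_key="not-needed-for-local-vllm",
)
```

Now the orchestration container's memory footprint is just Python plus graph state, and the inference service can be sized and scaled independently.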
## How to Debug It
- **Check whether memory grows per turn.**
  - Log `len(state["messages"])`, prompt token count, and process RSS.
  - If each request increases memory steadily, you have unbounded state growth.
- **Inspect the exact payload going into `llm.invoke()`.**
  - Print message sizes before inference.
  - Look for giant tool outputs, retrieved docs, or repeated system prompts.
- **Measure peak concurrency.**
  - Count active graph runs and parallel branches.
  - If the OOM happens only under load, reduce `max_concurrency` and test again.
- **Separate graph memory from model memory.**
  - Run the same graph against a small hosted model.
  - If the OOM disappears, your local model deployment is the real bottleneck.
Useful signals to log:
```python
import os
import psutil

proc = psutil.Process(os.getpid())
print("rss_mb=", proc.memory_info().rss / 1024 / 1024)
# Inside a node, also log how big the state has grown:
print("message_count=", len(state["messages"]))
```
If you’re on GPU-backed inference:
```bash
# Refresh every second to watch VRAM across requests
nvidia-smi -l 1
```
Watch for VRAM climbing across requests instead of returning to baseline.
## Prevention
- **Keep LangGraph state small.**
  - Store summaries, IDs, and references.
  - Do not store raw documents or full tool responses in messages unless absolutely necessary.
- **Trim before inference.**
  - Use `trim_messages()` or your own summarizer node before any expensive LLM call.
- **Put hard limits on concurrency and payload size.**
  - Cap branch fan-out.
  - Reject oversized tool outputs early.
- **Separate orchestration from heavyweight local model serving when possible.**
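The "reject oversized tool outputs early" rule can be a one-line guard at the tool boundary. A sketch, assuming a 64 KB cap (`MAX_TOOL_OUTPUT_BYTES` and `guard_tool_output` are illustrative names; tune the limit per deployment):

```python
MAX_TOOL_OUTPUT_BYTES = 64 * 1024  # assumed cap; tune per deployment

def guard_tool_output(output: str) -> str:
    """Truncate oversized tool output before it enters graph state."""
    encoded = output.encode("utf-8")
    if len(encoded) <= MAX_TOOL_OUTPUT_BYTES:
        return output
    # Decode with errors="ignore" so a cut mid-character stays valid
    truncated = encoded[:MAX_TOOL_OUTPUT_BYTES].decode("utf-8", errors="ignore")
    return truncated + "\n[output truncated]"

small = guard_tool_output("ok")
big = guard_tool_output("x" * 200_000)
```

Wrap every tool with this before its result is written into messages, and a single runaway scrape or query can no longer take the process down.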
If you want one rule to remember: LangGraph should orchestrate work, not accumulate everything forever. Once state becomes an append-only dump of every prompt and tool result, OOM is just a matter of time.
## Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit