How to Fix 'OOM error during inference' in LangGraph (Python)
What the error means
An 'OOM error during inference' means your process ran out of memory while a model was generating tokens or while LangGraph was holding state between nodes. In practice, this usually shows up when you pass too much conversation history, keep large objects in graph state, or run a model with a context window that is too small for the payload.
In LangGraph, this often happens after a few iterations, not on the first call. The graph keeps accumulating messages or intermediate outputs until the next LLM call blows up with a CUDA OOM, CPU memory spike, or provider-side context overflow.
The Most Common Cause
The #1 cause is unbounded state growth: every node appends full message history back into the graph state, and each inference call gets larger than the last.
Here’s the broken pattern next to the fix:
| Broken | Fixed |
|---|---|
| Stores every message forever | Trims or summarizes state before inference |
| Passes full state into every node | Passes only what the model needs |
| Reuses large tool outputs as-is | Extracts minimal fields |
```python
# BROKEN: unbounded message accumulation in LangGraph
from typing import TypedDict, Annotated

from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI

class State(TypedDict):
    messages: Annotated[list, add_messages]

llm = ChatOpenAI(model="gpt-4o-mini")

def chat_node(state: State):
    # Every turn sends the entire growing history
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

graph = StateGraph(State)
graph.add_node("chat", chat_node)
graph.set_entry_point("chat")
graph.add_edge("chat", END)
app = graph.compile()
```
```python
# FIXED: trim before calling the model
from typing import TypedDict, Annotated

from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langchain_core.messages import SystemMessage
from langchain_openai import ChatOpenAI

class State(TypedDict):
    messages: Annotated[list, add_messages]

llm = ChatOpenAI(model="gpt-4o-mini")

def chat_node(state: State):
    recent_messages = state["messages"][-8:]  # keep only recent turns
    response = llm.invoke(
        [SystemMessage(content="You are a helpful assistant."), *recent_messages]
    )
    return {"messages": [response]}

graph = StateGraph(State)
graph.add_node("chat", chat_node)
graph.set_entry_point("chat")
graph.add_edge("chat", END)
app = graph.compile()
```
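Note that slicing only shrinks the prompt; if you run with a checkpointer, the stored state itself keeps growing. If you also want to delete old turns from state, the `add_messages` reducer understands `RemoveMessage`. A minimal sketch of a dedicated trim node:

```python
from langchain_core.messages import RemoveMessage

def trim_node(state: State):
    # add_messages treats RemoveMessage(id=...) as a deletion, so this
    # drops everything except the last 8 messages from graph state itself.
    return {"messages": [RemoveMessage(id=m.id) for m in state["messages"][:-8]]}
```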
If you’re using `MessagesState`, the same rule applies: don’t keep feeding the full transcript back into inference unless you actually need it.
A common real-world symptom looks like this:

- `RuntimeError: CUDA out of memory. Tried to allocate ...`
- `torch.cuda.OutOfMemoryError`
- `ValueError: Requested tokens exceed context length`
- Provider errors like `400 Bad Request: maximum context length exceeded`
Other Possible Causes
1) Tool outputs are too large
If a tool returns raw PDFs, HTML pages, logs, or database dumps, you may be stuffing megabytes into graph state.
```python
# BAD: returning raw tool output
def fetch_docs(query: str):
    return {"docs": huge_pdf_text}

# GOOD: return compact summaries or extracted fields
def fetch_docs(query: str):
    text = huge_pdf_text[:5000]
    return {"docs": text}
```
Keep tool results small. If you need the full artifact later, store it outside LangGraph and pass a reference ID.
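A sketch of that reference-ID pattern, assuming a hypothetical in-memory `artifact_store` (any blob store, cache, or database plays the same role):

```python
import uuid

huge_pdf_text = "..."  # stands in for megabytes of extracted PDF text
artifact_store: dict[str, str] = {}  # stand-in for S3/Redis/Postgres

def fetch_docs(query: str):
    doc_id = str(uuid.uuid4())
    artifact_store[doc_id] = huge_pdf_text  # full artifact lives outside the graph
    # Only the ID and a small preview ever enter graph state
    return {"doc_id": doc_id, "preview": huge_pdf_text[:500]}
```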
2) Your model context window is too small
Even if your app memory is fine, the model can still fail because the prompt exceeds its token limit.
```python
llm = ChatOpenAI(model="gpt-4o-mini", max_tokens=2000)
# max_tokens only caps the *output*; it does nothing about prompt size.
# If your prompt + history is already huge, this will still fail.
```
Fix by reducing history, summarizing older turns, or switching to a model with a larger context window.
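One way to enforce this is a budget before every call. Recent versions of langchain-core ship a `trim_messages` helper; the sketch below uses `token_counter=len`, which counts each message as one "token" as a cheap approximation:

```python
from langchain_core.messages import trim_messages

def chat_node(state: State):
    # Keep at most the 10 most recent messages; swap token_counter for a
    # real tokenizer (or the model itself) to budget by actual tokens.
    trimmed = trim_messages(
        state["messages"],
        strategy="last",
        token_counter=len,
        max_tokens=10,
    )
    return {"messages": [llm.invoke(trimmed)]}
```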
3) You are holding large objects in state
LangGraph state should contain serializable control data, not DataFrames, embeddings arrays, or binary blobs.
```python
# BAD
state["report_df"] = pandas_dataframe

# GOOD
state["report_id"] = "report_123"
```
If a node needs the data again, load it from object storage or a database by ID.
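A minimal sketch of that rehydration step, assuming a hypothetical `load_report` helper over whatever storage you use:

```python
def report_node(state: State):
    # Rehydrate the heavy object on demand; it never lives in graph state.
    df = load_report(state["report_id"])  # e.g. read from S3 or Postgres
    # Put only a small derived result back into state
    return {"report_summary": f"{len(df)} rows"}
```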
4) Parallel branches duplicate memory usage
Fan-out graphs can multiply memory pressure if each branch receives the same large payload.
```python
# BAD: same huge state sent to multiple branches at once
builder.add_conditional_edges("router", route_fn)
```
Use smaller branch-specific inputs. If needed, split state into lightweight routing data and externalized payload references.
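A sketch of that split, with a hypothetical `RoutingState` holding only control data:

```python
from typing import TypedDict

class RoutingState(TypedDict):
    # Lightweight control data the router actually needs
    intent: str
    doc_id: str  # reference to the heavy payload stored elsewhere

def route_fn(state: RoutingState) -> str:
    # Each branch receives only the flag and the ID, never the payload itself
    return "summarize" if state["intent"] == "summarize" else "answer"
```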
How to Debug It
- Check whether memory grows per turn.
  - Log message count and approximate token size before each LLM call (see the sketch after this list).
  - If `len(state["messages"])` keeps increasing without bound, that's your problem.
- Print the exact payload going into inference.
  - Inspect what you send to `llm.invoke(...)`.
  - Look for long tool outputs, repeated system prompts, or duplicated messages from reducers.
- Separate provider OOM from local OOM.
  - Local GPU OOM usually shows `torch.cuda.OutOfMemoryError` or CUDA allocator errors.
  - Provider/context issues show HTTP 400/413-style errors or token-limit messages.
- Disable nodes one by one.
  - Comment out tools first.
  - Then remove branching.
  - Then trim history.
  - The node that makes memory spike is usually easy to spot once you isolate it.
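A minimal logging sketch for the first check; `log_payload` is a hypothetical helper, and chars / 4 is only a rough token estimate:

```python
def log_payload(messages) -> None:
    # Crude size estimate: total characters divided by 4
    chars = sum(len(getattr(m, "content", "") or "") for m in messages)
    print(f"messages={len(messages)} approx_tokens={chars // 4}")

def chat_node(state: State):
    recent = state["messages"][-8:]
    log_payload(recent)  # this number should stay roughly flat across turns
    return {"messages": [llm.invoke(recent)]}
```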
Prevention
- Keep LangGraph state small and explicit.
  - Store IDs, summaries, and routing flags.
  - Put large artifacts in S3, Redis, Postgres, or object storage.
- Add message trimming early.
  - Use sliding windows for chat flows.
  - Summarize older turns before they hit your main reasoning node (see the sketch after this list).
- Put guardrails around tool output.
  - Truncate logs.
  - Extract only relevant fields from documents.
  - Never return raw binary blobs into graph state.
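A sketch of the summarization idea, using a hypothetical `summarize_node` that runs before the main reasoning node (note the summary message gets appended after the recent turns, which is fine for a sketch):

```python
from langchain_core.messages import HumanMessage, RemoveMessage, SystemMessage

def summarize_node(state: State):
    old = state["messages"][:-8]
    if not old:
        return {}
    summary = llm.invoke(
        [SystemMessage(content="Summarize this conversation in a few sentences."), *old]
    )
    # Delete the old turns and keep one compact summary message instead;
    # add_messages treats RemoveMessage(id=...) as a deletion.
    return {
        "messages": [
            *[RemoveMessage(id=m.id) for m in old],
            HumanMessage(content=f"Conversation summary: {summary.content}"),
        ]
    }
```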
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.