How to Fix 'OOM error during inference' in LangGraph (Python)
What the error means
An 'OOM error during inference' means your process ran out of memory while a model was generating tokens or while LangGraph was holding state between nodes. In practice, this usually shows up when you pass too much conversation history, keep large objects in graph state, or run a model with a context window that is too small for the payload.
In LangGraph, this often happens after a few iterations, not on the first call. The graph keeps accumulating messages or intermediate outputs until the next LLM call blows up with a CUDA OOM, CPU memory spike, or provider-side context overflow.
The Most Common Cause
The #1 cause is unbounded state growth: every node appends full message history back into the graph state, and each inference call gets larger than the last.
Here’s the broken pattern next to the fix:
| Broken | Fixed |
|---|---|
| Stores every message forever | Trims or summarizes state before inference |
| Passes full state into every node | Passes only what the model needs |
| Reuses large tool outputs as-is | Extracts minimal fields |
```python
# BROKEN: unbounded message accumulation in LangGraph
from typing import TypedDict, Annotated

from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI

class State(TypedDict):
    messages: Annotated[list, add_messages]

llm = ChatOpenAI(model="gpt-4o-mini")

def chat_node(state: State):
    # Every turn sends the entire growing history
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

graph = StateGraph(State)
graph.add_node("chat", chat_node)
graph.set_entry_point("chat")
graph.add_edge("chat", END)
app = graph.compile()
```
```python
# FIXED: trim before calling the model
from typing import TypedDict, Annotated

from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langchain_core.messages import SystemMessage
from langchain_openai import ChatOpenAI

class State(TypedDict):
    messages: Annotated[list, add_messages]

llm = ChatOpenAI(model="gpt-4o-mini")

def chat_node(state: State):
    recent_messages = state["messages"][-8:]  # keep only recent turns
    response = llm.invoke(
        [SystemMessage(content="You are a helpful assistant."), *recent_messages]
    )
    return {"messages": [response]}

graph = StateGraph(State)
graph.add_node("chat", chat_node)
graph.set_entry_point("chat")
graph.add_edge("chat", END)
app = graph.compile()
```
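Note that slicing only shrinks the prompt; if you run with a checkpointer, the stored state itself keeps growing. If you also want to delete old turns from state, the `add_messages` reducer understands `RemoveMessage`. A minimal sketch of a dedicated trim node:

```python
from langchain_core.messages import RemoveMessage

def trim_node(state: State):
    # add_messages treats RemoveMessage(id=...) as a deletion, so this
    # drops everything except the last 8 messages from graph state itself.
    return {"messages": [RemoveMessage(id=m.id) for m in state["messages"][:-8]]}
```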
If you’re using `MessagesState`, the same rule applies: don’t keep feeding the full transcript back into inference unless you actually need it.
A common real-world symptom looks like this:

- `RuntimeError: CUDA out of memory. Tried to allocate ...`
- `torch.cuda.OutOfMemoryError`
- `ValueError: Requested tokens exceed context length`
- Provider errors like `400 Bad Request: maximum context length exceeded`
Other Possible Causes
1) Tool outputs are too large
If a tool returns raw PDFs, HTML pages, logs, or database dumps, you may be stuffing megabytes into graph state.
```python
# BAD: returning raw tool output
def fetch_docs(query: str):
    return {"docs": huge_pdf_text}

# GOOD: return compact summaries or extracted fields
def fetch_docs(query: str):
    text = huge_pdf_text[:5000]
    return {"docs": text}
```
Keep tool results small. If you need the full artifact later, store it outside LangGraph and pass a reference ID.
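A sketch of that reference-ID pattern, assuming a hypothetical in-memory `artifact_store` (any blob store, cache, or database plays the same role):

```python
import uuid

huge_pdf_text = "..."  # stands in for megabytes of extracted PDF text
artifact_store: dict[str, str] = {}  # stand-in for S3/Redis/Postgres

def fetch_docs(query: str):
    doc_id = str(uuid.uuid4())
    artifact_store[doc_id] = huge_pdf_text  # full artifact lives outside the graph
    # Only the ID and a small preview ever enter graph state
    return {"doc_id": doc_id, "preview": huge_pdf_text[:500]}
```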
2) Your model context window is too small
Even if your app memory is fine, the model can still fail because the prompt exceeds its token limit.
```python
llm = ChatOpenAI(model="gpt-4o-mini", max_tokens=2000)
# max_tokens only caps the *output*; it does nothing about prompt size.
# If your prompt + history is already huge, this will still fail.
```
Fix by reducing history, summarizing older turns, or switching to a model with a larger context window.
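One way to enforce this is a budget before every call. Recent versions of langchain-core ship a `trim_messages` helper; the sketch below uses `token_counter=len`, which counts each message as one "token" as a cheap approximation:

```python
from langchain_core.messages import trim_messages

def chat_node(state: State):
    # Keep at most the 10 most recent messages; swap token_counter for a
    # real tokenizer (or the model itself) to budget by actual tokens.
    trimmed = trim_messages(
        state["messages"],
        strategy="last",
        token_counter=len,
        max_tokens=10,
    )
    return {"messages": [llm.invoke(trimmed)]}
```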
3) You are holding large objects in state
LangGraph state should contain serializable control data, not DataFrames, embeddings arrays, or binary blobs.
```python
# BAD
state["report_df"] = pandas_dataframe

# GOOD
state["report_id"] = "report_123"
```
If a node needs the data again, load it from object storage or a database by ID.
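A minimal sketch of that rehydration step, assuming a hypothetical `load_report` helper over whatever storage you use:

```python
def report_node(state: State):
    # Rehydrate the heavy object on demand; it never lives in graph state.
    df = load_report(state["report_id"])  # e.g. read from S3 or Postgres
    # Put only a small derived result back into state
    return {"report_summary": f"{len(df)} rows"}
```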
4) Parallel branches duplicate memory usage
Fan-out graphs can multiply memory pressure if each branch receives the same large payload.
```python
# BAD: same huge state sent to multiple branches at once
builder.add_conditional_edges("router", route_fn)
```
Use smaller branch-specific inputs. If needed, split state into lightweight routing data and externalized payload references.
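A sketch of that split, with a hypothetical `RoutingState` holding only control data:

```python
from typing import TypedDict

class RoutingState(TypedDict):
    # Lightweight control data the router actually needs
    intent: str
    doc_id: str  # reference to the heavy payload stored elsewhere

def route_fn(state: RoutingState) -> str:
    # Each branch receives only the flag and the ID, never the payload itself
    return "summarize" if state["intent"] == "summarize" else "answer"
```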
How to Debug It
- Check whether memory grows per turn.
  - Log message count and approximate token size before each LLM call (see the sketch after this list).
  - If `len(state["messages"])` keeps increasing without bound, that's your problem.
- Print the exact payload going into inference.
  - Inspect what you send to `llm.invoke(...)`.
  - Look for long tool outputs, repeated system prompts, or duplicated messages from reducers.
- Separate provider OOM from local OOM.
  - Local GPU OOM usually shows `torch.cuda.OutOfMemoryError` or CUDA allocator errors.
  - Provider/context issues show HTTP 400/413-style errors or token-limit messages.
- Disable nodes one by one.
  - Comment out tools first.
  - Then remove branching.
  - Then trim history.
  - The node that makes memory spike is usually easy to spot once you isolate it.
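A minimal logging sketch for the first check; `log_payload` is a hypothetical helper, and chars / 4 is only a rough token estimate:

```python
def log_payload(messages) -> None:
    # Crude size estimate: total characters divided by 4
    chars = sum(len(getattr(m, "content", "") or "") for m in messages)
    print(f"messages={len(messages)} approx_tokens={chars // 4}")

def chat_node(state: State):
    recent = state["messages"][-8:]
    log_payload(recent)  # this number should stay roughly flat across turns
    return {"messages": [llm.invoke(recent)]}
```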
Prevention
- Keep LangGraph state small and explicit.
  - Store IDs, summaries, and routing flags.
  - Put large artifacts in S3, Redis, Postgres, or object storage.
- Add message trimming early.
  - Use sliding windows for chat flows.
  - Summarize older turns before they hit your main reasoning node (see the sketch after this list).
- Put guardrails around tool output.
  - Truncate logs.
  - Extract only relevant fields from documents.
  - Never return raw binary blobs into graph state.
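A sketch of the summarization idea, using a hypothetical `summarize_node` that runs before the main reasoning node (note the summary message gets appended after the recent turns, which is fine for a sketch):

```python
from langchain_core.messages import HumanMessage, RemoveMessage, SystemMessage

def summarize_node(state: State):
    old = state["messages"][:-8]
    if not old:
        return {}
    summary = llm.invoke(
        [SystemMessage(content="Summarize this conversation in a few sentences."), *old]
    )
    # Delete the old turns and keep one compact summary message instead;
    # add_messages treats RemoveMessage(id=...) as a deletion.
    return {
        "messages": [
            *[RemoveMessage(id=m.id) for m in old],
            HumanMessage(content=f"Conversation summary: {summary.content}"),
        ]
    }
```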
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.