How to Fix 'OOM error during inference during development' in LangGraph (Python)

By Cyprian Aarons · Updated 2026-04-21

When you hit an OOM error during inference while developing a LangGraph app, it usually means your process ran out of memory while executing a node, merging state updates, or holding too much conversation/context in RAM. In practice, this shows up during local dev when you keep appending messages, return large objects from nodes, or accidentally create a graph that re-runs heavy work on every step.

The important part: this is usually not a LangGraph bug. It’s almost always a state shape problem, a context growth problem, or a model/runtime config issue.

The Most Common Cause

The #1 cause is unbounded state growth.

In LangGraph, developers often store the full message history in state and keep appending to it on every turn. That works for a few steps, then memory balloons because each node sees the entire history again and again.

Here’s the broken pattern:

# broken.py
import operator
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, END
from langchain_core.messages import AIMessage

class State(TypedDict):
    # plain list concatenation: every update is appended as-is, nothing is merged
    messages: Annotated[list, operator.add]

def chat_node(state: State):
    # echoes the entire growing history back along with the new message, so the
    # reducer appends a full copy of the conversation on every turn
    response = AIMessage(content="some long response...")
    return {"messages": state["messages"] + [response]}

graph = StateGraph(State)
graph.add_node("chat", chat_node)
graph.set_entry_point("chat")
graph.add_edge("chat", END)
app = graph.compile()
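
If you drive the broken graph in a loop, feeding each output back in as the next input, you can watch the history balloon. A quick sketch, assuming the app compiled above:

# sketch: the message count roughly doubles on every turn
from langchain_core.messages import HumanMessage

state = {"messages": [HumanMessage(content="hi")]}
for turn in range(10):
    state = app.invoke(state)
    print("turn", turn, "messages in state:", len(state["messages"]))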

And here’s the fixed pattern:

# fixed.py
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langchain_core.messages import AIMessage

class State(TypedDict):
    # add_messages merges each returned message into the stored history for you
    messages: Annotated[list, add_messages]

def chat_node(state: State):
    # only append the new message; don't duplicate prior history
    response = AIMessage(content="some long response...")
    return {"messages": [response]}

graph = StateGraph(State)
graph.add_node("chat", chat_node)
graph.set_entry_point("chat")
graph.add_edge("chat", END)
app = graph.compile()

The node-level fix is to return only the new message and let the reducer merge it. But that alone only slows the growth: with add_messages the history still accumulates turn after turn, so the real fix is to control what goes into messages upstream:

  • trim history before passing it into the graph
  • summarize older turns
  • avoid storing raw documents or tool payloads in message state

If you are using add_messages, remember that it only ever adds and merges; it never drops old messages unless you explicitly prune them.
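
One way to prune is a dedicated node that emits deletion instructions. A minimal sketch, assuming your state uses the add_messages reducer (which understands RemoveMessage) and an arbitrary MAX_MESSAGES budget:

# prune_node.py
from langchain_core.messages import RemoveMessage

MAX_MESSAGES = 20  # assumption: tune this to your use case

def prune_node(state):
    messages = state["messages"]
    if len(messages) <= MAX_MESSAGES:
        return {}
    # add_messages treats RemoveMessage(id=...) as "delete this message from state",
    # so this drops everything except the most recent MAX_MESSAGES entries
    return {"messages": [RemoveMessage(id=m.id) for m in messages[:-MAX_MESSAGES]]}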

Other Possible Causes

1) Returning large tool outputs into graph state

If a tool returns a giant JSON blob or a full document dump, LangGraph will keep that in state and feed it back through later nodes.

# bad: returning huge payloads directly into state
return {
    "tool_result": huge_json_blob,
    "messages": [AIMessage(content=f"Got {len(huge_json_blob)} bytes")]
}

Fix by storing only references or summaries:

# better: store an ID or summary instead of the full blob
return {
    "tool_result_id": blob_id,
    "messages": [AIMessage(content="Tool completed successfully")]
}
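
A minimal sketch of that pattern, using an in-memory dict as a stand-in for whatever store you actually use (Redis, a file, a database) and a hypothetical run_search_tool helper:

# sketch: keep the raw payload out of graph state and store a reference instead
import uuid

from langchain_core.messages import AIMessage

blob_store: dict[str, str] = {}  # stand-in for Redis, a file cache, or a database

def save_blob(payload: str) -> str:
    blob_id = str(uuid.uuid4())
    blob_store[blob_id] = payload
    return blob_id

def tool_node(state):
    payload = run_search_tool(state["query"])  # hypothetical tool returning a huge string
    return {
        "tool_result_id": save_blob(payload),
        "messages": [AIMessage(content="Tool completed successfully")],
    }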

2) Recursive loops with no stop condition

A graph that cycles without a hard exit will eventually allocate more memory until Python dies.

# risky loop pattern
graph.add_edge("agent", "tools")
graph.add_edge("tools", "agent")  # no guardrail if the agent keeps calling tools

Use conditional routing with an explicit stop:

def should_continue(state):
    return "tools" if state["needs_tool"] else END

graph.add_conditional_edges("agent", should_continue)
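
LangGraph also enforces a per-invocation recursion_limit (25 by default) and raises GraphRecursionError when a run exceeds it. Lowering it during development turns a runaway loop into an immediate, readable error instead of slow memory exhaustion:

# sketch: fail fast on runaway loops instead of letting memory climb
from langchain_core.messages import HumanMessage

result = app.invoke(
    {"messages": [HumanMessage(content="hi")]},
    config={"recursion_limit": 10},  # much lower than the default of 25 for local dev
)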

3) Model context window overflow

Sometimes the OOM happens inside inference because your prompt is too large for the model backend. This often surfaces as backend or provider errors like:

  • CUDA out of memory
  • RuntimeError: CUDA error: out of memory
  • ValueError: Input length exceeds model context window

Fix by truncating or summarizing before calling the model:

def compact_messages(messages):
    return messages[-10:]  # keep only the last 10 messages
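
If you are on a recent langchain-core, trim_messages gives a token-aware version of the same idea. The sketch below assumes a 4,000-token budget and a chat model instance that can count tokens for its own tokenizer:

# sketch: token-aware trimming before the model call
from langchain_core.messages import trim_messages

def compact_messages(messages, model):
    return trim_messages(
        messages,
        max_tokens=4000,      # assumption: keep this well under the model's context window
        strategy="last",      # keep the most recent messages
        token_counter=model,  # the chat model counts tokens with its own tokenizer
        include_system=True,  # never trim away the system prompt
    )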

4) Loading heavyweight models locally during development

If you’re using local inference with Ollama, vLLM, Transformers, or PyTorch models inside your LangGraph node, memory pressure can come from the model itself rather than LangGraph.

# bad: loading the model inside the node means it is reloaded on every invocation
def infer_node(state):
    model = AutoModelForCausalLM.from_pretrained("big-model-name").to("cuda")
    return {"result": model.generate(...)}

Move model loading outside the node and reuse it:

# better: load once at import time and reuse the same instance across invocations
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("big-model-name").to("cuda")

def infer_node(state):
    return {"result": model.generate(...)}
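
If you would rather not pay the load cost at import time, a lazily cached loader gives the same load-once behavior; a sketch using functools.lru_cache:

# sketch: lazy, one-time model load shared by every node invocation
from functools import lru_cache

@lru_cache(maxsize=1)
def get_model():
    from transformers import AutoModelForCausalLM
    return AutoModelForCausalLM.from_pretrained("big-model-name").to("cuda")

def infer_node(state):
    model = get_model()  # loaded on the first call, reused afterwards
    return {"result": model.generate(...)}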

How to Debug It

  1. Check whether memory grows per turn

    • Run one graph step at a time.
    • Watch RSS/VRAM with top, htop, nvidia-smi, or psutil.
    • If memory climbs on every invocation, you have state accumulation (see the psutil sketch after this list).
  2. Print the size of your state

    • Log message count and payload sizes before each node runs.
    • Look for giant strings, documents, PDFs converted to text, or raw tool responses.
def debug_state(state):
    messages = state.get("messages", [])
    print("messages:", len(messages))
    # rough payload size: total characters across message contents
    print("chars:", sum(len(str(m.content)) for m in messages))
    print("keys:", list(state.keys()))
  3. Inspect your graph edges

    • Check for accidental cycles.
    • Verify every loop has a stop condition.
    • In LangGraph terms, make sure your conditional edges can actually reach END.
  4. Reduce to one node

    • Temporarily remove tools and extra branches.
    • If the OOM disappears in a minimal graph built with StateGraph and compile(), the issue is in routing or payload size.
    • Reintroduce nodes one by one until memory spikes again.
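
A minimal sketch for step 1, assuming psutil is installed and the compiled app from the earlier examples:

# sketch: measure process memory growth per invocation
import psutil

from langchain_core.messages import HumanMessage

def rss_mb() -> float:
    return psutil.Process().memory_info().rss / 1e6

before = rss_mb()
result = app.invoke({"messages": [HumanMessage(content="hello")]})
print(f"RSS grew by {rss_mb() - before:.1f} MB this turn")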

Prevention

  • Keep graph state small:

    • store IDs, summaries, and short message windows instead of raw blobs
  • Put hard limits on history:

    • truncate messages before they hit the model
    • summarize older conversation turns after N steps
  • Treat tool output as untrusted payload:

    • never dump full API responses directly into messages or shared state
  • Add runtime checks:

    • log token counts before inference
    • reject oversized inputs early instead of letting Python crash later (see the guard sketch below)
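
A minimal guard along those lines, a sketch with an assumed MAX_TOKENS budget and a crude four-characters-per-token estimate (swap in a real tokenizer for anything serious):

# sketch: reject oversized prompts before they ever reach the model
MAX_TOKENS = 8000  # assumption: pick a limit below your model's context window

def guard_node(state):
    text = "".join(str(m.content) for m in state["messages"])
    approx_tokens = len(text) // 4  # rough heuristic, not a real tokenizer
    if approx_tokens > MAX_TOKENS:
        raise ValueError(
            f"Prompt too large: ~{approx_tokens} tokens (limit {MAX_TOKENS})"
        )
    return {}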

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

Get the Starter Kit
