How to Fix 'OOM error during inference in production' in LangGraph (Python)

By Cyprian Aarons · Updated 2026-04-21

When you see OOM error during inference in production with LangGraph, it usually means your process is exhausting RAM or GPU memory while the graph is running. In practice, this shows up under load, with long conversation state, large tool outputs, or when you accidentally keep multiple model contexts alive at once.

The fix is rarely “buy more memory.” It’s usually a graph/state design problem, a model invocation pattern problem, or both.

The Most Common Cause

The #1 cause is unbounded state growth inside the graph. In LangGraph, every node can append to state, and if you keep full message history, raw tool payloads, retrieved documents, and intermediate LLM outputs in one MessagesState, memory usage climbs fast.

A common mistake is to pass the entire growing conversation back into the model on every node execution.

Broken pattern                                 | Fixed pattern
Rebuilds prompts from full history every turn  | Trims/summarizes state before inference
Stores raw tool output in messages             | Stores only compact summaries/IDs
Keeps every intermediate node output in memory | Persists only what downstream nodes need
# BROKEN
from langgraph.graph import StateGraph, MessagesState
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def agent_node(state: MessagesState):
    # state["messages"] keeps growing forever
    response = llm.invoke(state["messages"])
    return {"messages": state["messages"] + [response]}

graph = StateGraph(MessagesState)
graph.add_node("agent", agent_node)

# FIXED
from langgraph.graph import StateGraph, MessagesState
from langchain_openai import ChatOpenAI
from langchain_core.messages import trim_messages

llm = ChatOpenAI(model="gpt-4o-mini")

def agent_node(state: MessagesState):
    trimmed = trim_messages(
        state["messages"],
        max_tokens=4000,
        strategy="last",
        token_counter=llm.get_num_tokens_from_messages,
    )
    response = llm.invoke(trimmed)
    # Return only the new message; the add_messages reducer appends it
    return {"messages": [response]}

graph = StateGraph(MessagesState)
graph.add_node("agent", agent_node)

The key change is that the model sees a bounded context window. You still keep the stateful workflow benefits of LangGraph, but you stop feeding the entire transcript into every inference call.
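
For completeness, here is how the fixed graph wires up and runs. The START/END edges are assumed, since the original snippet stops at add_node:

from langgraph.graph import START, END

graph.add_edge(START, "agent")
graph.add_edge("agent", END)
app = graph.compile()

result = app.invoke({"messages": [("user", "Hello!")]})
print(result["messages"][-1].content)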

Other Possible Causes

1. Tool outputs are too large

If a tool returns a huge JSON blob, an HTML page, a PDF text dump, or a stringified dataframe, and you store it directly in messages, you can blow up memory instantly.

# BAD: storing raw payload in graph state
return {"messages": [tool_result]}

Use compact storage instead:

# GOOD: store a summary and externalize the raw artifact
# (assumes a custom state schema with an "artifacts" key; see below)
return {
    "messages": [f"Tool completed. result_id={result_id}"],
    "artifacts": {"result_id": result_id},
}
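
The artifacts key is not part of MessagesState, so declare it in a custom state schema. A minimal sketch, where run_tool() and store_externally() are hypothetical stand-ins for your tool call and object store:

from typing import Annotated, TypedDict
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    artifacts: dict  # result_id -> storage reference, never the raw payload

def tool_node(state: AgentState):
    raw = run_tool(state)              # hypothetical: the actual tool call
    result_id = store_externally(raw)  # hypothetical: S3, Redis, disk, ...
    return {
        "messages": [f"Tool completed. result_id={result_id}"],
        "artifacts": {result_id: f"s3://bucket/{result_id}"},
    }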

2. You are loading too many model workers or parallel branches

If your graph fans out with Send or parallel edges and each branch calls a large model concurrently, peak memory spikes fast. This gets worse if each worker loads its own weights or client buffers.

# Example: parallel fan-out can multiply memory usage
for item in items:
    graph.add_edge("router", f"worker_{item}")

Cap concurrency at the application layer and avoid unnecessary fan-out for large models.

# Pass the cap through the runnable config when invoking the compiled graph
app.invoke(inputs, config={"max_concurrency": 2})
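
If you fan out dynamically with Send, you can also cap how many branches the router emits. A minimal sketch; the worker node, items key, and thresholds are illustrative, not from the original:

import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.types import Send

MAX_BRANCHES = 4  # hypothetical cap on fan-out width

class FanOutState(TypedDict):
    items: list[str]
    results: Annotated[list[str], operator.add]  # branches append here

def fan_out(state: FanOutState):
    # Emit at most MAX_BRANCHES Send packets instead of one per item
    return [Send("worker", {"item": item}) for item in state["items"][:MAX_BRANCHES]]

def worker(state: dict):
    # Each Send delivers its own payload dict to this node
    return {"results": [f"processed:{state['item']}"]}

builder = StateGraph(FanOutState)
builder.add_node("worker", worker)
builder.add_conditional_edges(START, fan_out, ["worker"])
builder.add_edge("worker", END)
app = builder.compile()

out = app.invoke(
    {"items": ["a", "b", "c", "d", "e"], "results": []},
    config={"max_concurrency": 2},  # at most two branches run at once
)
print(out["results"])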

3. Your checkpointing/persistence layer is retaining too much

A bad checkpointer setup can keep every intermediate state snapshot forever. That’s useful for debugging; it’s not useful when snapshots contain giant message arrays.

# Be careful with long-lived checkpoints containing huge states
app = graph.compile(checkpointer=checkpointer)

Fix this by checkpointing smaller states and pruning old runs. If you need auditability, persist references instead of full payloads.
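
A quick way to see what a checkpointer retains: compile with the in-memory saver (development only) and walk a thread's snapshot history. The thread_id here is illustrative:

from langgraph.checkpoint.memory import MemorySaver

checkpointer = MemorySaver()
app = graph.compile(checkpointer=checkpointer)

config = {"configurable": {"thread_id": "user-123"}}
app.invoke({"messages": [("user", "hi")]}, config=config)

# Each super-step snapshot stores the full state value, so a giant
# messages list means giant checkpoints
for snapshot in app.get_state_history(config):
    print(snapshot.config["configurable"]["checkpoint_id"],
          len(snapshot.values.get("messages", [])))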

4. The model itself is too large for your runtime

Sometimes the problem is not LangGraph at all. If you run a large local model through vLLM, Transformers, or llama.cpp inside the same service as your graph orchestration, Python memory plus model weights plus tokenizer buffers can exceed available RAM/GPU.

from transformers import AutoModelForCausalLM

# Weights load into this process alongside your graph orchestration
model = AutoModelForCausalLM.from_pretrained(
    "big-model",
    torch_dtype="auto",
    device_map="auto",
)

If this runs inside the same container as LangGraph orchestration, split them into separate services or use a smaller model.
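
After splitting, the orchestration process holds only a lightweight client. A minimal sketch, assuming a vLLM server exposing its OpenAI-compatible API at a hypothetical internal URL:

from langchain_openai import ChatOpenAI

# The weights live in the model server's process, not this one
llm = ChatOpenAI(
    model="big-model",                       # name the server advertises
    base_url="http://vllm-service:8000/v1",  # hypothetical endpoint
    api_key="EMPTY",                         # vLLM accepts any key by default
)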

How to Debug It

  1. Check whether memory grows per turn

    • Log len(state["messages"]), prompt token count, and process RSS.
    • If each request increases memory steadily, you have unbounded state growth.
  2. Inspect the exact payload going into llm.invoke()

    • Print message sizes before inference (see the helper sketch after this list).
    • Look for giant tool outputs, retrieved docs, or repeated system prompts.
  3. Measure peak concurrency

    • Count active graph runs and parallel branches.
    • If OOM happens only under load, reduce max_concurrency and test again.
  4. Separate graph memory from model memory

    • Run the same graph against a small hosted model.
    • If OOM disappears, your local model deployment is the real bottleneck.
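
For step 2, a small hypothetical helper you can call inside a node, right before llm.invoke():

def log_payload(messages) -> None:
    # Character counts are a rough but cheap proxy for payload size
    sizes = sorted(((len(str(m.content)), m.type) for m in messages), reverse=True)
    for size, kind in sizes[:5]:  # five largest messages
        print(f"type={kind} chars={size}")
    print("total_chars=", sum(size for size, _ in sizes))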

Useful signals to log:

import os
import psutil

def log_memory(state) -> None:
    # Call from inside a node to watch per-turn growth
    proc = psutil.Process(os.getpid())
    print("rss_mb=", proc.memory_info().rss / 1024 / 1024)
    print("message_count=", len(state["messages"]))

If you’re on GPU-backed inference:

nvidia-smi

Watch for VRAM climbing across requests instead of returning to baseline.

Prevention

  • Keep LangGraph state small.

    • Store summaries, IDs, and references.
    • Do not store raw documents or full tool responses in messages unless absolutely necessary.
  • Trim before inference.

    • Use trim_messages() or your own summarizer node before any expensive LLM call (a sketch follows this list).
  • Put hard limits on concurrency and payload size.

    • Cap branch fan-out.
    • Reject oversized tool outputs early.
    • Separate orchestration from heavyweight local model serving when possible.
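
A summarizer node might look like this: a minimal sketch assuming the llm from the earlier examples and hypothetical thresholds:

from langchain_core.messages import HumanMessage, RemoveMessage
from langgraph.graph import MessagesState

def summarize_node(state: MessagesState):
    # Hypothetical threshold: compact once history passes 10 messages,
    # keeping the 4 most recent verbatim
    if len(state["messages"]) <= 10:
        return {}
    old = state["messages"][:-4]
    summary = llm.invoke([HumanMessage(
        content="Summarize this conversation so far:\n"
                + "\n".join(str(m.content) for m in old)
    )])
    # RemoveMessage deletes by id through the add_messages reducer;
    # the summary is appended after the messages that remain
    return {"messages": [RemoveMessage(id=m.id) for m in old] + [summary]}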

If you want one rule to remember: LangGraph should orchestrate work, not accumulate everything forever. Once state becomes an append-only dump of every prompt and tool result, OOM is just a matter of time.

