# How to Fix 'OOM error during inference when scaling' in LangGraph (Python)
## What this error means
OOM error during inference when scaling usually means your LangGraph app is creating more memory pressure than the worker can handle during a burst of concurrent graph executions. In practice, this shows up when you scale from local testing to multiple requests per second, or when a graph node keeps large state objects alive across steps.
The failure is often not in LangGraph itself. It’s usually your graph state, model client config, or concurrency pattern causing the process to hit GPU VRAM or system RAM limits.
## The Most Common Cause
The #1 cause is storing too much in graph state and passing that full state into every node, especially large chat histories, retrieved documents, embeddings, or tool outputs.
LangGraph makes it easy to thread state through nodes, but if you keep appending without trimming, every step gets more expensive. Under load, that becomes `torch.cuda.OutOfMemoryError`, `RuntimeError: CUDA out of memory`, or a plain process OOM kill.
### Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Keep full message history and retrieval payloads in state | Keep only the minimal working set |
| Pass giant dicts between nodes | Store references/IDs and fetch on demand |
| Let state grow unbounded per request | Trim messages and cap context size |
```python
# BROKEN
from typing import TypedDict, List

from langchain_core.messages import BaseMessage
from langgraph.graph import StateGraph, END

class State(TypedDict):
    messages: List[BaseMessage]
    docs: list  # full retrieved docs kept forever

def retrieve(state: State):
    docs = vectorstore.similarity_search(state["messages"][-1].content, k=20)
    state["docs"] = docs  # heavy objects now live in graph state
    return state

def generate(state: State):
    # Full history + all docs sent every time
    prompt = {
        "messages": state["messages"],
        "docs": [d.page_content for d in state["docs"]],
    }
    result = llm.invoke(prompt)
    return {"messages": state["messages"] + [result]}

# This pattern grows memory with every turn.
```
```python
# FIXED
from typing import TypedDict, Annotated

from langchain_core.messages import BaseMessage
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages

class State(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    doc_ids: list[str]

def retrieve(state: State):
    docs = vectorstore.similarity_search(state["messages"][-1].content, k=5)
    return {"doc_ids": [d.metadata["id"] for d in docs]}

def generate(state: State):
    doc_texts = fetch_docs_by_id(state["doc_ids"])  # load only what you need now
    prompt = {
        "messages": state["messages"][-8:],  # trim context window
        "docs": doc_texts,
    }
    result = llm.invoke(prompt)
    return {"messages": [result]}

# Keep state small; fetch heavy data lazily.
```
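Both snippets assume `vectorstore`, `llm`, and (in the fixed version) a `fetch_docs_by_id` storage helper are defined elsewhere. For completeness, here is a minimal sketch of wiring the fixed nodes into a runnable graph:

```python
# Minimal wiring sketch for the fixed nodes above
builder = StateGraph(State)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.set_entry_point("retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)
app = builder.compile()
```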
The key change is not “use LangGraph differently.” It’s “stop treating graph state like a dump truck.”
## Other Possible Causes
### 1) Too much concurrency at the worker level
If you run multiple graph invocations concurrently on one GPU process, each inference can allocate its own KV cache and activations.
```python
# Too aggressive for a single GPU worker
import asyncio

async def main():
    await asyncio.gather(*[
        app.ainvoke({"messages": [...]})
        for _ in range(16)
    ])
```
Fix it by limiting concurrency:
```python
sem = asyncio.Semaphore(2)

async def guarded_invoke(payload):
    async with sem:
        return await app.ainvoke(payload)
```
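With the guard in place, the same 16-way fan-out still completes, but at most two invocations touch the GPU at any moment:

```python
async def main():
    # Only 2 of the 16 run concurrently; the rest wait on the semaphore
    return await asyncio.gather(*[
        guarded_invoke({"messages": [...]})
        for _ in range(16)
    ])
```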
### 2) Model loaded with the wrong precision or no quantization
A model that fits in fp16 might still OOM under concurrent load. If you’re using local transformers or vLLM-backed setups, verify precision and max sequence length.
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,  # expensive: 4 bytes per parameter
)
```
Use lower precision where supported:
```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
```
If you’re on GPU and using long prompts, also reduce max_new_tokens and input context size.
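Concretely, you can enforce both limits at generation time. A minimal sketch with the transformers API (`tokenizer`, `prompt_text`, and the exact limits are illustrative; tune them to your model's context window):

```python
# Truncate the input and cap how much the model may generate
inputs = tokenizer(
    prompt_text,
    truncation=True,
    max_length=4096,  # hard cap on input context
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)  # hard cap on output length
```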
### 3) Recursive graph loops that never shrink state
A conditional edge that keeps routing back without trimming messages will balloon memory fast.
```python
def should_continue(state):
    return "tools" if len(state["messages"]) < 50 else END
```
Better:
```python
def should_continue(state):
    if len(state["messages"]) > 12:
        return END
    return "tools"
```
Also inspect loops involving ToolNode, retries, and fallback edges. A retry policy can multiply allocations if the same failing node is re-run repeatedly.
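As a backstop, LangGraph also lets you cap the total number of steps in a single run via the `recursion_limit` config value (the default is 25), so a misbehaving loop fails fast instead of allocating until OOM:

```python
# Fail fast with GraphRecursionError instead of looping toward OOM
result = app.invoke(
    {"messages": [...]},
    config={"recursion_limit": 10},
)
```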
### 4) Large tool outputs returned into state
A tool that returns a full HTML page, PDF text dump, or huge JSON blob can blow up your next LLM call.
```python
import requests

def search_tool(query: str):
    return requests.get(url).text  # huge payload returned directly to graph state
```
Trim it before returning:
```python
def search_tool(query: str):
    text = requests.get(url).text
    return text[:4000]
```
Better still: store the raw response externally and return a pointer:
return {"artifact_id": save_blob(text)}
## How to Debug It
- Check whether the OOM happens on first request or after several turns:
  - First request points to model size / precision / context length.
  - After several turns points to unbounded LangGraph state growth.
- Print state size at each node:

```python
def debug_node(state):
    print("messages:", len(state.get("messages", [])))
    print("keys:", list(state.keys()))
    return {}
```

If `messages`, `docs`, or tool outputs keep growing, you found the leak.

- Measure process and GPU memory before/after each node, running this inside suspicious nodes:

```python
import os

import psutil
import torch

def mem_debug():
    print("rss_mb", psutil.Process(os.getpid()).memory_info().rss / 1024**2)
    print("cuda_mb", torch.cuda.memory_allocated() / 1024**2)
```

- Disable concurrency and retries temporarily:
  - Set worker concurrency to 1.
  - Remove retry wrappers around LLM calls.
  - Run one invocation at a time.

If the OOM disappears, the problem is load amplification rather than one bad prompt.
## Prevention
- Keep LangGraph state small.
  - Store IDs, summaries, and short message windows.
  - Fetch large artifacts only when needed.
- Put hard limits on inputs.
  - Cap `max_new_tokens`.
  - Trim conversation history before each model call (see the sketch below).
  - Limit retrieval `k`.
- Control execution pressure.
  - Use semaphores or queue-based throttling for concurrent invocations.
  - Don’t scale workers faster than your GPU memory budget allows.
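For the history-trimming point, `langchain_core` ships a `trim_messages` helper. A minimal sketch (here `token_counter=len` counts messages rather than tokens; swap in your model's tokenizer for real token budgets):

```python
from langchain_core.messages import trim_messages

# Keep only the tail of the conversation before each model call
short_history = trim_messages(
    state["messages"],
    max_tokens=8,       # with token_counter=len, this means 8 messages
    token_counter=len,
    strategy="last",
)
```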
If you’re seeing `RuntimeError: CUDA out of memory`, `torch.cuda.OutOfMemoryError`, or a Kubernetes pod getting killed right after LangGraph inference starts scaling out, start by shrinking graph state. That fixes more cases than model swapping ever will.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies

By Cyprian Aarons, AI Consultant at Topiax.