How to Fix 'intermittent 500 errors when scaling' in LangGraph (Python)

By Cyprian Aarons · Updated 2026-04-21

Intermittent 500 errors when scaling a LangGraph app usually mean your graph is fine in a single-process dev run, but breaks under concurrency, retries, or multiple worker replicas. In practice, this shows up when state is not isolated per run, checkpoints are misconfigured, or a node depends on mutable global objects.

The symptom is ugly: some requests succeed, others fail with HTTP 500, and the logs often show LangGraph runtime errors like InvalidUpdateError, GraphRecursionError, or plain Python exceptions from shared state being mutated at the wrong time.

The Most Common Cause

The #1 cause is shared mutable state across concurrent graph runs.

This happens when you keep request data in module-level variables, reuse the same mutable object across invocations, or mutate a dict/list that LangGraph nodes read from multiple threads/workers. Under light load it looks fine. Under scale, one run overwrites another and the graph explodes with inconsistent state.

Broken vs fixed pattern

  • Reuses shared mutable objects → Creates fresh state per invocation
  • Mutates globals inside nodes → Treats node input as immutable
  • Depends on process-local memory for request data → Passes all request-specific data through graph state
# broken.py
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    messages: list[str]
    user_id: str

# Shared mutable object. This will bleed across requests.
DEFAULT_STATE = {"messages": [], "user_id": ""}

def add_message(state: State):
    DEFAULT_STATE["messages"].append(f"hello from {state['user_id']}")
    return DEFAULT_STATE

builder = StateGraph(State)
builder.add_node("add_message", add_message)
builder.set_entry_point("add_message")
builder.add_edge("add_message", END)
graph = builder.compile()

# Under load, different requests can mutate DEFAULT_STATE together.
result = graph.invoke({"messages": [], "user_id": "u123"})

# fixed.py
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    messages: list[str]
    user_id: str

def add_message(state: State):
    # Create a new object every time.
    return {
        "messages": state["messages"] + [f"hello from {state['user_id']}"],
        "user_id": state["user_id"],
    }

builder = StateGraph(State)
builder.add_node("add_message", add_message)
builder.set_entry_point("add_message")
builder.add_edge("add_message", END)
graph = builder.compile()

result = graph.invoke({"messages": [], "user_id": "u123"})

If you see errors like:

  • langgraph.errors.InvalidUpdateError: Expected dict, got list
  • langgraph.errors.GraphRecursionError: Recursion limit of 25 reached without hitting a stop condition
  • random KeyError / TypeError only under load

check for shared mutable state first.
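The bleed is easy to reproduce without LangGraph at all. This sketch simulates concurrent "requests" appending to one shared dict, the same pattern as DEFAULT_STATE in broken.py:

```python
from concurrent.futures import ThreadPoolExecutor

# Shared mutable "state", like DEFAULT_STATE in broken.py.
SHARED = {"messages": []}

def handle_request(user_id: str) -> list[str]:
    SHARED["messages"].append(f"hello from {user_id}")
    # Snapshot what this "request" sees at return time.
    return list(SHARED["messages"])

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(handle_request, [f"u{i}" for i in range(20)]))

# Each request appended exactly one message, so each should see one.
# Instead, later requests see every earlier user's messages mixed in.
leaked = [r for r in results if len(r) > 1]
print(f"{len(leaked)} of 20 requests saw another user's data")
```

Run it a few times: the leak count varies with thread scheduling, which is exactly why the bug looks intermittent in production.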

Other Possible Causes

1) Missing or broken checkpointing in multi-replica deployments

If you scale horizontally and rely on conversational state, you need a durable checkpointer. Without it, one replica may not see the previous step and your graph can restart mid-run.

# bad: no durable checkpointing for multi-instance deployment
graph = builder.compile()
# good: use a persistent checkpointer
from langgraph.checkpoint.sqlite import SqliteSaver

# In recent langgraph-checkpoint-sqlite releases, from_conn_string
# is a context manager; older releases returned the saver directly.
with SqliteSaver.from_conn_string("checkpoints.db") as checkpointer:
    graph = builder.compile(checkpointer=checkpointer)

If you're running multiple pods, use a real shared backend instead of local SQLite.
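The idea behind a shared checkpointer can be sketched with plain sqlite3. This is a stand-in, not LangGraph's actual checkpoint schema; in production you'd use a LangGraph checkpointer backed by a database every replica can reach:

```python
import json
import sqlite3

# Two "replicas" sharing one durable store. In-memory SQLite here for
# demonstration; a real deployment needs a shared backend like Postgres.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checkpoints (thread_id TEXT PRIMARY KEY, state TEXT)")

def save_checkpoint(thread_id: str, state: dict) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints VALUES (?, ?)",
        (thread_id, json.dumps(state)),
    )

def load_checkpoint(thread_id: str):
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None

# Replica A handles step 1 and persists the result.
save_checkpoint("conv-42", {"messages": ["hello from u123"]})
# Replica B picks up the same conversation and sees step 1.
print(load_checkpoint("conv-42"))  # {'messages': ['hello from u123']}
```

Without that shared table, replica B would start the conversation from scratch, which is where the mid-run restarts come from.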

2) Non-idempotent side effects inside nodes

A node that writes to a database, sends an email, or calls an external API can be retried by your app server or upstream gateway. That creates duplicate writes and intermittent failures that look like LangGraph instability.

def charge_card(state):
    # bad: side effect fires before you know whether this run will be retried
    payments.charge(state["amount"])
    return {"status": "charged"}

Fix it by making the node idempotent and storing an operation key:

def charge_card(state):
    op_id = state["run_id"]
    if payments.exists(op_id):
        return {"status": "already_charged"}
    payments.charge(idempotency_key=op_id, amount=state["amount"])
    return {"status": "charged"}

3) Concurrency bugs in async nodes

Mixing sync and async code incorrectly can create race conditions or event-loop issues under load.

# bad: blocking HTTP call inside an async node starves the event loop
import requests

async def fetch_profile(state):
    profile = requests.get(f"https://api/users/{state['user_id']}").json()
    return {"profile": profile}

Use an async client:

import httpx

async def fetch_profile(state):
    async with httpx.AsyncClient(timeout=10) as client:
        resp = await client.get(f"https://api/users/{state['user_id']}")
        resp.raise_for_status()
        return {"profile": resp.json()}

4) Graph cycles without a hard stop condition

A conditional edge with no guaranteed path to END will work for most inputs, then blow past the recursion limit (25 by default) on the inputs that keep looping.

# bad: no termination path for some states
builder.add_conditional_edges(
    "router",
    lambda s: "router" if s["needs_more"] else END,
)

Make sure every branch can terminate, and that a node on the loop actually increments the counter the router checks:

def route(state):
    # The "router" node must bump state["attempts"] on each pass,
    # otherwise this cap never fires.
    if state["attempts"] >= 3:
        return END
    return "router"

How to Debug It

  1. Reproduce with concurrency turned up

    • Hit the endpoint with 20–50 parallel requests.
    • If failures appear only under load, suspect shared state or non-idempotent side effects.
  2. Log the full exception class and LangGraph traceback

    • Look for langgraph.errors.InvalidUpdateError, langgraph.errors.GraphRecursionError, or node-level exceptions.
    • Add logging around each node entry/exit with a request/run ID.
  3. Disable all external side effects

    • Temporarily stub database writes, HTTP calls, email sending, and cache mutations.
    • If the 500s disappear, the bug is outside the graph engine.
  4. Run with a single worker and then scale up

    • Compare behavior between:
      • uvicorn app:app --workers 1
      • multiple workers / pods behind a load balancer
    • If only multi-worker fails, your state/checkpoint strategy is wrong.
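Step 2's per-node logging can be sketched as a small decorator that tags every entry/exit with the run ID carried in state (traced and the run_id key are illustrative names, not LangGraph APIs):

```python
import logging
import uuid
from functools import wraps

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("graph")

def traced(node):
    """Wrap a node so every entry/exit is logged with the run ID in state."""
    @wraps(node)
    def wrapper(state):
        run_id = state.get("run_id", "?")
        log.info("enter %s run=%s", node.__name__, run_id)
        try:
            return node(state)
        finally:
            log.info("exit  %s run=%s", node.__name__, run_id)
    return wrapper

@traced
def add_message(state):
    return {**state, "messages": state["messages"] + ["hi"]}

out = add_message({"run_id": str(uuid.uuid4()), "messages": []})
print(out["messages"])  # ['hi']
```

Once every log line carries a run ID, interleaved failures from concurrent runs become separable instead of looking like random noise.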

Prevention

  • Treat LangGraph state as immutable per run. Return new dicts instead of mutating shared objects.
  • Use durable checkpointers when you need cross-request memory or multi-replica deployment.
  • Make every side effect idempotent with a run ID or idempotency key.

If you’re seeing intermittent 500s only after scaling LangGraph in Python, start by removing shared mutable state and then verify checkpointing. In production systems, that fixes most of these incidents faster than anything else.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

