# How to Fix 'intermittent 500 errors when scaling' in LangGraph (Python)
Intermittent 500 errors when scaling a LangGraph app usually mean your graph is fine in a single-process dev run, but breaks under concurrency, retries, or multiple worker replicas. In practice, this shows up when state is not isolated per run, checkpoints are misconfigured, or a node depends on mutable global objects.
The symptom is ugly: some requests succeed, others fail with HTTP 500, and the logs often show LangGraph runtime errors like `InvalidUpdateError`, `GraphRecursionError`, or plain Python exceptions from shared state being mutated at the wrong time.
## The Most Common Cause
The #1 cause is shared mutable state across concurrent graph runs.
This happens when you keep request data in module-level variables, reuse the same mutable object across invocations, or mutate a dict/list that LangGraph nodes read from multiple threads/workers. Under light load it looks fine. Under scale, one run overwrites another and the graph explodes with inconsistent state.
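The failure mode is easy to reproduce without LangGraph at all. The sketch below (plain stdlib Python; the `handle_request` function is a hypothetical stand-in for a graph node) mutates a module-level dict from a thread pool the way the broken pattern does, and every "request" ends up reading everyone else's data:

```python
from concurrent.futures import ThreadPoolExecutor

# Module-level "state", shared by every request -- the broken pattern.
SHARED = {"messages": []}

def handle_request(user_id: str) -> list[str]:
    # Each "request" appends to the same shared list instead of its own copy.
    SHARED["messages"].append(f"hello from {user_id}")
    return SHARED["messages"]

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(handle_request, [f"u{i}" for i in range(20)]))

# Every caller got back the same shared object, polluted by other requests.
print(len(SHARED["messages"]))           # 20 entries from 20 different "requests"
print(results[0] is SHARED["messages"])  # True: all callers share one list
```

Under light sequential traffic each caller would only ever see its own message, which is exactly why this bug hides until you scale.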
### Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Reuses shared mutable objects | Creates fresh state per invocation |
| Mutates globals inside nodes | Treats node input as immutable |
| Depends on process-local memory for request data | Passes all request-specific data through graph state |
```python
# broken.py
from typing import TypedDict

from langgraph.graph import StateGraph, END

class State(TypedDict):
    messages: list[str]
    user_id: str

# Shared mutable object. This will bleed across requests.
DEFAULT_STATE = {"messages": [], "user_id": ""}

def add_message(state: State):
    DEFAULT_STATE["messages"].append(f"hello from {state['user_id']}")
    return DEFAULT_STATE

builder = StateGraph(State)
builder.add_node("add_message", add_message)
builder.set_entry_point("add_message")
builder.add_edge("add_message", END)
graph = builder.compile()

# Under load, different requests can mutate DEFAULT_STATE together.
result = graph.invoke({"messages": [], "user_id": "u123"})
```
```python
# fixed.py
from typing import TypedDict

from langgraph.graph import StateGraph, END

class State(TypedDict):
    messages: list[str]
    user_id: str

def add_message(state: State):
    # Create a new object every time.
    return {
        "messages": state["messages"] + [f"hello from {state['user_id']}"],
        "user_id": state["user_id"],
    }

builder = StateGraph(State)
builder.add_node("add_message", add_message)
builder.set_entry_point("add_message")
builder.add_edge("add_message", END)
graph = builder.compile()

result = graph.invoke({"messages": [], "user_id": "u123"})
```
If you see errors like:

- `langgraph.errors.InvalidUpdateError: Expected dict, got list`
- `langgraph.errors.GraphRecursionError: Recursion limit of 25 reached without hitting a stop condition`
- random `KeyError`/`TypeError` only under load

check for shared mutable state first.
## Other Possible Causes

### 1) Missing or broken checkpointing in multi-replica deployments
If you scale horizontally and rely on conversational state, you need a durable checkpointer. Without it, one replica may not see the previous step and your graph can restart mid-run.
```python
# bad: no durable checkpointing for multi-instance deployment
graph = builder.compile()

# good: use a persistent checkpointer
from langgraph.checkpoint.sqlite import SqliteSaver

checkpointer = SqliteSaver.from_conn_string("checkpoints.db")
graph = builder.compile(checkpointer=checkpointer)
```
If you're running multiple pods, use a real shared backend instead of local SQLite.
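For a shared backend, the `langgraph-checkpoint-postgres` package provides a Postgres-backed saver. A minimal sketch, assuming that package is installed, reusing the `builder` from the examples above, and with a placeholder connection string you would replace with your own:

```python
from langgraph.checkpoint.postgres import PostgresSaver

# Placeholder URI; every replica must point at the same shared database.
DB_URI = "postgresql://user:pass@db-host:5432/checkpoints"

with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # create the checkpoint tables on first run
    graph = builder.compile(checkpointer=checkpointer)
```

With this in place, any replica can resume a run from the last saved checkpoint instead of restarting it mid-graph.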
### 2) Non-idempotent side effects inside nodes
A node that writes to a database, sends an email, or calls an external API can be retried by your app server or upstream gateway. That creates duplicate writes and intermittent failures that look like LangGraph instability.
```python
def charge_card(state):
    # bad: side effect happens before you know the run is safe
    payment_api.charge(state["amount"])
    return {"status": "charged"}
```
Fix it by making the node idempotent and storing an operation key:
```python
def charge_card(state):
    op_id = state["run_id"]
    if payments.exists(op_id):
        return {"status": "already_charged"}
    payments.charge(idempotency_key=op_id, amount=state["amount"])
    return {"status": "charged"}
```
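The same guard can be sketched end to end with an in-memory store. `PaymentStore` here is a hypothetical stand-in for a real payments API; a production version would enforce the idempotency key with a database unique constraint rather than a dict:

```python
class PaymentStore:
    """Toy idempotency-keyed payment store (stand-in for a real payments API)."""

    def __init__(self):
        self.charges = {}  # op_id -> amount

    def exists(self, op_id: str) -> bool:
        return op_id in self.charges

    def charge(self, idempotency_key: str, amount: int) -> None:
        if idempotency_key in self.charges:
            return  # duplicate delivery: do nothing
        self.charges[idempotency_key] = amount

payments = PaymentStore()

def charge_card(state: dict) -> dict:
    op_id = state["run_id"]
    if payments.exists(op_id):
        return {"status": "already_charged"}
    payments.charge(idempotency_key=op_id, amount=state["amount"])
    return {"status": "charged"}

state = {"run_id": "run-42", "amount": 999}
print(charge_card(state))  # {'status': 'charged'}
print(charge_card(state))  # retried: {'status': 'already_charged'}
print(payments.charges)    # {'run-42': 999} -- charged exactly once
```

However many times a gateway or app server retries the node, the card is charged once per run ID.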
### 3) Concurrency bugs in async nodes
Mixing sync and async code incorrectly can create race conditions or event-loop issues under load.
```python
# bad: blocking sync HTTP call inside an async node
import requests

async def fetch_profile(state):
    # This stalls the event loop for every concurrent run.
    profile = requests.get(f"https://api/users/{state['user_id']}").json()
    return {"profile": profile}
```
Use an async client:
```python
import httpx

async def fetch_profile(state):
    async with httpx.AsyncClient(timeout=10) as client:
        resp = await client.get(f"https://api/users/{state['user_id']}")
        resp.raise_for_status()
        return {"profile": resp.json()}
```
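If you cannot swap the client, one stdlib escape hatch is `asyncio.to_thread`, which runs the blocking call in a worker thread so the event loop keeps serving other runs. A sketch, with a stand-in `fetch_sync` function in place of a real blocking HTTP call:

```python
import asyncio
import time

def fetch_sync(user_id: str) -> dict:
    # Stand-in for a blocking HTTP call (e.g. requests.get(...).json()).
    time.sleep(0.1)
    return {"user_id": user_id, "name": "Ada"}

async def fetch_profile(state: dict) -> dict:
    # Off-load the blocking call to a thread instead of stalling the loop.
    profile = await asyncio.to_thread(fetch_sync, state["user_id"])
    return {"profile": profile}

result = asyncio.run(fetch_profile({"user_id": "u123"}))
print(result)  # {'profile': {'user_id': 'u123', 'name': 'Ada'}}
```

A native async client is still the better fix; `to_thread` just stops one slow call from freezing every concurrent graph run.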
### 4) Graph cycles without a hard stop condition
A conditional edge that never reaches END will work for some inputs and fail for others with recursion-limit errors.
```python
# bad: no termination path for some states
builder.add_conditional_edges(
    "router",
    lambda s: "router" if s["needs_more"] else END,
)
```
Make sure every branch can terminate:
```python
def route(state):
    if state["attempts"] >= 3:
        return END
    return "router"
```
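Assuming the router node increments `attempts` on every pass, a quick stdlib simulation of that routing loop shows the cap always fires well inside the default recursion limit of 25. The `END` sentinel and `run` loop here are illustrative stand-ins, not LangGraph APIs:

```python
END = "__end__"  # stand-in sentinel, not the LangGraph constant

def route(state: dict) -> str:
    # Hard stop: terminates regardless of what needs_more says.
    if state["attempts"] >= 3:
        return END
    return "router"

def run(state: dict, recursion_limit: int = 25) -> dict:
    # Simulate the cycle: route, and if we loop, bump the attempt counter.
    for _ in range(recursion_limit):
        if route(state) == END:
            return state
        state["attempts"] += 1
    raise RuntimeError("recursion limit reached")  # the failure we're preventing

final = run({"attempts": 0, "needs_more": True})
print(final["attempts"])  # 3: the attempt cap terminated the loop
```

The key property is that the stop condition depends on a counter the graph itself advances, not on external data that may never change.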
## How to Debug It

- **Reproduce with concurrency turned up.**
  - Hit the endpoint with 20–50 parallel requests.
  - If failures appear only under load, suspect shared state or non-idempotent side effects.
- **Log the full exception class and LangGraph traceback.**
  - Look for `langgraph.errors.InvalidUpdateError`, `langgraph.errors.GraphRecursionError`, or node-level exceptions.
  - Add logging around each node entry/exit with a request/run ID.
- **Disable all external side effects.**
  - Temporarily stub database writes, HTTP calls, email sending, and cache mutations.
  - If the 500s disappear, the bug is outside the graph engine.
- **Run with a single worker and then scale up.**
  - Compare behavior between `uvicorn app:app --workers 1` and multiple workers / pods behind a load balancer.
  - If only multi-worker fails, your state/checkpoint strategy is wrong.
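The per-node logging from the steps above can be sketched as a plain decorator. This is a hypothetical helper, not a LangGraph API; it wraps a sync node so entry, exit, and failures all carry the run ID, letting you separate interleaved runs in the logs:

```python
import functools
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("graph")

def traced(node):
    """Wrap a node so entry, exit, and exceptions are logged with the run ID."""
    @functools.wraps(node)
    def wrapper(state: dict) -> dict:
        run_id = state.get("run_id", "unknown")
        log.info("run=%s node=%s enter", run_id, node.__name__)
        try:
            result = node(state)
        except Exception:
            log.exception("run=%s node=%s failed", run_id, node.__name__)
            raise
        log.info("run=%s node=%s exit", run_id, node.__name__)
        return result
    return wrapper

@traced
def add_message(state: dict) -> dict:
    return {"messages": state.get("messages", []) + ["hello"]}

out = add_message({"run_id": "r1", "messages": []})
print(out)  # {'messages': ['hello']}
```

When two runs' log lines interleave under load but each line carries its own `run=` tag, cross-run state bleed becomes obvious at a glance.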
## Prevention

- Treat LangGraph state as immutable per run. Return new dicts instead of mutating shared objects.
- Use durable checkpointers when you need cross-request memory or multi-replica deployments.
- Make every side effect idempotent with a run ID or idempotency key.
If you’re seeing intermittent 500s only after scaling LangGraph in Python, start by removing shared mutable state and then verify checkpointing. In production systems, that fixes most of these incidents faster than anything else.
## Keep learning

- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit, a PDF checklist + starter code
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.