How to Fix 'intermittent 500 errors in production' in LangGraph (Python)
Intermittent 500 errors in LangGraph usually mean your graph is failing at runtime, not at compile time. In practice, this shows up when a node raises an exception only for certain inputs, or when state mutation, tool calls, or async execution behaves differently under load.
The hard part is that the same graph can work 99 times and fail on the 100th request. That’s why these bugs often show up only in production logs as Internal Server Error or langgraph.errors.GraphExecutionError.
The Most Common Cause
The #1 cause is node code that assumes state fields always exist, so it fails only on the inputs where they don't.
In LangGraph, every node should treat state as the source of truth and return a valid partial update. If you mutate nested objects in place, access missing keys directly, or depend on side effects from previous nodes, you’ll get intermittent failures depending on the input path.
Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Mutates state in place | Returns a new partial state |
| Assumes keys always exist | Uses .get() / validation |
| Raises KeyError / TypeError on some inputs | Handles empty or partial state safely |
# BROKEN
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    messages: list
    user_profile: dict

def enrich_profile(state: State):
    # Fails intermittently if user_profile is missing fields
    state["user_profile"]["tier"] = state["user_profile"]["tier"].upper()
    return state

graph = StateGraph(State)
graph.add_node("enrich_profile", enrich_profile)
graph.set_entry_point("enrich_profile")
graph.add_edge("enrich_profile", END)
app = graph.compile()
# FIXED
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict, total=False):
    messages: list
    user_profile: dict

def enrich_profile(state: State):
    # Tolerate a missing or None profile instead of indexing into it
    profile = state.get("user_profile") or {}
    tier = profile.get("tier", "unknown")
    # Return a new partial update rather than mutating state in place
    return {
        "user_profile": {
            **profile,
            "tier": tier.upper() if isinstance(tier, str) else "UNKNOWN",
        }
    }

graph = StateGraph(State)
graph.add_node("enrich_profile", enrich_profile)
graph.set_entry_point("enrich_profile")
graph.add_edge("enrich_profile", END)
app = graph.compile()
If you’re seeing logs like:
- KeyError: 'user_profile'
- TypeError: 'NoneType' object is not subscriptable
- langgraph.errors.GraphExecutionError: Error in node 'enrich_profile'
this is usually the culprit.
Other Possible Causes
1) Tool exceptions bubbling out of agent nodes
If a tool call fails and you don’t catch it, the whole graph fails with a 500.
def call_crm_tool(state):
    result = crm_client.lookup_customer(state["customer_id"])
    return {"crm_data": result}
Fix it by wrapping tool execution and returning an error field instead of crashing the node.
def call_crm_tool(state):
    try:
        result = crm_client.lookup_customer(state["customer_id"])
        return {"crm_data": result}
    except Exception as e:
        return {"tool_error": f"CRM lookup failed: {e}"}
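Returning a tool_error field only helps if a later step actually reads it. A minimal sketch of a router that sends failed lookups to a fallback branch follows; the labels "crm_fallback" and "continue" are hypothetical and would need matching entries in your conditional edges.

```python
def route_after_crm(state: dict) -> str:
    """Pick the next branch based on whether the CRM node reported a failure."""
    # "crm_fallback" and "continue" are hypothetical labels for illustration.
    if state.get("tool_error"):
        return "crm_fallback"
    return "continue"


ok_state = {"crm_data": {"name": "Ada"}}
failed_state = {"tool_error": "CRM lookup failed: timeout"}

print(route_after_crm(ok_state))
print(route_after_crm(failed_state))
```

This keeps the failure visible in state, so the graph can degrade gracefully instead of surfacing a 500.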
2) Async/sync mismatch inside nodes
A common production issue is calling async code from sync nodes, or vice versa. This often surfaces as:
- RuntimeError: This event loop is already running
- TypeError: object dict can't be used in 'await' expression (awaiting the result of a sync function)
- RuntimeWarning: coroutine 'async_fetch_policy' was never awaited
# BROKEN
def fetch_policy(state):
    data = async_fetch_policy(state["policy_id"])  # coroutine not awaited
    return {"policy": data}

# FIXED
async def fetch_policy(state):
    data = await async_fetch_policy(state["policy_id"])
    return {"policy": data}
Make sure the node type matches how you execute the graph: if any node is async, run the graph with app.ainvoke() or app.astream() rather than app.invoke().
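To see why the missing await fails quietly rather than loudly, here is a small sketch using only asyncio, with no LangGraph involved. async_fetch_policy is a hypothetical stub standing in for a real async client.

```python
import asyncio
import inspect


async def async_fetch_policy(policy_id: str) -> dict:
    # Hypothetical stand-in for a real async client call.
    await asyncio.sleep(0)
    return {"id": policy_id, "status": "active"}


def fetch_policy_broken(state: dict) -> dict:
    # Missing await: this stores a coroutine object, not the policy data.
    return {"policy": async_fetch_policy(state["policy_id"])}


async def fetch_policy(state: dict) -> dict:
    return {"policy": await async_fetch_policy(state["policy_id"])}


broken = fetch_policy_broken({"policy_id": "P-1"})
broken_is_coroutine = inspect.iscoroutine(broken["policy"])
broken["policy"].close()  # avoid a "never awaited" warning in this demo

fixed = asyncio.run(fetch_policy({"policy_id": "P-1"}))
print(broken_is_coroutine, fixed)
```

The broken node does not raise by itself. The coroutine object lands in state, and the failure appears later, in whichever node first tries to read the policy as a dict.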
3) Invalid conditional routing
If your router returns a label that doesn’t match any edge, execution can fail depending on the branch taken.
def route(state):
    if state.get("risk_score", 0) > 80:
        return "high_risk"
    return "unknown_route"  # no edge for this
Fix by keeping route labels aligned with your graph edges.
def route(state):
    if state.get("risk_score", 0) > 80:
        return "high_risk"
    return "low_risk"
Also verify your conditional edges:
graph.add_conditional_edges(
    "router",
    route,
    {
        "high_risk": "manual_review",
        "low_risk": "auto_approve",
    },
)
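One way to keep labels and edges from drifting apart is to define the mapping once and check the router's output against it. The sketch below assumes a shared ROUTES dict and a checked_route wrapper; both names are made up for illustration.

```python
ROUTES = {
    "high_risk": "manual_review",
    "low_risk": "auto_approve",
}


def route(state: dict) -> str:
    if state.get("risk_score", 0) > 80:
        return "high_risk"
    return "low_risk"


def checked_route(state: dict) -> str:
    """Fail with a descriptive error if the router returns an unmapped label."""
    label = route(state)
    if label not in ROUTES:
        raise ValueError(
            f"route() returned {label!r}; expected one of {sorted(ROUTES)}"
        )
    return label


print(checked_route({"risk_score": 90}))
print(checked_route({}))
```

Passing the same ROUTES dict as the mapping argument to add_conditional_edges means there is only one place where labels are defined.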
4) Shared mutable globals across requests
This one causes classic “works locally, fails under load” behavior.
CACHE = {}

def node(state):
    CACHE["last_customer"] = state["customer_id"]
    # race condition across requests
Use request-scoped state or external storage with proper locking.
def node(state):
    customer_id = state["customer_id"]
    return {"customer_id": customer_id}
How to Debug It
- Capture the full stack trace
  - Don’t stop at 500 Internal Server Error.
  - Look for the real exception type:
    - KeyError
    - TypeError
    - ValueError
    - langgraph.errors.GraphExecutionError
  - The failing node name is usually in the traceback.
- Run the exact failing input locally
  - Copy one production payload into a test.
  - Call app.invoke(payload) directly.
  - If it only fails sometimes, compare successful and failing inputs field by field.
- Add per-node logging
  - Log input shape before each risky operation.
  - Log route decisions and tool arguments.
  - Example:
    def enrich_profile(state):
        print("enrich_profile input:", state)
        ...
- Isolate nodes one by one
  - Comment out tools first.
  - Then remove conditional edges.
  - Then replace async nodes with synchronous stubs.
  - The first step that stops the error usually tells you where the bug lives.
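For per-node logging, a decorator saves you from scattering print calls through every node. This sketch assumes synchronous nodes and uses the standard logging module; logged_node and fragile_node are hypothetical names.

```python
import functools
import logging

logger = logging.getLogger("graph.nodes")


def logged_node(fn):
    """Log the state keys a node receives and any exception it raises."""

    @functools.wraps(fn)
    def wrapper(state):
        logger.info("%s input keys: %s", fn.__name__, sorted(state))
        try:
            return fn(state)
        except Exception:
            # logger.exception records the full traceback before re-raising.
            logger.exception("%s failed", fn.__name__)
            raise

    return wrapper


@logged_node
def enrich_profile(state):
    profile = state.get("user_profile") or {}
    tier = str(profile.get("tier", "unknown")).upper()
    return {"user_profile": {**profile, "tier": tier}}


@logged_node
def fragile_node(state):
    # Deliberately unsafe: raises KeyError when user_profile is missing.
    return {"tier": state["user_profile"]["tier"]}
```

Logging only the keys, not the full state, shows the input shape without writing customer data to your logs.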
Prevention
- Treat every node as a pure function over input state.
- Validate required fields at graph boundaries instead of deep inside business logic.
- Wrap external calls in retry/error-handling logic and return structured failures instead of raising immediately.
- Add tests for:
  - missing keys
  - empty lists
  - tool failures
  - invalid route labels
  - concurrent requests
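Boundary validation can be sketched as one function that runs before the graph is invoked. The schema in REQUIRED_FIELDS and the validate_payload name are assumptions for illustration, not part of LangGraph.

```python
REQUIRED_FIELDS = {"customer_id": str, "messages": list}  # hypothetical schema


def validate_payload(payload: dict) -> dict:
    """Return a normalized initial state or raise ValueError naming every problem."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected):
            problems.append(f"{name} must be {expected.__name__}")
    if problems:
        raise ValueError("; ".join(problems))
    # Optional fields get explicit defaults so nodes never see a missing key.
    return {**payload, "user_profile": payload.get("user_profile") or {}}


state = validate_payload({"customer_id": "c-1", "messages": []})
print(state)
```

Calling this before app.invoke(...) lets the API layer turn a ValueError into a 400 response, so malformed input never reaches a node and never becomes a 500.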
If you’re building LangGraph workflows for production systems like claims triage or KYC review, assume every unvalidated branch will eventually be hit. The fix is not “add more retries” everywhere. It’s making each node deterministic, defensive, and explicit about failure modes.
By Cyprian Aarons, AI Consultant at Topiax.