How to Fix 'intermittent 500 errors' in LangGraph (Python)
When LangGraph starts throwing intermittent 500 errors, it usually means your graph is failing inside a node, but the failure only shows up under certain inputs or execution paths. In practice, this is often caused by state mutation, non-serializable objects in state, or nodes that sometimes return malformed updates.
The annoying part is that the same graph can work 9 times and fail on the 10th. That usually points to a hidden branch, a race with external I/O, or a bad state shape that only appears for specific messages.
The Most Common Cause
The #1 cause I see is mutating shared state or returning an invalid partial update from a node.
LangGraph expects each node to return a dict that matches your state schema. If you mutate nested structures in place, or return something like None, a string, or a list instead of a state update dict, you can get errors like:
- InvalidUpdateError: Expected dict, got ...
- KeyError inside downstream nodes
- HTTP 500 from your FastAPI wrapper when the graph exception bubbles up
Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Mutates state in place | Returns a new update dict |
| Sometimes returns None | Always returns valid state keys |
| Stores raw client objects in state | Stores only serializable data |
# BROKEN
from langgraph.graph import StateGraph, END
from typing import TypedDict, List

class ChatState(TypedDict):
    messages: List[dict]

def add_message(state: ChatState):
    # In-place mutation is risky
    state["messages"].append({"role": "assistant", "content": "ok"})

    # Sometimes this branch returns nothing -> InvalidUpdateError
    if len(state["messages"]) % 2 == 0:
        return {"messages": state["messages"]}
    return None

graph = StateGraph(ChatState)
graph.add_node("add_message", add_message)
# FIXED
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

class ChatState(TypedDict):
    messages: Annotated[list[dict], operator.add]

def add_message(state: ChatState):
    # Return only the delta; do not mutate input state
    return {"messages": [{"role": "assistant", "content": "ok"}]}

graph = StateGraph(ChatState)
graph.add_node("add_message", add_message)
If you are using reducers like operator.add, LangGraph can merge updates safely. If you are not using reducers, still return a fresh dict and avoid touching the input object.
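To make the difference between "return a delta" and "mutate in place" concrete, here is a minimal stdlib-only sketch. `apply_update` is a hypothetical helper written for this illustration; it is not LangGraph's internal merge logic, only a simplified model of the idea that a key with a reducer gets `reducer(old, new)` while a key without one is overwritten.

```python
import operator
from typing import Any, Callable

# Hypothetical helper: a simplified model of reducer-based merging.
# This is NOT LangGraph's implementation, only an illustration of the idea.
def apply_update(
    state: dict[str, Any],
    update: dict[str, Any],
    reducers: dict[str, Callable[[Any, Any], Any]],
) -> dict[str, Any]:
    merged = dict(state)  # shallow copy; the caller's dict is left alone
    for key, value in update.items():
        if key in reducers and key in merged:
            merged[key] = reducers[key](merged[key], value)
        else:
            merged[key] = value
    return merged

def add_message(state: dict[str, Any]) -> dict[str, Any]:
    # Return only the delta, as in the fixed pattern above.
    return {"messages": [{"role": "assistant", "content": "ok"}]}

state = {"messages": [{"role": "user", "content": "hi"}]}
new_state = apply_update(state, add_message(state), {"messages": operator.add})

print(len(state["messages"]))      # 1 (input untouched)
print(len(new_state["messages"]))  # 2 (delta appended by the reducer)
```

Because `operator.add` on two lists builds a new list, the original `state` is never modified, which is the property the fixed pattern relies on.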
Other Possible Causes
1) Non-serializable objects in state
If you put things like DB sessions, HTTP clients, file handles, or Pydantic model instances with custom fields into graph state, one request may pass and another may fail when the runtime tries to copy or serialize state.
# Bad: storing live objects in LangGraph state
state["db"] = session
state["client"] = httpx.AsyncClient()
Use IDs or plain JSON-safe values instead:
# Good: store references, not live objects
state["user_id"] = user.id
state["request_id"] = request_id
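One way to catch live objects before they reach the graph is a quick serialization check in tests or at the entry point. The sketch below uses JSON encoding as a conservative proxy; the serializer your checkpointer actually uses may accept more types, so treat this as an assumption. `find_non_json_keys` and `FakeSession` are hypothetical names used only for illustration.

```python
import json
from typing import Any

def find_non_json_keys(state: dict[str, Any]) -> list[str]:
    """Return the top-level keys whose values cannot be JSON-encoded."""
    bad = []
    for key, value in state.items():
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            bad.append(key)
    return bad

class FakeSession:
    """Stand-in for a live DB session or HTTP client."""

state = {"user_id": 42, "request_id": "req-1", "db": FakeSession()}
print(find_non_json_keys(state))  # ['db']
```

Running a check like this against a few representative states in a unit test tends to surface the offending key faster than waiting for a failure in production.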
2) A node raises an exception intermittently during external I/O
This is common when calling LLMs, vector stores, or internal APIs. A timeout becomes a 500 if you do not catch and classify it.
import requests

def fetch_context(state):
    # Any timeout or non-2xx response raises here and fails the whole run
    resp = requests.get("https://internal-api/context", timeout=2)
    resp.raise_for_status()
    return {"context": resp.json()}
Catch and classify the failure explicitly (retries can be layered on top):
import requests
from requests.exceptions import Timeout, HTTPError

def fetch_context(state):
    try:
        resp = requests.get("https://internal-api/context", timeout=2)
        resp.raise_for_status()
        return {"context": resp.json()}
    except Timeout:
        # Degrade gracefully: continue with empty context
        return {"context": []}
    except HTTPError as e:
        # "error" must be a key in your state schema
        return {"context": [], "error": f"context_fetch_failed:{e.response.status_code}"}
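The handler above classifies failures but does not retry them. A minimal retry wrapper with exponential backoff might look like the sketch below. `with_retries` is a hypothetical helper, not part of LangGraph or requests, and the attempt count and delays are arbitrary assumptions. The demo uses the built-in `TimeoutError` so it runs without network access.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(
    fn: Callable[[], T],
    retry_on: tuple[type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 0.2,
) -> T:
    """Call fn(), retrying on the given exception types with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller classify the failure
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")

# Demo: a call that times out twice, then succeeds.
calls = {"n": 0}

def flaky() -> dict:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return {"context": ["doc"]}

result = with_retries(flaky, retry_on=(TimeoutError,), base_delay=0.01)
print(result, calls["n"])  # {'context': ['doc']} 3
```

Inside `fetch_context`, the same wrapper could be applied to the `requests.get(...)` call with `retry_on=(Timeout,)`, keeping the existing `except` branches as the final fallback once retries are exhausted.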
3) Conditional edges point to nodes that expect missing keys
A branch may route to a node that assumes state["documents"] exists. On some paths it does not.
def summarize(state):
    # Fails if documents missing
    text = "\n".join(doc["text"] for doc in state["documents"])
    return {"summary": llm.invoke(text)}
Guard the input:
def summarize(state):
    documents = state.get("documents", [])
    if not documents:
        return {"summary": "No documents provided."}

    text = "\n".join(doc["text"] for doc in documents)
    return {"summary": llm.invoke(text)}
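An alternative to guarding inside the node is to guard at the routing step, so the node is never reached with missing input. The sketch below assumes a routing function passed to `add_conditional_edges`; the function name `route_after_retrieval` and the node names `retrieve`, `summarize`, and `no_documents` are hypothetical.

```python
from typing import Any

def route_after_retrieval(state: dict[str, Any]) -> str:
    """Pick the next node name based on whether usable documents exist."""
    documents = state.get("documents") or []
    has_text = any(isinstance(doc, dict) and doc.get("text") for doc in documents)
    return "summarize" if has_text else "no_documents"

# Wiring (assumed node names), shown as a comment to keep this block standalone:
# graph.add_conditional_edges(
#     "retrieve",
#     route_after_retrieval,
#     {"summarize": "summarize", "no_documents": "no_documents"},
# )

print(route_after_retrieval({"documents": [{"text": "hello"}]}))  # summarize
print(route_after_retrieval({"documents": []}))                   # no_documents
print(route_after_retrieval({}))                                  # no_documents
```

Keeping the guard in the node as well is still reasonable, since a later change to the graph could add another path into it.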
4) Async/sync mismatch in node functions
If you call blocking code inside an async node, that call stalls the event loop for every other request sharing it. Under load this surfaces as timeouts and intermittent failures.
import requests

async def enrich(state):
    # Blocking call inside async function
    result = requests.get("https://api.example.com/data")
    return {"data": result.json()}
Use an async client:
import httpx

async def enrich(state):
    async with httpx.AsyncClient(timeout=5) as client:
        result = await client.get("https://api.example.com/data")
        result.raise_for_status()
        return {"data": result.json()}
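If no async client exists for a dependency, one option is to push the blocking call onto a worker thread with `asyncio.to_thread` (Python 3.9+). The sketch below uses `time.sleep` as a stand-in for the blocking request; `blocking_fetch` is a hypothetical placeholder.

```python
import asyncio
import time

def blocking_fetch() -> dict:
    """Stand-in for a blocking call such as requests.get(...).json()."""
    time.sleep(0.2)
    return {"value": 1}

async def enrich(state: dict) -> dict:
    # Run the blocking function in a worker thread so the event loop stays free.
    data = await asyncio.to_thread(blocking_fetch)
    return {"data": data}

async def main() -> list:
    # The two calls run in separate threads and overlap instead of queuing.
    return await asyncio.gather(enrich({}), enrich({}))

results = asyncio.run(main())
print(results)  # [{'data': {'value': 1}}, {'data': {'value': 1}}]
```

This keeps the event loop responsive, but the work still occupies a thread for the duration of the call, so a native async client remains the better choice when one is available.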
How to Debug It
- Run the node functions outside LangGraph first
  - Call each node with a sample state dict.
  - If it fails standalone, the bug is in your business logic, not LangGraph.
- Log the exact input and output of every node
  - Print or structured-log:
    - incoming state keys
    - returned update dict
    - exceptions with stack traces
  - You want to catch InvalidUpdateError and any hidden KeyError before they become HTTP 500s.
- Check for invalid updates
  - Every node should return either:
    - a dict with valid keys from your schema
    - or raise a controlled exception you handle upstream
  - If you see Expected dict, got NoneType, inspect every branch in that node.
- Reduce concurrency and remove external calls
  - Disable parallel branches temporarily.
  - Stub LLM calls and API requests.
  - If the error disappears, the issue is likely race-related or external I/O related.
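The first three steps can be combined into a small decorator that logs each node's input keys and returned update, and fails loudly on a non-dict return. This is a sketch for synchronous nodes only; `logged_node` and the logger name are hypothetical, and the strict dict check is this sketch's own rule rather than LangGraph's exact validation.

```python
import functools
import logging
from typing import Any, Callable

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("graph.nodes")

NodeFn = Callable[[dict[str, Any]], Any]

def logged_node(fn: NodeFn) -> NodeFn:
    """Log input keys and the returned update; reject non-dict returns."""
    @functools.wraps(fn)
    def wrapper(state: dict[str, Any]) -> dict[str, Any]:
        logger.info("%s input keys=%s", fn.__name__, sorted(state))
        try:
            update = fn(state)
        except Exception:
            logger.exception("%s raised", fn.__name__)  # includes the stack trace
            raise
        if not isinstance(update, dict):
            raise TypeError(
                f"{fn.__name__} returned {type(update).__name__}, expected dict"
            )
        logger.info("%s update keys=%s", fn.__name__, sorted(update))
        return update
    return wrapper

@logged_node
def add_message(state: dict[str, Any]) -> dict[str, Any]:
    return {"messages": [{"role": "assistant", "content": "ok"}]}

@logged_node
def broken_node(state: dict[str, Any]):
    return None  # simulates a branch that forgets to return an update

# Call nodes directly with a sample state, outside any graph.
print(add_message({"messages": []}))
try:
    broken_node({"messages": []})
except TypeError as exc:
    print(exc)  # broken_node returned NoneType, expected dict
```

Since the wrapped function is still a plain callable taking the state, it can be registered with `graph.add_node` in place of the original while you debug.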
Prevention
- Keep LangGraph state small and JSON-safe.
  - Store strings, numbers, lists, dicts.
  - Do not store sessions, clients, locks, or open file handles.
- Make every node deterministic on bad input.
  - Use .get() defaults.
  - Return safe fallbacks instead of assuming keys exist.
- Add validation at graph boundaries.
  - Validate incoming request payloads before invoking the graph.
  - Reject malformed inputs early instead of letting them turn into intermittent runtime failures.
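Boundary validation can be as simple as a function that checks the payload shape and builds a clean initial state before the graph is invoked. The sketch below assumes the `messages` state shape used earlier in this article; `validate_payload`, `InvalidPayload`, and the allowed role set are hypothetical choices for illustration.

```python
from typing import Any

class InvalidPayload(ValueError):
    """Raised when a request payload cannot be turned into graph input."""

ALLOWED_ROLES = {"user", "assistant", "system"}  # assumed role set

def validate_payload(payload: Any) -> dict[str, Any]:
    """Check the payload shape and return a clean initial state dict."""
    if not isinstance(payload, dict):
        raise InvalidPayload("payload must be a JSON object")
    messages = payload.get("messages")
    if not isinstance(messages, list) or not messages:
        raise InvalidPayload("'messages' must be a non-empty list")
    clean = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            raise InvalidPayload(f"messages[{i}] must be an object")
        role, content = msg.get("role"), msg.get("content")
        role_ok = isinstance(role, str) and role in ALLOWED_ROLES
        if not role_ok or not isinstance(content, str):
            raise InvalidPayload(f"messages[{i}] needs a valid role and string content")
        # Copy only the known fields so stray keys never enter graph state.
        clean.append({"role": role, "content": content})
    return {"messages": clean}

print(validate_payload({"messages": [{"role": "user", "content": "hi"}]}))
try:
    validate_payload({"messages": "hi"})
except InvalidPayload as exc:
    print(exc)  # 'messages' must be a non-empty list
```

In a web handler, catching `InvalidPayload` and returning a 4xx response keeps malformed requests from ever surfacing as a 500 from inside the graph.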
If you are seeing 500 Internal Server Error from a LangGraph app wrapped in FastAPI or another web framework, treat it as a symptom. The real bug is almost always inside one node’s update contract, input assumptions, or external dependency handling.
By Cyprian Aarons, AI Consultant at Topiax.