How to Fix 'intermittent 500 errors' in LangGraph (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: intermittent-500-errors, langgraph, python

When LangGraph starts throwing intermittent 500 errors, it usually means your graph is failing inside a node, but the failure only shows up under certain inputs or execution paths. In practice, this is often caused by state mutation, non-serializable objects in state, or nodes that sometimes return malformed updates.

The annoying part is that the same graph can work 9 times and fail on the 10th. That usually points to a hidden branch, a race with external I/O, or a bad state shape that only appears for specific messages.

The Most Common Cause

The #1 cause I see is mutating shared state or returning an invalid partial update from a node.

LangGraph expects each node to return a dict that matches your state schema. If you mutate nested structures in place, or return something like None, a string, or a list instead of a state update dict, you can get errors like:

  • InvalidUpdateError: Expected dict, got ...
  • KeyError inside downstream nodes
  • HTTP 500 from your FastAPI wrapper when the graph exception bubbles up

Broken vs fixed pattern

| Broken pattern | Fixed pattern |
| --- | --- |
| Mutates state in place | Returns a new update dict |
| Sometimes returns None | Always returns valid state keys |
| Stores raw client objects in state | Stores only serializable data |
# BROKEN
from langgraph.graph import StateGraph, END
from typing import TypedDict, List

class ChatState(TypedDict):
    messages: List[dict]

def add_message(state: ChatState):
    # In-place mutation is risky
    state["messages"].append({"role": "assistant", "content": "ok"})
    # Sometimes this branch returns nothing -> InvalidUpdateError
    if len(state["messages"]) % 2 == 0:
        return {"messages": state["messages"]}
    return None

graph = StateGraph(ChatState)
graph.add_node("add_message", add_message)

# FIXED
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

class ChatState(TypedDict):
    messages: Annotated[list[dict], operator.add]

def add_message(state: ChatState):
    # Return only the delta; do not mutate input state
    return {"messages": [{"role": "assistant", "content": "ok"}]}

graph = StateGraph(ChatState)
graph.add_node("add_message", add_message)

If you are using reducers like operator.add, LangGraph can merge updates safely. If you are not using reducers, still return a fresh dict and avoid touching the input object.
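To see why the reducer version is safer, here is the merge that an `operator.add` reducer performs on the `messages` channel, sketched in plain Python outside LangGraph:

```python
import operator

# Each node returns only its delta; the reducer appends it to the
# existing channel value instead of replacing or mutating it.
existing = [{"role": "user", "content": "hi"}]
update = [{"role": "assistant", "content": "ok"}]

merged = operator.add(existing, update)  # equivalent to existing + update

# The original list is untouched, so no node ever sees half-mutated state.
assert existing == [{"role": "user", "content": "hi"}]
```

Because concatenation builds a new list, concurrent branches can each contribute a delta without stepping on one another.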

Other Possible Causes

1) Non-serializable objects in state

If you put things like DB sessions, HTTP clients, file handles, or Pydantic model instances with custom fields into graph state, one request may pass and another may fail when the runtime tries to copy or serialize state.

# Bad: storing live objects in LangGraph state
state["db"] = session
state["client"] = httpx.AsyncClient()

Use IDs or plain JSON-safe values instead:

# Good: store references, not live objects
state["user_id"] = user.id
state["request_id"] = request_id
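One way to keep live objects out of state is a process-level registry keyed by ID; the node resolves the object per call. This is a sketch — `CLIENTS`, `register_client`, and `get_client` are illustrative helpers, not LangGraph APIs:

```python
# Live clients stay in process memory; only the ID goes into graph state.
CLIENTS = {}

def register_client(client_id, client):
    CLIENTS[client_id] = client

def get_client(client_id):
    return CLIENTS[client_id]

def enrich_node(state):
    # Resolve the live client on every call instead of carrying it in state
    client = get_client(state["client_id"])
    return {"profile": client.fetch(state["user_id"])}

class DummyClient:  # stand-in for a real HTTP/DB client
    def fetch(self, user_id):
        return {"user": user_id}

register_client("primary", DummyClient())
```

State now contains only the strings `"client_id"` and `"user_id"`, so it copies and serializes cleanly on every path.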

2) A node raises an exception intermittently during external I/O

This is common when calling LLMs, vector stores, or internal APIs. A timeout becomes a 500 if you do not catch and classify it.

def fetch_context(state):
    resp = requests.get("https://internal-api/context", timeout=2)
    resp.raise_for_status()
    return {"context": resp.json()}

Add explicit handling, and retries where the call is safe to repeat:

from requests.exceptions import Timeout, HTTPError

def fetch_context(state):
    try:
        resp = requests.get("https://internal-api/context", timeout=2)
        resp.raise_for_status()
        return {"context": resp.json()}
    except Timeout:
        return {"context": []}
    except HTTPError as e:
        return {"context": [], "error": f"context_fetch_failed:{e.response.status_code}"}
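If the call is idempotent, a small retry wrapper cuts transient failures further. This is a generic sketch, not a LangGraph feature:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5, retry_on=(Exception,)):
    # Call fn up to `attempts` times with exponential backoff,
    # re-raising the last error only after every attempt fails.
    last_exc = None
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on as exc:
            last_exc = exc
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    raise last_exc
```

A node would then wrap its I/O, e.g. `with_retries(lambda: requests.get(url, timeout=2), retry_on=(Timeout,))`, and hit the fallback branch only after retries are exhausted.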

3) Conditional edges point to nodes that expect missing keys

A branch may route to a node that assumes state["documents"] exists. On some paths it does not.

def summarize(state):
    # Fails if documents missing
    text = "\n".join(doc["text"] for doc in state["documents"])
    return {"summary": llm.invoke(text)}

Guard the input:

def summarize(state):
    documents = state.get("documents", [])
    if not documents:
        return {"summary": "No documents provided."}
    text = "\n".join(doc["text"] for doc in documents)
    return {"summary": llm.invoke(text)}
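The same guard belongs in the routing function itself, so the graph never routes to the summarizer without documents. The node names `summarize` and `no_docs` here are illustrative:

```python
def route_after_retrieval(state):
    # Routing function for add_conditional_edges: choose the next node
    # based on what is actually present in state, not what "should" be.
    if state.get("documents"):
        return "summarize"
    return "no_docs"
```

With the decision made at the edge, the `summarize` node can keep a simpler contract because it is only reached when its input exists.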

4) Async/sync mismatch in node functions

If you call blocking code inside an async node, you block the event loop. Under load, that stalls every other in-flight request, and the resulting timeouts surface as intermittent failures.

async def enrich(state):
    # Blocking call inside async function
    result = requests.get("https://api.example.com/data")
    return {"data": result.json()}

Use an async client:

import httpx

async def enrich(state):
    async with httpx.AsyncClient(timeout=5) as client:
        result = await client.get("https://api.example.com/data")
        result.raise_for_status()
        return {"data": result.json()}
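If you must keep a synchronous client, offload it to a worker thread so the event loop stays free. `asyncio.to_thread` is standard library; `blocking_fetch` here stands in for the real call:

```python
import asyncio

def blocking_fetch():
    # Stand-in for a blocking call such as requests.get(...)
    return {"value": 1}

async def enrich(state):
    # The blocking call runs in a thread; the event loop keeps
    # serving other requests while it waits.
    result = await asyncio.to_thread(blocking_fetch)
    return {"data": result}
```

This keeps the node's signature async without forcing an immediate migration of every synchronous dependency.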

How to Debug It

  1. Run the node functions outside LangGraph first

    • Call each node with a sample state dict.
    • If it fails standalone, the bug is in your business logic, not LangGraph.
  2. Log the exact input and output of every node

    • Print or structured-log:
      • incoming state keys
      • returned update dict
      • exceptions with stack traces
    • You want to catch InvalidUpdateError and any hidden KeyError before they become HTTP 500s.
  3. Check for invalid updates

    • Every node should return either:
      • a dict with valid keys from your schema
      • or raise a controlled exception you handle upstream
    • If you see Expected dict, got NoneType, inspect every branch in that node.
  4. Reduce concurrency and remove external calls

    • Disable parallel branches temporarily.
    • Stub LLM calls and API requests.
    • If the error disappears, the issue is likely race-related or external I/O related.
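Steps 2 and 3 can be automated with a small wrapper applied to each node before `add_node`. This is a sketch, not a built-in LangGraph facility:

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("graph.nodes")

def logged(node_fn):
    # Log each node's input keys and returned update, and flag
    # non-dict returns before they become InvalidUpdateError.
    @functools.wraps(node_fn)
    def wrapper(state):
        log.info("%s <- keys=%s", node_fn.__name__, sorted(state))
        try:
            update = node_fn(state)
        except Exception:
            log.exception("%s raised", node_fn.__name__)
            raise
        if not isinstance(update, dict):
            log.error("%s returned %r (expected dict)", node_fn.__name__, update)
        log.info("%s -> %s", node_fn.__name__, update)
        return update
    return wrapper

@logged
def add_message(state):
    return {"messages": [{"role": "assistant", "content": "ok"}]}
```

`graph.add_node("add_message", add_message)` works unchanged, since `functools.wraps` preserves the node's name.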

Prevention

  • Keep LangGraph state small and JSON-safe.

    • Store strings, numbers, lists, dicts.
    • Do not store sessions, clients, locks, or open file handles.
  • Make every node deterministic on bad input.

    • Use .get() defaults.
    • Return safe fallbacks instead of assuming keys exist.
  • Add validation at graph boundaries.

    • Validate incoming request payloads before invoking the graph.
    • Reject malformed inputs early instead of letting them turn into intermittent runtime failures.
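The boundary check can be plain Python (or a Pydantic model); this sketch assumes a messages-style payload:

```python
def validate_payload(payload):
    # Reject malformed requests before they reach the graph, so bad
    # input becomes a clean 400 instead of an intermittent 500.
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    messages = payload.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("messages must be a non-empty list")
    for msg in messages:
        if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
            raise ValueError("each message needs 'role' and 'content'")
    return {"messages": messages}
```

A FastAPI handler would call this first, return a 400 on `ValueError`, and invoke the graph only with a payload it has already accepted.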

If you are seeing 500 Internal Server Error from a LangGraph app wrapped in FastAPI or another web framework, treat it as a symptom. The real bug is almost always inside one node’s update contract, input assumptions, or external dependency handling.


By Cyprian Aarons, AI Consultant at Topiax.