How to Fix 'deployment crash when scaling' in LangGraph (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: deployment-crash-when-scaling, langgraph, python

When a LangGraph app crashes only after you scale it, the problem is usually not the graph logic itself. It’s almost always a state, serialization, or runtime mismatch that only shows up once you run multiple workers, multiple replicas, or async execution under load.

The common symptom is something like:

  • RuntimeError: Event loop is closed
  • TypeError: Object of type ChatOpenAI is not JSON serializable
  • langgraph.errors.InvalidUpdateError: Expected dict, got ...

You’ll usually see it when moving from local single-process runs to Docker, Kubernetes, Gunicorn/Uvicorn workers, or serverless deployments.

The Most Common Cause

The #1 cause is putting non-serializable objects into graph state.

LangGraph state should contain plain data: strings, numbers, lists, dicts. If you store things like ChatOpenAI, database clients, open file handles, or compiled chains in state, it may work locally and then fail when the app scales and tries to checkpoint, copy, or serialize state between workers.

Broken vs fixed pattern

  • Broken: store live Python objects in state. Fixed: store only serializable config/data.
  • Broken: instantiate clients inside node logic and write them to state. Fixed: pass clients via closures/dependency injection.
  • Broken: let checkpoints try to serialize everything. Fixed: keep runtime objects outside graph state.

# BROKEN
from typing import TypedDict
from langgraph.graph import StateGraph
from langchain_openai import ChatOpenAI

class State(TypedDict):
    user_input: str
    result: str
    llm: object  # bad: non-serializable runtime object lives in state

def call_model(state: State):
    model = state["llm"]
    response = model.invoke(state["user_input"])
    return {"result": response.content}

builder = StateGraph(State)
builder.add_node("call_model", call_model)
builder.set_entry_point("call_model")
graph = builder.compile()

# the live client object enters graph state here
graph.invoke({"user_input": "hi", "llm": ChatOpenAI(model="gpt-4o-mini")})

This tends to blow up under scaling with errors like:

  • TypeError: Object of type ChatOpenAI is not JSON serializable
  • TypeError: cannot pickle '_thread.RLock' object
  • checkpointing failures when using Redis/Postgres backends

The fix is to keep the runtime object outside graph state:

# FIXED
from typing import TypedDict
from langgraph.graph import StateGraph
from langchain_openai import ChatOpenAI

class State(TypedDict):
    user_input: str
    result: str

llm = ChatOpenAI(model="gpt-4o-mini")  # keep outside state

def call_model(state: State):
    response = llm.invoke(state["user_input"])
    return {"result": response.content}

builder = StateGraph(State)
builder.add_node("call_model", call_model)
builder.set_entry_point("call_model")
graph = builder.compile()

If you need per-request dependencies, pass them through a factory function or closure. Don’t stash them in the graph state.
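
For example, a factory/closure sketch built on the fixed example above (make_call_model is an illustrative name, not a LangGraph API):

# GOOD: inject the client via a closure instead of state
def make_call_model(model):
    def call_model(state: State):
        response = model.invoke(state["user_input"])
        return {"result": response.content}
    return call_model

builder = StateGraph(State)
builder.add_node("call_model", make_call_model(ChatOpenAI(model="gpt-4o-mini")))
builder.set_entry_point("call_model")
graph = builder.compile()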

Other Possible Causes

1) Shared mutable globals across workers

If your nodes mutate module-level variables, one worker can corrupt another. This often appears as inconsistent output or crashes under concurrency.

# BAD
cache = []

def node(state):
    cache.append(state["user_input"])
    return {"result": len(cache)}

Fix it by keeping request-specific data in state or using a proper external cache.

# GOOD: request-scoped data lives in state, not module globals
def node(state):
    seen = list(state.get("seen", []))
    seen.append(state["user_input"])
    return {"seen": seen}

2) Async/sync mismatch in deployment

A graph node defined as async def but called from a sync runtime can trigger event loop errors during scaling.

Common messages:

  • RuntimeError: Event loop is closed
  • RuntimeError: asyncio.run() cannot be called from a running event loop

# BAD
async def fetch_data(state):
    ...

If your app is sync end-to-end, keep nodes sync. If your stack is async, define async nodes and invoke the graph through the async path (ainvoke) consistently.

# GOOD
async def fetch_data(state):
    data = await some_async_client.get(...)
    return {"data": data}

# run inside an async context (e.g. an async framework handler)
result = await graph.ainvoke({"query": "hello"})
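
If your entrypoint is a plain synchronous script, one way to bridge is asyncio.run, provided no event loop is already running (calling it inside a running loop raises exactly the second error above):

# GOOD: bridging from a sync entrypoint
import asyncio

result = asyncio.run(graph.ainvoke({"query": "hello"}))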

3) Checkpointer/backend not configured for multi-worker use

A memory checkpointer works locally but fails once you have multiple replicas because each pod has its own memory. You’ll see lost state or inconsistent resumes.

# BAD for scaling
from langgraph.checkpoint.memory import MemorySaver

checkpointer = MemorySaver()
graph = builder.compile(checkpointer=checkpointer)

Use a shared persistent backend instead:

# GOOD: shared storage-backed checkpointer
graph = builder.compile(checkpointer=your_postgres_checkpointer)

If you’re on Kubernetes or ECS, memory-based persistence is not enough.
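
One concrete option is a Postgres-backed checkpointer. A minimal sketch, assuming the langgraph-checkpoint-postgres package is installed and using a placeholder connection string (keep the context open for as long as the graph runs):

from langgraph.checkpoint.postgres import PostgresSaver

DB_URI = "postgresql://user:pass@db-host:5432/langgraph"  # placeholder

with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first run
    graph = builder.compile(checkpointer=checkpointer)
    result = graph.invoke(
        {"user_input": "hi"},
        config={"configurable": {"thread_id": "t1"}},  # checkpointers key state by thread_id
    )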

4) Returning invalid updates from nodes

LangGraph expects node outputs to match the declared state shape. Returning a string or a custom object can fail with:

  • langgraph.errors.InvalidUpdateError: Expected dict, got ...

# BAD
def node(state):
    return "done"

Return a dict whose keys are fields declared in your state schema:

# GOOD
def node(state):
    return {"status": "done"}

How to Debug It

  1. Check whether the crash happens only after scaling

    • Run the same code locally with one worker.
    • Then run with multiple workers:
      uvicorn app:app --workers 4
      
    • If it only fails here, suspect shared globals, serialization, or async issues.
  2. Inspect what your nodes return

    • Every node should return a plain dict.
    • Log the exact payload before returning it.
    • Look for objects like models, DB connections, coroutines, generators, or class instances.
  3. Remove all non-state dependencies from graph state

    • Search for keys like llm, client, session, db, engine, or chain.
    • Anything that cannot be JSON serialized should stay outside the LangGraph state object.
  4. Test with a minimal graph (see the sketch after this list)

    • Reduce your graph to one node and one field.
    • Add nodes back until it breaks.
    • The last added node usually contains the bad return value or shared dependency.
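
A minimal starting point for that bisection, using only plain data in state (MinimalState and echo are illustrative names):

from typing import TypedDict
from langgraph.graph import StateGraph

class MinimalState(TypedDict):
    value: str

def echo(state: MinimalState):
    print("node payload:", state)  # log exactly what flows through the node
    return {"value": state["value"]}

builder = StateGraph(MinimalState)
builder.add_node("echo", echo)
builder.set_entry_point("echo")
graph = builder.compile()

print(graph.invoke({"value": "hello"}))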

Prevention

  • Keep LangGraph state strictly serializable:
    • strings, ints, floats, bools, lists, dicts only.
  • Use dependency injection for runtime objects:
    • create LLM clients, DB sessions, and HTTP clients outside the state.
  • Match your execution model:
    • don’t mix sync deployment code with async nodes unless you’re invoking them correctly end-to-end.
  • Use persistent checkpointing if you scale beyond one process:
    • memory checkpointers are fine for dev; they are not fine for production replicas.

If you’re seeing deployment crashes when scaling a LangGraph app in Python, start by checking your state shape. In most real deployments I’ve debugged, the bug was not LangGraph itself: it was something non-serializable leaking into graph state, or a worker/runtime mismatch exposed by concurrency.

