How to Fix 'deployment crash in production' in LangGraph (Python)
What this error usually means
If you’re seeing 'deployment crash in production' with LangGraph, the graph is usually failing during startup or on the first request after deployment. In practice, this is almost always a Python import/runtime issue, a bad graph construction pattern, or a state/checkpoint mismatch that only shows up once the app is packaged and run in a real environment.
The key thing: LangGraph itself is rarely the root cause. The crash is usually triggered by your app code, then surfaced by your platform as a deployment failure.
The Most Common Cause
The #1 cause I see is building the graph with side effects at import time, then deploying it into an environment where dependencies, env vars, or compiled objects are not ready yet.
Typical symptoms include:
- ImportError: cannot import name ...
- KeyError: 'OPENAI_API_KEY'
- TypeError: StateGraph.__init__() missing ...
- langgraph.errors.GraphRecursionError during first execution because the graph was wired incorrectly
Here’s the broken pattern:
# broken.py
from langgraph.graph import StateGraph, END
from my_app.llm import llm # may fail in prod if env isn't ready
builder = StateGraph(dict)
# side effect at import time
model_name = llm.model_name # can crash if llm isn't initialized
builder.add_node("agent", lambda state: {"messages": llm.invoke(state["messages"])})
builder.set_entry_point("agent")
builder.add_edge("agent", END)
graph = builder.compile()
And here’s the fixed pattern:
# fixed.py
from langgraph.graph import StateGraph, END
def build_graph(llm):
    builder = StateGraph(dict)
    # the node closes over the injected llm instead of a module-level global
    def agent_node(state):
        result = llm.invoke(state["messages"])
        return {"messages": result}
    builder.add_node("agent", agent_node)
    builder.set_entry_point("agent")
    builder.add_edge("agent", END)
    return builder.compile()
| Broken | Fixed |
|---|---|
| Graph compiled at module import | Graph built inside a function |
| Depends on runtime state before app is ready | Dependencies injected explicitly |
| Hard to test and hard to deploy | Deterministic startup path |
In production, this matters because your container imports modules before your app framework has loaded secrets, connected to services, or patched environment variables. If anything in that import chain throws, your deployment crashes.
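One way to keep construction off the import path is to put it behind a cached accessor that only runs on first use. The sketch below is a minimal illustration, not a LangGraph API: `make_lazy`, `get_graph`, and `build_graph_stub` are hypothetical names, and the stub stands in for a real `build_graph(llm)` call.

```python
from functools import lru_cache
from typing import Any, Callable, List


def make_lazy(factory: Callable[[], Any]) -> Callable[[], Any]:
    """Return an accessor that runs `factory` once, on first call, and caches the result."""

    @lru_cache(maxsize=1)
    def get() -> Any:
        return factory()

    return get


# Hypothetical stand-in for a real builder such as build_graph(llm).
build_calls: List[str] = []


def build_graph_stub() -> dict:
    build_calls.append("built")
    return {"compiled": True}


# Importing this module only defines the accessor; nothing is built yet.
get_graph = make_lazy(build_graph_stub)
```

Request handlers would call `get_graph()` instead of importing a module-level `graph`. Because `lru_cache` does not cache exceptions, a build that fails is retried on the next call rather than leaving a broken object behind.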
Other Possible Causes
1) Missing or invalid environment variables
LangGraph apps often depend on model providers or stores that need secrets at runtime.
# broken
import os
api_key = os.environ["OPENAI_API_KEY"] # KeyError in prod if missing
Fix it by validating early:
import os
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is required")
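When an app needs several secrets, it can help to check them all in one place and report every missing name at once. This is a sketch under that assumption; `require_env` is a hypothetical helper, and the optional `env` argument exists only so the check can be exercised without touching the real environment.

```python
import os
from typing import Dict, Iterable, Mapping, Optional


def require_env(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return the requested variables, or raise one error naming every missing one."""
    source = os.environ if env is None else env
    values = {name: source.get(name, "") for name in names}
    # Treat empty or whitespace-only values as missing.
    missing = sorted(name for name, value in values.items() if not value.strip())
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return values
```

Calling `require_env(["OPENAI_API_KEY"])` from your startup function, rather than at module import, keeps the failure message explicit while leaving imports side-effect free.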
2) Checkpointer configured incorrectly
If you use persistence with MemorySaver, Postgres, Redis, or another checkpointer, a bad config can crash startup or the first invoke.
from langgraph.checkpoint.memory import MemorySaver
checkpointer = MemorySaver() # fine locally
graph = builder.compile(checkpointer=checkpointer)
But if your deployed code expects persistent threads and you swap backends without matching schema/config, you may get errors like:
- ValueError: Checkpointer requires thread_id
- OperationalError from your database driver
- langgraph.errors.InvalidUpdateError
Make sure your invoke includes thread metadata when required:
config = {"configurable": {"thread_id": "prod-thread-123"}}
result = graph.invoke({"messages": []}, config=config)
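If thread IDs come from request data, a small guard can reject empty values before they reach the graph. A minimal sketch, assuming string thread IDs; `make_thread_config` is a hypothetical helper, not part of LangGraph.

```python
from typing import Any, Dict


def make_thread_config(thread_id: str) -> Dict[str, Any]:
    """Build the invoke config, rejecting empty thread IDs up front."""
    if not isinstance(thread_id, str) or not thread_id.strip():
        raise ValueError("thread_id must be a non-empty string")
    return {"configurable": {"thread_id": thread_id}}
```

The result can be passed as `config=make_thread_config(thread_id)` in the `graph.invoke(...)` call, so a missing ID fails with a message you control instead of an error from deeper in the stack.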
3) Wrong node return shape
LangGraph nodes must return updates that match your state schema. Returning raw strings or malformed dicts can blow up at runtime.
# broken
def agent_node(state):
    return "hello"
Fix it by returning a valid state update:
# fixed
def agent_node(state):
    return {"messages": [{"role": "assistant", "content": "hello"}]}
If you’re using typed state with TypedDict or Pydantic models, make sure every node returns fields that conform to that schema.
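To catch shape problems closer to the node that caused them, you could wrap nodes in a checker during development. The sketch below assumes a `TypedDict` state; `AgentState` and `checked_node` are hypothetical names, and the check is deliberately stricter than LangGraph itself, so relax it if your nodes return other supported values.

```python
from functools import wraps
from typing import Any, Callable, Dict, List, TypedDict


class AgentState(TypedDict):
    # Hypothetical schema used only for this illustration.
    messages: List[dict]


def checked_node(
    fn: Callable[[Dict[str, Any]], Any], schema: type = AgentState
) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Wrap a node so a non-dict or unknown-key update fails with a clear error."""
    allowed = set(schema.__annotations__)

    @wraps(fn)
    def wrapper(state: Dict[str, Any]) -> Dict[str, Any]:
        update = fn(state)
        if not isinstance(update, dict):
            raise TypeError(
                f"{fn.__name__} returned {type(update).__name__}, expected dict"
            )
        unknown = set(update) - allowed
        if unknown:
            raise KeyError(
                f"{fn.__name__} returned unknown state keys: {sorted(unknown)}"
            )
        return update

    return wrapper
```

A wrapped node is registered the same way as any other callable, for example `builder.add_node("agent", checked_node(agent_node))`.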
4) Recursive loop with no exit condition
A graph can look fine in code and still fail in production with:
- langgraph.errors.GraphRecursionError: Recursion limit of 25 reached without hitting a stop condition
Broken example:
builder.add_edge("agent", "agent") # infinite loop
Fixed example:
builder.add_conditional_edges(
    "agent",
    route_fn,
    {"continue": "agent", "end": END},
)
If your router never returns "end", the graph will keep executing until it hits the recursion limit.
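A router with an explicit step budget guarantees an exit even when the model never signals completion. This is a sketch, assuming the state carries a `done` flag and a `steps` counter that your nodes maintain; both field names and `MAX_STEPS` are assumptions made for illustration.

```python
from typing import Any, Dict

MAX_STEPS = 5  # assumed cap for illustration; tune it for your workload


def route_fn(state: Dict[str, Any]) -> str:
    """Return "end" when the agent is done or the step budget is spent."""
    if state.get("done", False):
        return "end"
    if state.get("steps", 0) >= MAX_STEPS:
        return "end"
    return "continue"
```

For this to work, the agent node has to increment `steps` in the update it returns; otherwise the budget check never fires.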
How to Debug It
- Check the actual Python traceback
  - Don’t stop at “deployment crash”.
  - Look for the first real exception: ImportError, KeyError, ValidationError, GraphRecursionError, or database errors.
  - The top-level platform message is usually just a wrapper.
- Run the exact same entrypoint locally
  - Use the same Python version, same dependencies, same env vars.
  - Start with:
    python -m your_app.main
  - If it fails locally, you’ve narrowed it down to app code instead of infra.
- Print graph construction boundaries
  - Add logs before and after building/compiling:
    print("building graph")
    graph = build_graph(llm)
    print("graph built")
  - If “building graph” prints but “graph built” does not, the failure is inside compile-time setup.
- Validate inputs and config before invoke
  - Confirm required keys exist:
    assert "messages" in payload
    assert thread_id is not None
  - For stateful graphs, confirm your runtime config includes what the checkpointer expects.
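The boundary logging from the steps above can be wrapped in a small context manager so the traceback ends up in your logs alongside the step name. A minimal sketch using only the standard library; `log_step` and the `graph_startup` logger name are hypothetical.

```python
import logging
from contextlib import contextmanager
from typing import Iterator

logger = logging.getLogger("graph_startup")  # hypothetical logger name


@contextmanager
def log_step(name: str) -> Iterator[None]:
    """Log when a startup step begins, finishes, or fails, then re-raise any error."""
    logger.info("start: %s", name)
    try:
        yield
    except Exception:
        # logger.exception records the full traceback at ERROR level.
        logger.exception("failed: %s", name)
        raise
    logger.info("done: %s", name)
```

Usage would look like `with log_step("build graph"): graph = build_graph(llm)`. If "start" appears without a matching "done", the "failed" entry just after it carries the first real exception.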
Prevention
- Build graphs inside functions, not at module import time.
- Validate env vars and provider config before compiling or invoking.
- Add one integration test that runs compile() plus one real .invoke() against production-like input.
- If you use checkpoints, always test with the same backend you deploy with.
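The integration test can be as small as one helper that builds the graph and runs a single invoke. The sketch below assumes a zero-argument builder and dict-based state; `smoke_test_graph` is a hypothetical name, and the only graph method it relies on is `invoke(payload, config=...)` as used earlier in this article.

```python
from typing import Any, Callable, Dict, Optional


def smoke_test_graph(
    build: Callable[[], Any],
    payload: Dict[str, Any],
    config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the graph and run one invoke, failing loudly on a non-dict result."""
    graph = build()  # exercises construction and compile()
    result = graph.invoke(payload, config=config)  # exercises one real execution
    if not isinstance(result, dict):
        raise AssertionError(
            f"expected a dict state, got {type(result).__name__}"
        )
    return result
```

In a test suite this might be called as `smoke_test_graph(lambda: build_graph(llm), {"messages": []}, config)`, with the same checkpointer backend you deploy with.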
By Cyprian Aarons, AI Consultant at Topiax.