How to Fix 'deployment crash when scaling' in LangGraph (TypeScript)
When a LangGraph deployment crashes only after you scale replicas, it usually means your graph is holding state or resources in a way that works on one process but breaks when requests land on multiple workers. In TypeScript, the failure often shows up as intermittent TypeError, missing thread state, duplicated writes, or a startup crash when the graph is rebuilt per request.
The pattern is almost always the same: something that should be shared, persisted, or initialized once is being created inside request scope or stored in memory on a single pod.
The Most Common Cause
The #1 cause is in-memory state used as if it were durable shared state.
With LangGraph, this usually means you are using MemorySaver or local variables in a deployment that scales horizontally. It works on one instance because all calls hit the same memory space. Once Kubernetes or your platform sends traffic to another replica, the graph “forgets” the thread and you get errors like:
- `Error: No checkpoint found for thread_id=...`
- `TypeError: Cannot read properties of undefined`
- `InvalidUpdateError: Expected state keys ...`
- `GraphRecursionError` when retries re-run with lost context
Broken vs fixed pattern
| Broken | Fixed |
|---|---|
| Uses process memory for checkpoints | Uses a persistent checkpointer |
| Recreates graph per request | Creates graph once at startup |
| Assumes one pod = one app | Assumes many pods = shared storage |
```typescript
// BROKEN: MemorySaver dies with the process
import { MemorySaver } from "@langchain/langgraph";

const checkpointer = new MemorySaver();

export async function handler(req: Request) {
  const graph = buildGraph(checkpointer); // recreated per request
  return graph.invoke(
    { messages: [{ role: "user", content: "hi" }] },
    { configurable: { thread_id: req.headers.get("x-thread-id") } }
  );
}
```
```typescript
// FIXED: persistent checkpointer + singleton graph
import { SqliteSaver } from "@langchain/langgraph-checkpoint-sqlite";

const checkpointer = SqliteSaver.fromConnString(process.env.CHECKPOINT_DB!);
const graph = buildGraph(checkpointer); // create once at module load

export async function handler(req: Request) {
  return graph.invoke(
    { messages: [{ role: "user", content: "hi" }] },
    { configurable: { thread_id: req.headers.get("x-thread-id")! } }
  );
}
```
Note that a local SQLite file only helps if every replica reads the same database. If you are deploying behind multiple replicas, use Postgres, Redis-backed storage, or another shared persistence layer. Local memory is fine for local dev and unit tests, not for scaled production traffic.
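For a multi-replica deployment, the checkpointer swap is a small wiring change. A sketch, assuming the `@langchain/langgraph-checkpoint-postgres` package and a `DATABASE_URL` pointing at a Postgres instance all replicas can reach (verify the exact API against the package docs):

```typescript
// Sketch: shared Postgres checkpointer so every replica sees every thread.
// PostgresSaver is from @langchain/langgraph-checkpoint-postgres; the env
// var name and buildGraph helper are assumptions carried over from above.
import { PostgresSaver } from "@langchain/langgraph-checkpoint-postgres";

const checkpointer = PostgresSaver.fromConnString(process.env.DATABASE_URL!);
await checkpointer.setup(); // creates the checkpoint tables on first run

const graph = buildGraph(checkpointer); // same singleton pattern as above
```

Run `setup()` once (e.g. at startup or in a migration step); after that, any replica can resume any thread.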
Other Possible Causes
1. Graph is built inside the request path
If your app rebuilds the graph on every request, scaling amplifies startup cost and can expose race conditions.
```typescript
// BAD
app.post("/chat", async (req, res) => {
  const graph = createGraph(); // expensive and repeated
  res.json(await graph.invoke(req.body));
});

// GOOD
const graph = createGraph(); // built once at module scope

app.post("/chat", async (req, res) => {
  res.json(await graph.invoke(req.body));
});
```
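If your platform makes true module-scope initialization awkward (some serverless runtimes do), a lazy singleton gets the same effect: cache the build *promise*, so even concurrent cold-start requests construct the graph only once. A minimal sketch with a stand-in `createGraph` (the `Graph` type and the counter are illustrative, not LangGraph APIs):

```typescript
// Lazy singleton: caching the promise (not the resolved value) means two
// requests arriving in the same tick still trigger exactly one build.
type Graph = { invoke: (input: unknown) => Promise<unknown> };

let buildCount = 0;
async function createGraph(): Promise<Graph> {
  buildCount += 1; // stand-in for expensive graph compilation
  return { invoke: async (input) => input };
}

let graphPromise: Promise<Graph> | undefined;
function getGraph(): Promise<Graph> {
  graphPromise ??= createGraph();
  return graphPromise;
}
```

Every handler then calls `await getGraph()` instead of `createGraph()`.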
2. Non-serializable values in state
LangGraph checkpoints must serialize cleanly. Putting class instances, functions, DB clients, or circular objects into state can crash under checkpointing.
```typescript
// BAD: dbClient is not JSON-serializable and breaks checkpointing
const update = {
  dbClient,
  messages,
};

// GOOD: only plain, serializable data in graph state
const update = {
  messages,
};
// Keep dbClient outside graph state; inject it through closures/services.
```
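A cheap way to catch this early is to round-trip state updates through JSON before they reach the checkpointer. `assertJsonSafe` below is a helper invented for this sketch, not a LangGraph API:

```typescript
import { deepStrictEqual } from "node:assert";

// Throws if an update would not survive checkpoint serialization:
// JSON.stringify rejects circular refs and BigInt outright, and the deep
// comparison catches values it silently drops or flattens (functions,
// undefined, class instances, Dates).
function assertJsonSafe(update: unknown): void {
  const roundTripped = JSON.parse(JSON.stringify(update));
  deepStrictEqual(roundTripped, update);
}
```

Call it in development builds so the failure names the offending update instead of surfacing deep inside checkpoint storage.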
3. Missing or unstable thread_id
If each replica generates a different thread ID or the client doesn’t send one consistently, LangGraph treats every call as a new conversation.
```typescript
// BAD: a fresh thread on every call, so no replica ever finds prior state
configurable: { thread_id: crypto.randomUUID() }

// GOOD: a stable ID supplied by the client
configurable: { thread_id: req.headers.get("x-thread-id")! }
```
Use a stable ID from your app domain:
- user session ID
- conversation ID
- case ID
- claim ID
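If clients cannot be trusted to send the header consistently, derive the thread ID server-side from those domain identifiers instead. `threadIdFor` is a name invented for this sketch:

```typescript
import { createHash } from "node:crypto";

// Derive a stable thread_id from domain identifiers, so every replica and
// every retry maps the same conversation to the same checkpointed thread.
function threadIdFor(userId: string, conversationId: string): string {
  return createHash("sha256")
    .update(`${userId}:${conversationId}`)
    .digest("hex")
    .slice(0, 32);
}
```

The hash keeps internal IDs out of the checkpoint store while staying deterministic across processes.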
4. Concurrent writes to the same thread
Two requests updating the same thread at once can produce conflicts or inconsistent state.
```typescript
// Example symptom:
// Error: Concurrent update detected for thread_id=abc123
```
Fix this by:
- serializing writes per thread
- using a queue/lock around updates
- avoiding parallel `.invoke()` calls for the same `thread_id`
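Within a single process, serializing writes per thread needs no external dependency: chain each call onto the previous promise for that `thread_id`. A minimal sketch (`runSerialized` is invented for this example; across multiple replicas you would need a distributed lock or queue instead):

```typescript
// Per-thread write queue: a task for a thread starts only after the
// previous task for that thread has settled, so writes never overlap.
const threadLocks = new Map<string, Promise<unknown>>();

function runSerialized<T>(threadId: string, task: () => Promise<T>): Promise<T> {
  const prev = threadLocks.get(threadId) ?? Promise.resolve();
  const next = prev.catch(() => undefined).then(task); // ignore prior failures
  threadLocks.set(threadId, next.catch(() => undefined));
  return next;
}

// Usage sketch:
// runSerialized(threadId, () =>
//   graph.invoke(input, { configurable: { thread_id: threadId } }));
```

In production you would also evict settled entries so the map does not grow without bound.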
How to Debug It
- Check whether the crash only happens after scaling
  - If it works with one replica and fails with two or more, suspect in-memory state first.
  - Log pod name and thread ID on every request.
- Inspect the exact error text
  - `No checkpoint found` points to storage/thread mismatch.
  - `Cannot read properties of undefined` often means missing prior state.
  - `Concurrently modified` or similar points to race conditions.
- Verify where the graph is instantiated
  - If you see `new StateGraph(...)` inside route handlers or serverless handlers, move it to module scope.
  - Ensure checkpointers are not recreated per request unless they connect to shared persistence.
- Temporarily swap in a persistent checkpointer
  - Replace `MemorySaver` with SQLite/Postgres/Redis-backed storage.
  - If the issue disappears immediately, you’ve confirmed the root cause.
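The "log pod name and thread ID" step can be sketched as a tiny helper. On Kubernetes, the `HOSTNAME` env var defaults to the pod name (an assumption about your platform), and `requestLogContext` is a name invented here:

```typescript
// Tag every log line with replica identity and thread, so you can see
// when consecutive calls for one thread land on different pods.
function requestLogContext(headers: Headers): string {
  const pod = process.env.HOSTNAME ?? "local";
  const threadId = headers.get("x-thread-id") ?? "missing";
  return `pod=${pod} thread=${threadId}`;
}
```

Prefix every request log with this string; a `No checkpoint found` error paired with a pod change is the in-memory-state smoking gun.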
Prevention
- Create the LangGraph app once at process startup, not inside each handler.
- Use shared persistence for checkpoints in any horizontally scaled deployment.
- Keep graph state JSON-serializable; store services and clients outside state.
- Make `thread_id` stable and deterministic across retries and replicas.
If you want a quick rule of thumb: if your deployment has more than one worker and your graph depends on memory that lives inside one worker, it will eventually crash or behave inconsistently. Fix persistence first, then look at concurrency and serialization.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit