How to Fix 'deployment crash when scaling' in LangGraph (TypeScript)

By Cyprian Aarons · Updated 2026-04-21
Tags: deployment-crash-when-scaling, langgraph, typescript

When a LangGraph deployment crashes only after you scale replicas, it usually means your graph is holding state or resources in a way that works in one process but breaks when requests land on multiple workers. In TypeScript, the failure often shows up as intermittent TypeErrors, missing thread state, duplicated writes, or a startup crash when the graph is rebuilt per request.

The pattern is almost always the same: something that should be shared, persisted, or initialized once is being created inside request scope or stored in memory on a single pod.

The Most Common Cause

The #1 cause is in-memory state used as if it were durable shared state.

With LangGraph, this usually means you are using MemorySaver or local variables in a deployment that scales horizontally. It works on one instance because all calls hit the same memory space. Once Kubernetes or your platform sends traffic to another replica, the graph “forgets” the thread and you get errors like:

  • Error: No checkpoint found for thread_id=...
  • TypeError: Cannot read properties of undefined
  • InvalidUpdateError: Expected state keys ...
  • GraphRecursionError when retries re-run with lost context

Broken vs fixed pattern

Broken                                 Fixed
Uses process memory for checkpoints    Uses a persistent checkpointer
Recreates graph per request            Creates graph once at startup
Assumes one pod = one app              Assumes many pods = shared storage
// BROKEN: MemorySaver dies with the process
import { MemorySaver } from "@langchain/langgraph";

const checkpointer = new MemorySaver();

export async function handler(req: Request) {
  const graph = buildGraph(checkpointer); // recreated per request
  return graph.invoke(
    { messages: [{ role: "user", content: "hi" }] },
    { configurable: { thread_id: req.headers.get("x-thread-id")! } }
  );
}
// FIXED: persistent checkpointer + singleton graph
import { SqliteSaver } from "@langchain/langgraph-checkpoint-sqlite";

const checkpointer = SqliteSaver.fromConnString(process.env.CHECKPOINT_DB!);
const graph = buildGraph(checkpointer); // create once at module load

export async function handler(req: Request) {
  return graph.invoke(
    { messages: [{ role: "user", content: "hi" }] },
    { configurable: { thread_id: req.headers.get("x-thread-id")! } }
  );
}

If you are deploying behind multiple replicas, use Postgres, Redis-backed storage, or another shared persistence layer. Local memory is fine for local dev and unit tests, not for scaled production traffic.
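For example, swapping in the Postgres checkpointer is a small change. A minimal sketch, assuming the @langchain/langgraph-checkpoint-postgres package is installed, DATABASE_URL points at a Postgres instance reachable from every replica, and buildGraph is the same helper used above:

import { PostgresSaver } from "@langchain/langgraph-checkpoint-postgres";

const checkpointer = PostgresSaver.fromConnString(process.env.DATABASE_URL!);
await checkpointer.setup(); // create the checkpoint tables once, at startup

const graph = buildGraph(checkpointer); // shared state, safe across replicas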

Other Possible Causes

1. Graph is built inside the request path

If your app rebuilds the graph on every request, scaling amplifies startup cost and can expose race conditions.

// BAD
app.post("/chat", async (req, res) => {
  const graph = createGraph(); // expensive and repeated per request
  res.json(await graph.invoke(req.body));
});
// GOOD
const graph = createGraph(); // built once at module load

app.post("/chat", async (req, res) => {
  res.json(await graph.invoke(req.body));
});
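In serverless runtimes, where module scope may be re-evaluated on every cold start, a memoized factory gives the same once-per-process behavior. A minimal sketch, reusing the createGraph and app from the example above:

// Build the graph on the first request, then reuse it for the
// lifetime of the process.
let cached: ReturnType<typeof createGraph> | undefined;

function getGraph() {
  cached ??= createGraph();
  return cached;
}

app.post("/chat", async (req, res) => {
  res.json(await getGraph().invoke(req.body));
});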

2. Non-serializable values in state

LangGraph checkpoints must serialize cleanly. Putting class instances, functions, DB clients, or circular objects into state can crash under checkpointing.

// BAD
const update = {
  dbClient,
  messages,
};
// GOOD
const update = {
  messages,
};
// Keep dbClient outside graph state; inject it through closures/services.
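One way to keep the client out of state is to close over it when you define the node, so the checkpointer only ever sees plain data. A sketch, assuming a pg connection pool and an illustrative fetchUser node with a simple state shape:

import { Pool } from "pg";

// Module scope: the pool is never part of graph state, so it is
// never serialized by the checkpointer.
const db = new Pool({ connectionString: process.env.DATABASE_URL });

// The node closes over `db`; its returned update is plain JSON.
const fetchUser = async (state: { userId: string }) => {
  const { rows } = await db.query("SELECT name FROM users WHERE id = $1", [
    state.userId,
  ]);
  return { userName: rows[0]?.name ?? "unknown" };
};

builder.addNode("fetchUser", fetchUser); // builder: your StateGraph instance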

3. Missing or unstable thread_id

If each replica generates a different thread ID or the client doesn’t send one consistently, LangGraph treats every call as a new conversation.

// BAD
configurable: { thread_id: crypto.randomUUID() }
// GOOD
configurable: { thread_id: req.headers.get("x-thread-id")! }

Use a stable ID from your app domain:

  • user session ID
  • conversation ID
  • case ID
  • claim ID
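For example, composing the ID from identifiers the client already carries keeps it identical across replicas and retries (the session and conversation fields here are illustrative):

// Same user + same conversation always maps to the same thread,
// regardless of which replica serves the request.
const threadId = `user:${session.userId}:conv:${conversationId}`;

await graph.invoke(input, { configurable: { thread_id: threadId } });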

4. Concurrent writes to the same thread

Two requests updating the same thread at once can produce conflicts or inconsistent state.

// Example symptom:
// Error: Concurrent update detected for thread_id=abc123

Fix this by:

  • serializing writes per thread (see the sketch after this list)
  • using a queue/lock around updates
  • avoiding parallel .invoke() calls for the same thread_id
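Within a single process, a per-thread promise chain is enough to serialize writes; across replicas you would need a shared lock (for example in Redis), but the idea is the same. A minimal single-process sketch:

// Each new invoke for a thread waits for the previous one to settle.
const queues = new Map<string, Promise<unknown>>();

function invokeSerialized(threadId: string, input: unknown) {
  const prev = queues.get(threadId) ?? Promise.resolve();
  const next = prev
    .catch(() => {}) // a failed run must not block later runs
    .then(() => graph.invoke(input, { configurable: { thread_id: threadId } }));
  queues.set(threadId, next); // note: evict settled entries in production
  return next;
}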

How to Debug It

  1. Check whether the crash only happens after scaling

    • If it works with one replica and fails with two or more, suspect in-memory state first.
    • Log pod name and thread ID on every request (see the middleware sketch after this list).
  2. Inspect the exact error text

    • No checkpoint found points to storage/thread mismatch.
    • Cannot read properties of undefined often means missing prior state.
    • Concurrently modified or similar points to race conditions.
  3. Verify where the graph is instantiated

    • If you see new StateGraph(...) inside route handlers or serverless handlers, move it to module scope.
    • Ensure checkpointers are not recreated per request unless they connect to shared persistence.
  4. Temporarily swap in a persistent checkpointer

    • Replace MemorySaver with SQLite/Postgres/Redis-backed storage.
    • If the issue disappears immediately, you’ve confirmed the root cause.
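For step 1, a small request log makes the replica/thread correlation visible. A sketch assuming Express and the Kubernetes default of setting HOSTNAME to the pod name:

// If "No checkpoint found" always follows a pod change for the same
// thread, in-memory state is almost certainly the cause.
app.use((req, _res, next) => {
  console.log(
    JSON.stringify({
      pod: process.env.HOSTNAME ?? "unknown",
      threadId: req.headers["x-thread-id"] ?? "missing",
      path: req.path,
    })
  );
  next();
});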

Prevention

  • Create the LangGraph app once at process startup, not inside each handler.
  • Use shared persistence for checkpoints in any horizontally scaled deployment.
  • Keep graph state JSON-serializable; store services and clients outside state.
  • Make thread_id stable and deterministic across retries and replicas.

If you want a quick rule of thumb: if your deployment has more than one worker and your graph depends on memory that lives inside one worker, it will eventually crash or behave inconsistently. Fix persistence first, then look at concurrency and serialization.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
