How to Fix 'chain execution stuck when scaling' in LangGraph (TypeScript)
When LangGraph gets “stuck when scaling,” it usually means one of your graph runs is waiting forever on a node that never resolves, or your runtime is exhausting a shared resource under concurrent load. In TypeScript, this shows up most often when you move from one-off local runs to multiple parallel requests, worker pools, or serverless traffic.
The symptom is usually one of these:
- The request hangs with no final output
- A node keeps retrying or never reaches END
- You see errors like `GraphRecursionError: Recursion limit of 25 reached without hitting a stop condition`
- The app appears frozen because a promise inside a node never settles
The Most Common Cause
The #1 cause is a node that performs async work but does not return a proper value on every path, or mutates shared state in a way that causes downstream nodes to wait forever.
In LangGraph, each node must return a partial state update. If you accidentally await something that can hang, swallow an exception, or forget to return the next state, the graph can stall. This gets much worse under scale because concurrency exposes race conditions and long-tail timeouts.
Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Node mutates shared object and sometimes returns nothing | Node returns explicit partial state every time |
| Promise can hang forever | Promise wrapped with timeout |
| Error swallowed, graph never advances | Error surfaced and handled deterministically |
```typescript
// BROKEN
import { StateGraph, START, END } from "@langchain/langgraph";

type State = {
  messages: string[];
  result?: string;
};

const graph = new StateGraph<State>({
  channels: {
    messages: { value: (x: string[], y: string[]) => x.concat(y), default: () => [] },
    result: { value: (_: string | undefined, y: string | undefined) => y },
  },
});

graph.addNode("fetchData", async (state) => {
  // Shared mutable state + no guaranteed return path
  const data = await fetch(process.env.API_URL!).then((r) => r.text());
  if (!data) {
    // Swallowed failure -> downstream nodes may wait forever
    return;
  }
  state.messages.push(data);
});

graph.addEdge(START, "fetchData");
graph.addEdge("fetchData", END);
```
```typescript
// FIXED
import { StateGraph, START, END } from "@langchain/langgraph";

type State = {
  messages: string[];
  result?: string;
};

const withTimeout = <T>(p: Promise<T>, ms = 5000) =>
  Promise.race([
    p,
    new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error(`timeout after ${ms}ms`)), ms)
    ),
  ]);

const graph = new StateGraph<State>({
  channels: {
    messages: { value: (x: string[], y: string[]) => x.concat(y), default: () => [] },
    result: { value: (_: string | undefined, y: string | undefined) => y },
  },
});

graph.addNode("fetchData", async (_state) => {
  const res = await withTimeout(fetch(process.env.API_URL!), 5000);
  const data = await res.text();
  return {
    messages: [data],
    result: data,
  };
});

graph.addEdge(START, "fetchData");
graph.addEdge("fetchData", END);
```
The key fix is simple:
- Never mutate shared state inside a node
- Always return a partial update
- Put timeouts around external calls
- Let failures fail fast instead of hanging the run (a sketch follows below)
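As a sketch of that last point, reusing the `withTimeout` helper and the `messages` channel from the fixed example above (the error text stored in state is purely illustrative), a node can catch failures and still return an explicit update so the run ends deterministically instead of hanging:

```typescript
// Fail fast, but deterministically: every path, including the error path,
// returns an explicit partial state update, so the run never stalls on this node.
graph.addNode("fetchData", async (_state) => {
  try {
    const res = await withTimeout(fetch(process.env.API_URL!), 5000);
    const data = await res.text();
    return { messages: [data], result: data };
  } catch (err) {
    // Surface the failure in state so a conditional edge (or the caller) can act on it
    return { messages: [`fetchData failed: ${(err as Error).message}`] };
  }
});
```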
Other Possible Causes
1. Recursion or cycle in the graph
If your conditional edges keep routing back into the same node without a stop condition, you’ll hit:
```
GraphRecursionError: Recursion limit of 25 reached without hitting a stop condition
```

```typescript
graph.addConditionalEdges("router", (state) =>
  state.needsMoreWork ? "router" : END
);
```
Fix it by adding an explicit counter in state:
```typescript
type State = { attempts: number; needsMoreWork: boolean };

// The "router" node must increment the counter on every pass
graph.addNode("router", async (state) => ({ attempts: state.attempts + 1 }));

graph.addConditionalEdges("router", (state) =>
  state.attempts >= 3 ? END : "router"
);
```
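If you need temporary headroom while you add the counter, the per-run recursion limit (default 25) can also be raised through the invoke config. Treat this as a stopgap rather than the fix:

```typescript
// Stopgap only: a higher recursion limit hides a runaway loop, it does not fix it
const app = graph.compile();
await app.invoke({ attempts: 0, needsMoreWork: true }, { recursionLimit: 100 });
```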
2. Non-serializable or oversized state
When scaling across workers or persistence layers, passing huge objects or non-serializable values causes stalls or storage failures.
```typescript
// Bad
return {
  rawResponse,
  socket,
};
```
Use small serializable state only:
```typescript
// Good
return {
  responseText: rawResponse.text.slice(0, 2000),
};
```
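This matters most once a checkpointer or another persistence layer has to serialize state between steps. A minimal sketch using the in-memory checkpointer that ships with @langchain/langgraph (the thread ID is arbitrary):

```typescript
import { MemorySaver } from "@langchain/langgraph";

// Everything in state gets serialized by the checkpointer between steps,
// so sockets, streams, and huge raw payloads will stall or fail here.
const app = graph.compile({ checkpointer: new MemorySaver() });
await app.invoke({ messages: ["hello"] }, { configurable: { thread_id: "run-1" } });
```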
3. Shared singleton client with connection exhaustion
If every request reuses one badly configured client pool, concurrent runs can queue indefinitely.
```typescript
const client = new SomeApiClient({ maxConnections: 1 });
```
Increase pool size or create per-request clients where appropriate:
```typescript
const client = new SomeApiClient({
  maxConnections: Number(process.env.MAX_CONNECTIONS ?? "10"),
});
```
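Alternatively, a per-request factory avoids contention on a single shared pool altogether. This is only a sketch, using the same hypothetical SomeApiClient and a made-up `get` method:

```typescript
// Hypothetical client: one instance per graph run, so concurrent runs
// never queue behind the same exhausted connection pool.
const makeClient = () =>
  new SomeApiClient({
    maxConnections: Number(process.env.MAX_CONNECTIONS ?? "10"),
  });

graph.addNode("callApi", async (_state) => {
  const client = makeClient();
  const result = await withTimeout(client.get("/data"), 5000);
  return { result };
});
```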
4. Missing await on async node internals
This creates “fake success” where the graph advances before work finishes.
graph.addNode("save", async (state) => {
db.write(state.result); // missing await
return { resultSaved: true };
});
Fix it:
graph.addNode("save", async (state) => {
await db.write(state.result!);
return { resultSaved: true };
});
How to Debug It
- Turn on node-level logging
  - Log entry and exit for every node.
  - If you see entry without exit, that node is your stall point.
- Add hard timeouts to all external calls
  - API requests
  - DB queries
  - Vector store calls
  - Tool execution
- Inspect the last emitted state
  - Check whether the graph is returning `undefined`
  - Check for missing fields required by conditional edges
- Reduce concurrency to one
  - If the bug disappears at concurrency 1, you likely have:
    - shared mutable state
    - connection pool starvation
    - race conditions in conditional routing
A practical pattern for tracing is this:
graph.addNode("myNode", async (state) => {
console.log("[myNode] input", JSON.stringify(state));
const nextState = await doWork(state);
console.log("[myNode] output", JSON.stringify(nextState));
return nextState;
});
If input logs appear but output logs do not, the hang is inside `doWork`. If output logs appear but execution still stalls, check your edges and stop conditions.
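You can also inspect per-node updates without touching any node by streaming the compiled graph; the last update printed before the hang points at the stalled node or edge. A sketch, assuming the graph from the fixed example above:

```typescript
const app = graph.compile();

// Each streamed chunk is keyed by node name and holds that node's partial state update
for await (const chunk of await app.stream({ messages: ["debug run"] })) {
  console.log("update:", JSON.stringify(chunk));
}
```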
Prevention
- Keep LangGraph state small, serializable, and deterministic.
- Return a partial update from every node on every path.
- Wrap all network and database calls with explicit timeouts.
- Add an attempt counter for loops and retries.
- Test graphs under concurrency before shipping them to production (a quick smoke test is sketched below).
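As a starting point for that last item, here is a minimal concurrency smoke test, reusing the `withTimeout` helper and a compiled `app` from earlier (the run count and timeout are arbitrary):

```typescript
// Fire 20 runs in parallel; if single runs pass but several of these reject
// or time out, suspect shared mutable state, pool exhaustion, or routing races.
const runs = Array.from({ length: 20 }, (_, i) =>
  withTimeout(app.invoke({ messages: [`request ${i}`] }), 15_000)
);

const results = await Promise.allSettled(runs);
const failed = results.filter((r) => r.status === "rejected").length;
console.log(`${failed} of ${runs.length} runs failed`);
```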
If you’re seeing chain execution stuck when scaling, start with the node that talks to the outside world. In real systems, that’s usually where the hang begins.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.