# How to Fix 'rate limit exceeded when scaling' in LangGraph (TypeScript)

## What the error means

"rate limit exceeded when scaling" usually means your LangGraph app is spawning more model calls than the provider allows at once. In practice, this shows up when you add concurrency, fan-out branches, retries, or multiple agents and suddenly hit OpenAI, Anthropic, Azure OpenAI, or Bedrock limits.

In LangGraph TypeScript projects, the failure often appears as a provider 429 wrapped inside graph execution errors such as `GraphRecursionError` or `InvalidUpdateError`, or as a plain `RateLimitError` thrown by the SDK.
## The Most Common Cause
The #1 cause is uncontrolled parallelism inside a graph node or across branches. People move from a single linear chain to a graph with Promise.all() or multiple edges firing at once, and every branch hits the LLM at the same time.
Here’s the broken pattern:
| Broken | Fixed |
|---|---|
| Fire all requests at once | Limit concurrency or serialize calls |
```typescript
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

async function enrichMany(items: string[]) {
  // Broken: every item becomes an LLM call at once
  return Promise.all(
    items.map(async (item) => {
      const res = await llm.invoke(`Summarize this claim: ${item}`);
      return res.content;
    })
  );
}
```
And the fixed version, using `p-limit` to cap in-flight requests:

```typescript
import pLimit from "p-limit";
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });
const limit = pLimit(2); // keep concurrency under provider limits

async function enrichMany(items: string[]) {
  return Promise.all(
    items.map((item) =>
      limit(async () => {
        const res = await llm.invoke(`Summarize this claim: ${item}`);
        return res.content;
      })
    )
  );
}
```
If you’re using LangGraph fan-out, the same issue happens when several nodes call the model in parallel. The fix is not “add retries everywhere”; it’s to control concurrency at the node level and at the workflow level.
A practical rule: if your graph can emit N branches and each branch can call the model M times, your peak request rate is N x M. That’s how people accidentally turn a safe 5 RPS app into a 50 RPS spike.
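That amplification rule is worth making explicit. A minimal sketch with hypothetical numbers (the function and values are illustrative, not from any library):

```typescript
// Peak concurrent requests one user message can generate:
// branches in the fan-out × model calls per branch × retry attempts.
function peakRequests(branches: number, callsPerBranch: number, retries = 1): number {
  return branches * callsPerBranch * retries;
}

console.log(peakRequests(5, 2));    // 10 calls for a single request
console.log(peakRequests(5, 2, 3)); // 30 calls if every call retries 3 times
```

Running the numbers like this before deploying is usually faster than discovering the multiplier from a 429 in production.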
## Other Possible Causes

### 1. Retry storms

If you retry immediately on a 429, you can make things worse. Multiple workers retry together and keep colliding with the same quota window.
```typescript
// Bad: immediate retry with no jitter
for (let i = 0; i < 3; i++) {
  try {
    return await llm.invoke(prompt);
  } catch (e) {
    if (String(e).includes("rate limit")) continue;
    throw e;
  }
}
```
Use exponential backoff with jitter, and rethrow once retries are exhausted:

```typescript
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

for (let i = 0; i < 3; i++) {
  try {
    return await llm.invoke(prompt);
  } catch (e) {
    if (!String(e).includes("rate limit")) throw e;
    if (i === 2) throw e; // out of retries
    await sleep(500 * Math.pow(2, i) + Math.random() * 250);
  }
}
```
### 2. Multiple graph executions sharing one quota window

This happens when you scale horizontally and each instance thinks it can use the full provider quota.

```typescript
// Two pods, same API key, each running max concurrency = 10.
// Effective load doubles instantly.
```

Fix it by setting per-instance concurrency lower than your global quota, and by using a shared queue if needed.
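A simple way to implement that: derive each instance's limit from a global ceiling you know is safe. The constants and the `INSTANCE_COUNT` environment variable below are assumptions you would wire to your own deployment config:

```typescript
// Assumed global ceiling that keeps the whole fleet under the provider quota.
const GLOBAL_MAX_CONCURRENT = 10;

// Assumed env var injected by your orchestrator (e.g. replica count).
const INSTANCE_COUNT = Number(process.env.INSTANCE_COUNT ?? "2");

// Floor-divide the budget across instances, never dropping below 1.
const perInstanceLimit = Math.max(
  1,
  Math.floor(GLOBAL_MAX_CONCURRENT / INSTANCE_COUNT)
);

console.log("per-instance concurrency:", perInstanceLimit);
```

Feed `perInstanceLimit` into whatever limiter you use (for example `pLimit(perInstanceLimit)`), so scaling out never silently multiplies total load.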
### 3. Streaming plus tool calls multiplying requests

A single user request can trigger:

- one planner call
- one tool selection call
- one tool execution summary call
- one final answer call

That is four provider calls for one incoming message.
```typescript
import { ChatOpenAI } from "@langchain/openai";

const model = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});
```

Reduce unnecessary intermediate calls, or cache deterministic steps like routing and classification. A `temperature: 0` model makes those steps repeatable, which is what makes caching them safe.
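Caching a deterministic routing step can look like this sketch; `classify` here is a stub standing in for a temperature-0 LLM call, and all the names are illustrative:

```typescript
// In-memory cache for deterministic routing decisions.
const routeCache = new Map<string, string>();

// Stand-in for a temperature-0 LLM routing call.
async function classify(text: string): Promise<string> {
  return text.includes("refund") ? "billing" : "general";
}

// Only hit the "model" on a cache miss.
async function cachedRoute(text: string): Promise<string> {
  const hit = routeCache.get(text);
  if (hit !== undefined) return hit;
  const route = await classify(text);
  routeCache.set(text, route);
  return route;
}
```

In production you would likely key the cache on a normalized or hashed input and add an eviction policy, but the shape is the same: repeated inputs stop costing provider requests.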
### 4. Badly configured LangGraph loop/recursion behavior

If your graph keeps re-entering a node because state never changes correctly, you may see repeated model calls until rate limits hit.

```typescript
// If the routing condition never flips, the graph loops back into the
// same node and keeps calling the LLM until a limit stops it.
```

Watch for state updates that do not actually change the fields your conditional edges read.
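One cheap guard is to carry an explicit step counter in state and route to the end once it crosses a ceiling. The state shape, `MAX_STEPS`, and the edge labels below are assumptions for illustration, not LangGraph types:

```typescript
// Assumed state shape carried through the graph.
interface LoopState {
  steps: number; // incremented by the looping node on every pass
  done: boolean; // the "real" exit condition
}

const MAX_STEPS = 5; // hard ceiling so a stuck condition cannot loop forever

// The kind of predicate you would hand to a conditional edge.
function shouldContinue(state: LoopState): "again" | "end" {
  if (state.done || state.steps >= MAX_STEPS) return "end";
  return "again";
}
```

This complements LangGraph's own recursion limit: the built-in limit throws, while a counter like this lets you exit gracefully before burning model calls.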
## How to Debug It

1. Count actual model calls per user request. Add logging around every `llm.invoke()` and every tool that internally calls an LLM. If one request produces more calls than you expect, you have found your multiplier.

2. Check whether errors are true provider 429s. Look for messages like:

   - `429 Too Many Requests`
   - `RateLimitError`
   - `openai.RateLimitError`
   - `anthropic.RateLimitError`

   If LangGraph wraps it in another error, inspect `error.cause`.

3. Measure concurrency at runtime. Log active requests before and after each invocation:

   ```typescript
   let active = 0;

   async function trackedInvoke(prompt: string) {
     active++;
     console.log("active_llm_requests=", active);
     try {
       return await llm.invoke(prompt);
     } finally {
       active--;
     }
   }
   ```

4. Disable parallel branches temporarily. Run the graph in a serialized mode. If the error disappears, your issue is fan-out or worker-level concurrency, not prompt size or token count.
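If you need a quick way to force serialized mode without touching every node, a hand-rolled promise chain works as a debug toggle. This is a sketch, not a LangGraph API; `SERIALIZE` and `maybeSerialize` are names made up for illustration:

```typescript
// Debug toggle: when true, every wrapped call runs strictly one at a time.
const SERIALIZE = true;

let chain: Promise<unknown> = Promise.resolve();

function maybeSerialize<T>(fn: () => Promise<T>): Promise<T> {
  if (!SERIALIZE) return fn(); // normal (parallel) behavior
  const next = chain.then(fn); // queue behind everything already scheduled
  chain = next.catch(() => undefined); // keep the chain alive on rejection
  return next;
}
```

Wrap each model call as `maybeSerialize(() => llm.invoke(prompt))`; if the 429s vanish with the flag on, concurrency is your problem.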
## Prevention

- Set explicit concurrency limits on every worker that touches an LLM.
- Use exponential backoff with jitter for retries; never hammer immediately after a 429.
- Keep graphs deterministic where possible: fewer loops, fewer redundant LLM calls, fewer branches that all call the same model.
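Those first two rules can be combined into one wrapper. The sketch below is dependency-free (a hand-rolled gate rather than `p-limit`); `makeGate`, `safeInvoke`, and the limits are illustrative names, not from any library:

```typescript
// A tiny concurrency gate: at most `max` wrapped calls run at once.
function makeGate(max: number) {
  let active = 0;
  const waiting: Array<() => void> = [];
  return async function gated<T>(fn: () => Promise<T>): Promise<T> {
    if (active >= max) await new Promise<void>((r) => waiting.push(r));
    active++;
    try {
      return await fn();
    } finally {
      active--;
      waiting.shift()?.(); // wake the next queued caller, if any
    }
  };
}

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));
const gate = makeGate(2);

// Gated call with jittered exponential backoff on rate-limit errors.
async function safeInvoke<T>(fn: () => Promise<T>, retries = 3): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await gate(fn);
    } catch (e) {
      if (i >= retries - 1 || !String(e).includes("rate limit")) throw e;
      await sleep(250 * 2 ** i + Math.random() * 100);
    }
  }
}
```

Every model call then goes through `safeInvoke(() => llm.invoke(prompt))`, so both the concurrency cap and the retry policy live in one place instead of being re-implemented per node.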
If you want a stable production pattern, treat rate limits as a capacity-planning problem, not just an exception-handling problem. In LangGraph TypeScript apps, most “rate limit exceeded when scaling” incidents come from request amplification hidden inside graphs.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.