How to Fix 'rate limit exceeded when scaling' in LangGraph (TypeScript)

By Cyprian Aarons · Updated 2026-04-21

What the error means

rate limit exceeded when scaling usually means your LangGraph app is spawning more model calls than the provider allows at once. In practice, this shows up when you add concurrency, fan-out branches, retries, or multiple agents and suddenly hit OpenAI, Anthropic, Azure OpenAI, or Bedrock limits.

In LangGraph TypeScript projects, the failure often appears as a provider 429 wrapped inside graph execution errors like GraphRecursionError, InvalidUpdateError, or a plain thrown RateLimitError from the SDK.

The Most Common Cause

The #1 cause is uncontrolled parallelism inside a graph node or across branches. People move from a single linear chain to a graph with Promise.all() or multiple edges firing at once, and every branch hits the LLM at the same time.

Here’s the broken pattern:

Broken: fire all requests at once

import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

async function enrichMany(items: string[]) {
  // Broken: every item becomes an LLM call at once
  return Promise.all(
    items.map(async (item) => {
      const res = await llm.invoke(`Summarize this claim: ${item}`);
      return res.content;
    })
  );
}

Fixed: limit concurrency or serialize calls

import pLimit from "p-limit";
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });
const limit = pLimit(2); // keep concurrency under provider limits

async function enrichMany(items: string[]) {
  return Promise.all(
    items.map((item) =>
      limit(async () => {
        const res = await llm.invoke(`Summarize this claim: ${item}`);
        return res.content;
      })
    )
  );
}

If you’re using LangGraph fan-out, the same issue happens when several nodes call the model in parallel. The fix is not “add retries everywhere”; it’s to control concurrency at the node level and at the workflow level.

A practical rule: if your graph can emit N parallel branches and each branch can call the model M times, your peak request rate is N × M. That's how people accidentally turn a safe 5 RPS app into a 50 RPS spike.
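
To enforce that cap at the workflow level, you can route every node's model calls through one shared limiter, so fan-out branches queue instead of firing together. A minimal sketch (planNode and reviewNode are hypothetical node functions):

import pLimit from "p-limit";
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

// One limiter for the whole process: no matter how many branches fan out,
// at most two provider calls are in flight at once.
const llmLimit = pLimit(2);

const callLLM = (prompt: string) =>
  llmLimit(async () => (await llm.invoke(prompt)).content);

// Hypothetical node functions; both draw from the same budget.
async function planNode(state: { task: string }) {
  return { plan: await callLLM(`Plan steps for: ${state.task}`) };
}

async function reviewNode(state: { draft: string }) {
  return { review: await callLLM(`Review this draft: ${state.draft}`) };
}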

Other Possible Causes

1. Retry storms

If you retry immediately on 429, you can make things worse. Multiple workers retry together and keep colliding with the same quota window.

// Bad: immediate retry with no jitter; colliding workers retry in lockstep
async function retryHammer(prompt: string) {
  for (let i = 0; i < 3; i++) {
    try {
      return await llm.invoke(prompt);
    } catch (e) {
      if (String(e).includes("rate limit")) continue; // retries instantly
      throw e;
    }
  }
}

Use exponential backoff with jitter:

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function invokeWithBackoff(prompt: string) {
  let lastError: unknown;
  for (let i = 0; i < 3; i++) {
    try {
      return await llm.invoke(prompt);
    } catch (e) {
      if (!String(e).includes("rate limit")) throw e;
      lastError = e;
      // Waits ~500ms, ~1s, ~2s plus up to 250ms of jitter so retries spread out.
      await sleep(500 * Math.pow(2, i) + Math.random() * 250);
    }
  }
  throw lastError; // all retries exhausted
}

2. Multiple graph executions sharing one quota window

This happens when you scale horizontally and each instance thinks it can use the full provider quota.

// Two pods, same API key, each running max concurrency = 10
// Effective load doubles instantly.

Fix it by capping each instance's concurrency so the sum across all instances stays under your global quota, and by routing calls through a shared queue if you need tighter control.
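
A sketch of that per-instance budget, assuming hypothetical GLOBAL_MAX_CONCURRENCY and INSTANCE_COUNT environment variables set by your deployment:

import pLimit from "p-limit";

// Divide one global concurrency budget across instances so two pods with the
// same API key cannot double the effective load.
const globalMax = Number(process.env.GLOBAL_MAX_CONCURRENCY ?? "10");
const instances = Number(process.env.INSTANCE_COUNT ?? "1");
const limit = pLimit(Math.max(1, Math.floor(globalMax / instances)));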

3. Streaming plus tool calls multiplying requests

A single user request can trigger:

  • one planner call
  • one tool selection call
  • one tool execution summary call
  • one final answer call

That is four provider calls for one incoming message.

const model = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0, // deterministic output, so routing and classification results are safe to cache
});

Reduce unnecessary intermediate calls, or cache deterministic steps like routing and classification.
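
For the caching half, a minimal in-memory sketch that reuses the temperature-0 model above (classifyIntent is a hypothetical routing step; the cache is per-process only):

const routeCache = new Map<string, string>();

// With temperature 0, identical inputs classify identically, so a repeat
// message costs zero provider calls.
async function classifyIntent(message: string): Promise<string> {
  const cached = routeCache.get(message);
  if (cached !== undefined) return cached;
  const res = await model.invoke(`Classify the intent of: ${message}`);
  const intent = String(res.content);
  routeCache.set(message, intent);
  return intent;
}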

4. Badly configured LangGraph loop/recursion behavior

If your graph keeps re-entering a node because state never changes correctly, you may see repeated model calls until rate limits hit.

import { StateGraph, END } from "@langchain/langgraph";

// If state.done never flips, this edge routes back into "work" forever,
// and every pass calls the LLM again until the provider rate-limits you:
// graph.addConditionalEdges("work", (state) => (state.done ? END : "work"));

Watch for state updates that do not actually mutate the fields used by your conditional edges.
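
As a safety net, you can also cap re-entry with recursionLimit, so a stuck loop fails fast with a GraphRecursionError instead of burning quota. A sketch, assuming graph is your compiled StateGraph:

// Fail after 10 supersteps with a GraphRecursionError instead of
// re-calling the LLM until the provider returns 429s.
const result = await graph.invoke(input, { recursionLimit: 10 });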

How to Debug It

  1. Count actual model calls per user request
    Add logging around every llm.invoke() and every tool that internally calls an LLM. If one request produces more calls than you expect, you have found your multiplier (a counter sketch follows this list).

  2. Check whether errors are true provider 429s
    Look for messages like:

    • 429 Too Many Requests
    • RateLimitError
    • openai.RateLimitError
    • anthropic.RateLimitError

    If LangGraph wraps it in another error, inspect error.cause.

  3. Measure concurrency at runtime
    Log active requests before and after each invocation:

    let active = 0;
    
    async function trackedInvoke(prompt: string) {
      active++;
      console.log("active_llm_requests=", active);
      try {
        return await llm.invoke(prompt);
      } finally {
        active--;
      }
    }
    
  4. Disable parallel branches temporarily
    Run the graph in a serialized mode. If the error disappears, your issue is fan-out or worker-level concurrency, not prompt size or token count.
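
For step 1, the counter can be as small as this (a sketch; requestId is assumed to come from your own request context or graph state):

const callCounts = new Map<string, number>();

// Wrap every model call so each user request reports its true multiplier.
async function countedInvoke(requestId: string, prompt: string) {
  const n = (callCounts.get(requestId) ?? 0) + 1;
  callCounts.set(requestId, n);
  console.log(`request=${requestId} llm_call=${n}`);
  return llm.invoke(prompt);
}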

Prevention

  • Set explicit concurrency limits on every worker that touches an LLM.
  • Use exponential backoff with jitter for retries; never hammer immediately after a 429.
  • Keep graphs deterministic where possible: fewer loops, fewer redundant LLM calls, fewer branches that all call the same model.

If you want a stable production pattern, treat rate limits as a capacity-planning problem, not just an exception-handling problem. In LangGraph TypeScript apps, most “rate limit exceeded when scaling” incidents come from request amplification hidden inside graphs.


By Cyprian Aarons, AI Consultant at Topiax.