How to Fix 'rate limit exceeded when scaling' in AutoGen (TypeScript)

By Cyprian Aarons · Updated 2026-04-21

When AutoGen throws "rate limit exceeded when scaling", it usually means your app is creating more model calls than the provider allows in a short window. In TypeScript, this often shows up when you scale from a single-agent flow to multiple concurrent agents, parallel tool calls, or repeated retries without backoff.

The fix is usually not “increase the limit” — it’s to stop bursty request patterns, reuse clients correctly, and throttle concurrency where AutoGen fans out work.

The Most Common Cause

The #1 cause is uncontrolled concurrency. A common mistake is kicking off multiple AssistantAgent runs at once with Promise.all, which multiplies simultaneous API requests and trips provider rate limits fast.

Broken vs fixed pattern

Broken pattern → Fixed pattern
• Fire all agent runs at once → Limit concurrency and queue requests
• Create new model clients per task → Reuse one client/config
• No retry/backoff → Add retry with jitter
// BROKEN: bursty fan-out
import { AssistantAgent } from "@autogen/core";

const agents = inputs.map(
  (input) =>
    new AssistantAgent({
      name: `agent-${input.id}`,
      modelClient,
      systemMessage: "You are a helpful assistant.",
    })
);

const results = await Promise.all(
  agents.map((agent, i) => agent.run([{ role: "user", content: inputs[i].text }]))
);

// Typical failure:
// Error: rate limit exceeded when scaling
// at OpenAIChatCompletionClient.create(...)

// FIXED: throttle concurrency
import pLimit from "p-limit";
import { AssistantAgent } from "@autogen/core";

const limit = pLimit(2); // tune this to your provider quota

const results = await Promise.all(
  inputs.map((input) =>
    limit(async () => {
      const agent = new AssistantAgent({
        name: `agent-${input.id}`,
        modelClient,
        systemMessage: "You are a helpful assistant.",
      });

      return agent.run([{ role: "user", content: input.text }]);
    })
  )
);

If you’re using an AutoGen group chat or orchestrator that fans out tasks internally, the same rule applies: fewer simultaneous model calls. The error is often triggered by the aggregate request rate, not one single request.
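When the fan-out happens inside a group chat or orchestrator you don't control, per-call limits aren't enough; you can instead funnel every model call through one process-wide queue. A minimal sketch, assuming your client exposes the create(...) method seen in the stack trace above (throttleClient is an illustrative name):

// Process-wide throttle shared by every agent and group chat.
// Assumes the client exposes create(...), as in the stack trace above.
import pLimit from "p-limit";

const globalLimit = pLimit(2); // one queue for the entire process

function throttleClient<T extends { create: (...args: any[]) => Promise<any> }>(
  client: T
): T {
  const original = client.create.bind(client);
  client.create = ((...args: any[]) =>
    globalLimit(() => original(...args))) as T["create"];
  return client;
}

// Same client the examples above construct, now rate-capped:
const modelClient = throttleClient(
  new OpenAIChatCompletionClient({ model: "gpt-4o-mini" })
);

Because every agent shares this client, the cap holds even when AutoGen fans out internally.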

Other Possible Causes

1. Creating a new OpenAIChatCompletionClient for every request

This looks harmless, but it can multiply connection setup and make retries harder to control.

// BAD: constructs a new client (and agent) for every task
for (const task of tasks) {
  const modelClient = new OpenAIChatCompletionClient({ model: "gpt-4o-mini" });
  const agent = new AssistantAgent({ name: "worker", modelClient });
  await agent.run([{ role: "user", content: task }]);
}

// GOOD: one client and one agent, created once and reused
const modelClient = new OpenAIChatCompletionClient({ model: "gpt-4o-mini" });
const agent = new AssistantAgent({ name: "worker", modelClient });

for (const task of tasks) {
  await agent.run([{ role: "user", content: task }]);
}

2. Retry loops without backoff

If your wrapper retries immediately after a 429, you’ll keep hammering the API.

// BAD: immediate retry hammers the API again
try {
  await agent.run(messages);
} catch (e) {
  await agent.run(messages); // immediate retry, no delay
}

// GOOD: bounded retries with exponential backoff and jitter
async function sleep(ms: number) {
  return new Promise((r) => setTimeout(r, ms));
}

async function runWithBackoff(
  agent: AssistantAgent,
  messages: { role: string; content: string }[]
) {
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await agent.run(messages);
    } catch (e) {
      if (attempt === 2) throw e; // give up after three attempts
      // waits ~500ms, 1s, 2s; jitter spreads out concurrent retries
      await sleep(500 * 2 ** attempt + Math.random() * 250);
    }
  }
}
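Usage is a drop-in replacement for the bare agent.run call:

const result = await runWithBackoff(agent, messages);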

3. Too many tool calls inside one turn

AutoGen agents that call tools like search, retrieval, or database lookups can trigger multiple downstream LLM calls per user message.

// Example config that can amplify traffic
const agent = new AssistantAgent({
  name: "researcher",
  modelClient,
  tools: [webSearchTool, ragTool, summarizerTool],
});

If each tool result triggers another reasoning step, one user input can become five or ten API requests. Reduce tool count or split the workflow into stages.
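One way to split the stages, sketched with the same assumed @autogen/core API (the agent names and the question variable are illustrative):

// Sketch: two single-tool stages instead of one agent with three tools,
// so one user message costs at most one tool round-trip per stage
const searcher = new AssistantAgent({
  name: "searcher",
  modelClient,
  tools: [webSearchTool], // stage 1: gather raw material only
});

const summarizer = new AssistantAgent({
  name: "summarizer",
  modelClient,
  tools: [summarizerTool], // stage 2: condense stage 1's output
});

const findings = await searcher.run([{ role: "user", content: question }]);
const summary = await summarizer.run([
  { role: "user", content: `Summarize these findings:\n${JSON.stringify(findings)}` },
]);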

4. Misconfigured max turns or recursive handoffs

A group chat that keeps handing off between agents can create an accidental loop.

const team = new RoundRobinGroupChat({
  participants: [planner, coder, reviewer],
  maxTurns: 50,
});

If the conversation does not converge quickly, lower maxTurns, add termination conditions, or insert explicit state checks before re-running the loop.
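A sketch of that pattern, assuming the group chat exposes a run(...) like the agents above (isDone and budgetRemaining are illustrative helpers you would implement):

// Sketch: keep the cap low and re-run only after an explicit state check
const team = new RoundRobinGroupChat({
  participants: [planner, coder, reviewer],
  maxTurns: 6, // small cap instead of 50
});

let result = await team.run([{ role: "user", content: task }]);
while (!isDone(result) && budgetRemaining()) {
  // inspect the last message and token spend before paying for another round
  result = await team.run([{ role: "user", content: task }]);
}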

How to Debug It

  1. Count actual outbound model calls

    • Log every modelClient.create(...) or equivalent wrapper call (a counting sketch follows this list).
    • If one user action creates many calls, you have fan-out somewhere in your orchestration.
  2. Check whether failures happen only under parallel load

    • Run requests one at a time with a sequential for...of loop instead of Promise.all.
    • If the error disappears when running serially, concurrency is your problem.
  3. Inspect retry behavior

    • Search for any wrapper that catches 429, rate limit exceeded, or provider-specific throttling errors.
    • Make sure retries use exponential backoff and stop after a small number of attempts.
  4. Review AutoGen flow topology

    • Look at agent handoffs, tool chains, and group chat loops.
    • A single prompt may trigger planner → executor → reviewer → planner recursion if termination is weak.
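A minimal counting sketch for step 1, again assuming a wrappable create(...) method (countCalls is an illustrative name):

// Sketch: count outbound model calls so you can see fan-out directly
let outboundCalls = 0;

function countCalls<T extends { create: (...args: any[]) => Promise<any> }>(
  client: T
): T {
  const original = client.create.bind(client);
  client.create = ((...args: any[]) => {
    console.log(`model call #${++outboundCalls}`);
    return original(...args);
  }) as T["create"];
  return client;
}

// Reset outboundCalls before each user action; if one prompt produces
// ten calls, the fan-out is in your orchestration, not your traffic.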

Prevention

  • Reuse one OpenAIChatCompletionClient per process or per tenant instead of creating clients in tight loops.
  • Put a hard concurrency cap on agent runs with something like p-limit.
  • Add exponential backoff with jitter for any retry path that handles 429 or rate limit exceeded.

If you’re scaling AutoGen in TypeScript across multiple agents, assume burst traffic by default. Design for queueing first, then increase throughput only after you’ve measured real provider limits and actual call volume.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

