# How to Fix 'rate limit exceeded in production' in AutoGen (TypeScript)
If you’re seeing `rate limit exceeded` in production with AutoGen in TypeScript, the model provider is rejecting requests because your app is sending too many tokens, too many calls, or both. In practice, this shows up when you move from local testing to real traffic and your agent starts looping, spawning parallel calls, or retrying aggressively.
The error usually comes from the underlying OpenAI-compatible client, not AutoGen itself. The stack trace often includes `429 Too Many Requests`, `RateLimitError`, or an SDK-specific message like `You exceeded your current quota`.
## The Most Common Cause
The #1 cause is uncontrolled agent loops or repeated tool calls inside a chat session. In AutoGen TypeScript, this often happens when `maxTurns` is too high, the agent keeps re-asking the model for clarification, or your orchestration code retries on every failure without backoff.
Here’s the broken pattern next to its fix:
| Broken | Fixed |
|---|---|
| Keeps calling the model until it “looks done” | Caps turns and adds stop conditions |
| Retries immediately on 429 | Uses exponential backoff |
| No request throttling | Serializes or limits concurrent runs |
```typescript
// ❌ Broken: unbounded loop + aggressive retries
import { AssistantAgent } from "@autogen/agent";
import { OpenAIChatCompletionClient } from "@autogen/openai";

const modelClient = new OpenAIChatCompletionClient({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_API_KEY!,
});

const agent = new AssistantAgent({
  name: "support_agent",
  modelClient,
});

async function runTicket(ticket: string) {
  let result;
  // Loops until the last message happens to contain "done".
  // Every extra iteration is another rate-limited API call.
  while (true) {
    result = await agent.run([{ role: "user", content: ticket }]);
    if (result.messages.at(-1)?.content?.includes("done")) break;
  }
  return result;
}
```
```typescript
// ✅ Fixed: bounded turns + backoff-friendly orchestration
import { AssistantAgent } from "@autogen/agent";
import { OpenAIChatCompletionClient } from "@autogen/openai";

const modelClient = new OpenAIChatCompletionClient({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_API_KEY!,
});

const agent = new AssistantAgent({
  name: "support_agent",
  modelClient,
});

async function runTicket(ticket: string) {
  // Cap the conversation: the run ends after 3 turns even if the
  // agent never declares itself "done".
  return await agent.run(
    [{ role: "user", content: ticket }],
    { maxTurns: 3 }
  );
}
```
If you’re wrapping agent.run() yourself, do not spin forever waiting for a “good enough” answer. Set explicit turn limits and fail closed when the agent can’t converge.
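Here is one way to do that, as a minimal sketch. It reuses the `agent` and the `maxTurns` option from the fixed example above; the `AgentDidNotConvergeError` class is a name invented for this sketch, not an AutoGen export:

```typescript
// Hypothetical fail-closed wrapper around the fixed example above.
class AgentDidNotConvergeError extends Error {}

async function runTicketSafely(ticket: string) {
  const result = await agent.run(
    [{ role: "user", content: ticket }],
    { maxTurns: 3 }
  );

  // Fail closed: if the turn budget ran out without a usable final
  // message, raise a clear error instead of asking for more turns.
  const last = result.messages.at(-1)?.content;
  if (!last) {
    throw new AgentDidNotConvergeError(
      `No final answer within 3 turns for ticket: ${ticket.slice(0, 80)}`
    );
  }
  return result;
}
```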
## Other Possible Causes
### 1) Too much concurrency
If your service handles multiple user requests at once and each request spawns several AutoGen runs, you can blow through requests-per-minute (RPM) limits fast.
```typescript
// Bad: firehose concurrency, one agent run per request, all at once
await Promise.all(requests.map((r) => agent.run(r.messages)));
```
Use a queue or a concurrency limiter:
```typescript
// Better: limit parallel runs
import pLimit from "p-limit";

const limit = pLimit(2); // at most 2 agent runs in flight at a time

await Promise.all(
  requests.map((r) => limit(() => agent.run(r.messages)))
);
```
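If you would rather not add a dependency, a coarser batching loop gives similar protection. A minimal sketch, assuming the same `requests` and `agent` as above; true sliding-window concurrency (what `p-limit` does) keeps the pipeline fuller, but batching is easier to reason about:

```typescript
// Dependency-free alternative: run requests in fixed-size batches.
// Each batch must finish completely before the next one starts.
async function runInBatches<T, R>(
  items: T[],
  batchSize: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    results.push(...(await Promise.all(batch.map(fn))));
  }
  return results;
}

const responses = await runInBatches(requests, 2, (r) => agent.run(r.messages));
```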
### 2) Large prompts and context bloat
A long chat history increases token usage per call. In production, this can hit tokens-per-minute (TPM) limits even if request count looks fine.
```typescript
// Bad: send full history forever
const messages = [...conversationHistory, userMessage];
```
Trim aggressively:
```typescript
// Better: keep only the last few turns, then append the new message
const messages = [...conversationHistory.slice(-8), userMessage];
```
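If counting messages is too coarse, you can trim by an approximate token budget instead. A minimal sketch, assuming the rough rule of thumb of 4 characters per token and the same message shape as the snippets above; for exact counts you would use a real tokenizer:

```typescript
// Rough token estimate: ~4 characters per token is a common heuristic.
const approxTokens = (text: string) => Math.ceil(text.length / 4);

function trimToBudget(
  history: { role: string; content: string }[],
  budget = 2000
) {
  const kept: { role: string; content: string }[] = [];
  let used = 0;
  // Walk backwards so the most recent turns survive.
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = approxTokens(history[i].content);
    if (used + cost > budget) break;
    kept.unshift(history[i]);
    used += cost;
  }
  return kept;
}

const messages = [...trimToBudget(conversationHistory), userMessage];
```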
### 3) Retry storms from middleware
A common pattern is retrying every failed call instantly. If the provider returns 429, your app just hammers it harder.
```typescript
// Bad: immediate retry loop that swallows every error
for (let i = 0; i < 5; i++) {
  try {
    return await agent.run(input);
  } catch (e) {
    // ignored: hammers the provider again with zero delay
  }
}
```
Use exponential backoff with jitter:
```typescript
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

for (let i = 0; i < 5; i++) {
  try {
    return await agent.run(input);
  } catch (e: any) {
    // Only retry rate-limit errors; rethrow everything else.
    if (!String(e?.message).includes("429")) throw e;
    // Exponential backoff (500ms, 1s, 2s, ...) plus up to 250ms of jitter.
    await sleep((2 ** i) * 500 + Math.random() * 250);
  }
}
// Don't fall through silently once retries are exhausted.
throw new Error("Rate limited: retries exhausted");
```
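If more than one call site needs this, it is worth extracting the policy into a single helper so retry behavior cannot drift between them. A minimal sketch under the same assumption as above, that rate-limit errors mention `429` in their message; if your SDK exposes a structured status code, check that instead:

```typescript
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// Hypothetical reusable wrapper: retries only on rate-limit errors,
// with exponential backoff and jitter, then gives up loudly.
async function withBackoff<T>(
  fn: () => Promise<T>,
  { retries = 5, baseMs = 500 } = {}
): Promise<T> {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (e: any) {
      // Assumption: rate-limit errors contain "429" in the message.
      if (!String(e?.message).includes("429")) throw e;
      await sleep((2 ** i) * baseMs + Math.random() * 250);
    }
  }
  throw new Error(`Rate limited after ${retries} retries`);
}

// Usage: const result = await withBackoff(() => agent.run(input));
```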
### 4) Wrong model or quota settings
Sometimes the issue is not AutoGen at all. You may be pointing at a smaller quota tier, the wrong org/project, or a deployment with stricter limits.
```typescript
const client = new OpenAIChatCompletionClient({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_API_KEY!,
});
```
Check:
- API key belongs to the right project/org
- Billing is active
- Model name matches what your account can use
- Azure/OpenAI endpoint config is correct if you’re not using plain OpenAI
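A cheap guard here is to fail fast at startup when required config is missing, so a bad deploy surfaces as a clear boot error rather than a confusing provider error later. A minimal sketch; `OPENAI_API_KEY` matches the snippets above, and any extra variable names you add to the list are your own:

```typescript
// Fail fast if required configuration is absent at boot.
const requiredEnv = ["OPENAI_API_KEY"];

for (const name of requiredEnv) {
  if (!process.env[name]) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
}
```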
## How to Debug It
- **Inspect the exact exception.** Look for `429 Too Many Requests`, `RateLimitError`, or provider text like `You exceeded your current quota`. If you only see a generic wrapper error from AutoGen, log the full nested error object.
- **Measure request rate and token usage.** Count how many times `agent.run()` fires per user action. Also log prompt size and response size so you can tell whether RPM or TPM is the real bottleneck.
- **Disable parallelism temporarily.** Run one request at a time in production-like conditions. If the error disappears, concurrency is your problem.
- **Shorten the conversation.** Cut history to the last few turns and rerun. If rate limits stop, your context window was causing oversized requests.
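The first two checks are easier if every run goes through one instrumented wrapper. A minimal sketch; the counter and `console.log` are stand-ins for your real metrics client, and the message shape follows the earlier snippets:

```typescript
let runCount = 0;

// Hypothetical instrumentation wrapper around the agent used above.
async function instrumentedRun(
  messages: { role: string; content: string }[]
) {
  const callId = ++runCount;
  const promptChars = messages.reduce((n, m) => n + m.content.length, 0);
  const started = Date.now();
  try {
    const result = await agent.run(messages);
    console.log({ callId, promptChars, ms: Date.now() - started, ok: true });
    return result;
  } catch (e) {
    // Log the full nested error: the provider's 429 details often live
    // inside a wrapped error object rather than the top-level message.
    console.error({ callId, promptChars, ms: Date.now() - started, error: e });
    throw e;
  }
}
```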
## Prevention
- Put hard caps on:
  - `maxTurns`
  - concurrent agent runs
  - retry count per request
- Add observability:
  - log every AutoGen call
  - record provider status codes
  - track tokens per request
- Treat retries as a controlled system:
  - exponential backoff
  - jitter
  - a circuit breaker after repeated `429`s (see the sketch below)
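The circuit breaker is the piece teams most often skip. A minimal sketch of the idea: trip open after a few consecutive rate-limit errors, reject calls while open, and close again after a cooldown. The thresholds and the message-based `429` check are assumptions you would tune for your SDK:

```typescript
// Minimal circuit breaker: trips open after `threshold` consecutive
// 429s, rejects calls during `cooldownMs`, then admits traffic again.
class RateLimitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private threshold = 3,
    private cooldownMs = 30_000
  ) {}

  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.threshold) {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error("Circuit open: backing off after repeated 429s");
      }
      this.failures = 0; // cooldown elapsed, allow a fresh attempt
    }
    try {
      const result = await fn();
      this.failures = 0;
      return result;
    } catch (e: any) {
      // Assumption: rate-limit errors contain "429" in the message.
      if (String(e?.message).includes("429")) {
        this.failures++;
        this.openedAt = Date.now();
      }
      throw e;
    }
  }
}

const breaker = new RateLimitBreaker();
// Usage: await breaker.run(() => agent.run(messages));
```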
If you want this to stay stable in production, design for quota pressure up front. AutoGen will happily keep asking for another turn unless you tell it not to.
## Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.