How to Fix 'rate limit exceeded when scaling' in AutoGen (Python)

By Cyprian Aarons · Updated 2026-04-21

What the error means

rate limit exceeded when scaling usually means your AutoGen app is creating more model calls than your provider allows, and the spike gets worse as you add agents, parallel tasks, or recursive group chats. In practice, this shows up when a script works with 1–2 agents, then starts failing once you scale to many concurrent conversations or let multiple agents call the same LLM at once.

The failure often surfaces as a provider error wrapped by AutoGen, for example:

  • openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit exceeded'}}
  • autogen.oai.client.OpenAIError: rate limit exceeded
  • litellm.RateLimitError: Rate limit exceeded when scaling

The Most Common Cause

The #1 cause is uncontrolled concurrency.

In AutoGen, people often spin up many ConversableAgent instances or run multiple chats at the same time without throttling. Each agent can trigger multiple model calls per turn, so what looks like “10 chats” can become 50+ API requests very quickly.

Broken pattern vs fixed pattern

  • Broken: fires many chats in parallel with no backoff. Fixed: limits concurrency and retries on 429.
  • Broken: reuses one config but lets every agent call freely. Fixed: puts a request gate in front of model calls.
  • Broken: assumes AutoGen will handle provider quotas. Fixed: explicitly handles quota pressure in app code.

# BROKEN: uncontrolled parallelism
import asyncio
from autogen import AssistantAgent

config_list = [
    {
        "model": "gpt-4o-mini",
        "api_key": "YOUR_KEY",
    }
]

agents = [
    AssistantAgent(
        name=f"agent_{i}",
        llm_config={"config_list": config_list},
    )
    for i in range(20)
]

async def run_all():
    # Every agent fires its own chat at agents[0], all at once, with no throttling
    tasks = []
    for agent in agents[1:]:
        tasks.append(agent.a_initiate_chat(
            recipient=agents[0],
            message="Summarize this customer claim.",
        ))
    await asyncio.gather(*tasks)

asyncio.run(run_all())

# FIXED: bounded concurrency + retry/backoff on 429s
import asyncio

import openai
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential_jitter

from autogen import AssistantAgent, UserProxyAgent

config_list = [
    {
        "model": "gpt-4o-mini",
        "api_key": "YOUR_KEY",
    }
]

# At most 3 LLM-backed chats in flight at any moment
semaphore = asyncio.Semaphore(3)

agent = AssistantAgent(
    name="claim_agent",
    llm_config={"config_list": config_list},
)

# Non-LLM initiator: one request, one reply, no self-chat
requester = UserProxyAgent(
    name="requester",
    human_input_mode="NEVER",
    code_execution_config=False,
    max_consecutive_auto_reply=0,  # stop after the assistant's first reply
)

@retry(
    retry=retry_if_exception_type(openai.RateLimitError),  # back off on 429s, not every error
    wait=wait_exponential_jitter(initial=1, max=30),
    stop=stop_after_attempt(5),
)
async def safe_chat(message: str):
    async with semaphore:
        return await requester.a_initiate_chat(
            recipient=agent,
            message=message,
        )

async def run_all():
    tasks = [safe_chat("Summarize this customer claim.") for _ in range(20)]
    await asyncio.gather(*tasks)

asyncio.run(run_all())

The important change is not just retries. You need to cap how many model calls are active at once. If you only add retries without limiting concurrency, you just create a bigger retry storm.

Other Possible Causes

1) Nested agent loops causing hidden extra calls

A GroupChat or nested ConversableAgent flow can multiply requests per user action. One visible turn may trigger planner, critic, executor, and summarizer calls.

# Example of a call multiplier: three LLM-backed agents in one group chat
from autogen import AssistantAgent, GroupChat, GroupChatManager

planner = AssistantAgent(name="planner", llm_config={"config_list": config_list})
coder = AssistantAgent(name="coder", llm_config={"config_list": config_list})
reviewer = AssistantAgent(name="reviewer", llm_config={"config_list": config_list})

groupchat = GroupChat(
    agents=[planner, coder, reviewer],
    messages=[],
    max_round=12,
)

# With the default "auto" speaker selection, the manager also calls the model
# to pick the next speaker each round
manager = GroupChatManager(groupchat=groupchat, llm_config={"config_list": config_list})

If max_round is high and each agent uses the same provider account, quota gets burned fast.
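
If the group chat is your multiplier, cap it explicitly. Here is a minimal sketch building on the agents above; max_round and max_consecutive_auto_reply are standard pyautogen parameters, but the limits of 4 rounds and 2 auto-replies are illustrative, not recommendations.

# Cap the multiplier: fewer rounds per user action, fewer replies per agent.
# The specific values below are illustrative.
planner = AssistantAgent(
    name="planner",
    llm_config={"config_list": config_list},
    max_consecutive_auto_reply=2,  # each agent answers at most twice per chat
)

groupchat = GroupChat(
    agents=[planner, coder, reviewer],
    messages=[],
    max_round=4,  # hard ceiling on total turns per user action
)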

2) No caching for repeated prompts

If your workflow sends the same prompt repeatedly during testing or batch jobs, you are paying for identical completions over and over.

llm_config = {
    "config_list": config_list,
    "cache_seed": None,   # no cache reuse
}

Use caching where appropriate:

llm_config = {
    "config_list": config_list,
    "cache_seed": 42,
}

This does not fix true concurrency issues, but it helps when your workload repeats prompts during evaluation or regression tests.
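
Recent pyautogen 0.2.x releases also expose a Cache helper if you want to scope caching to a specific batch instead of setting cache_seed globally. A minimal sketch, assuming autogen.cache.Cache is available in your version and reusing the requester and claim agent from the fixed example above:

# Scoped disk cache: identical prompts in this block are served from cache
# instead of triggering new completions. Assumes pyautogen 0.2.x.
from autogen.cache import Cache

with Cache.disk(cache_seed=42) as cache:
    for claim in ["claim A", "claim B", "claim A"]:  # the repeated prompt hits the cache
        requester.initiate_chat(agent, message=f"Summarize: {claim}", cache=cache)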

3) Too many tokens per request

A single huge prompt can hit token-rate limits even if the request count is low. This is common when you keep appending the full chat history to every turn.

# Risky: unbounded history growth
messages = conversation_history + [{"role": "user", "content": new_input}]

Trim history and summarize older context:

messages = conversation_history[-8:]  # keep only the most recent turns
messages.insert(0, {"role": "system", "content": summary_of_older_context})
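
If you would rather enforce a budget than eyeball prompt sizes, estimate tokens before sending. A rough sketch, assuming the tiktoken package is installed; the 8,000-token budget and the o200k_base encoding (used by GPT-4o-family models) are illustrative assumptions, so check your own model and provider limits.

# Pre-flight token estimate; the budget and encoding are assumptions,
# not provider-specific values.
import tiktoken

MAX_PROMPT_TOKENS = 8000  # illustrative budget

def estimate_tokens(messages):
    enc = tiktoken.get_encoding("o200k_base")
    return sum(len(enc.encode(m["content"])) for m in messages)

if estimate_tokens(messages) > MAX_PROMPT_TOKENS:
    # Trim harder or re-summarize before calling the model
    messages = [messages[0]] + messages[-4:]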

4) Multiple workers sharing one API key

If you deploy the same AutoGen service across several pods or Celery workers with one API key, each worker thinks it owns the quota.

# Example deployment issue: six replicas, one shared key, one shared quota
replicas: 6
env:
  - name: OPENAI_API_KEY
    value: shared-key-for-all-replicas

That setup is fine only if your provider quota supports aggregate traffic across all replicas. Otherwise you need per-worker throttling or a centralized queue.
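
One mitigation is to size each worker's concurrency from the shared budget instead of letting every replica assume it owns the quota. A minimal sketch; REPLICA_COUNT is a hypothetical environment variable your deployment would have to set, and the global budget of 12 is illustrative.

# Split a shared provider budget across replicas.
# REPLICA_COUNT is a hypothetical env var set by your deployment.
import asyncio
import os

GLOBAL_MAX_CONCURRENT = 12  # what the shared key's quota comfortably tolerates
replicas = int(os.environ.get("REPLICA_COUNT", "1"))

per_worker_limit = max(1, GLOBAL_MAX_CONCURRENT // replicas)
semaphore = asyncio.Semaphore(per_worker_limit)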

How to Debug It

  1. Check whether the failure is request-rate or token-rate

    • Look at the exact exception text.
    • 429 Too Many Requests usually points to request bursts.
    • Messages mentioning tokens per minute point to oversized prompts or long histories.
  2. Log every model call

    • Add logging around each AssistantAgent, UserProxyAgent, and group chat turn.
    • Count how many completions happen per user action.
    • If one input causes 5–10 LLM calls, that's your multiplier (see the sketch after this list).
  3. Disable parallelism temporarily

    • Run everything serially.
    • Replace asyncio.gather(...) with a loop, as in the sketch after this list.
    • If the error disappears, you have a concurrency problem.
  4. Reduce scope until it stops failing

    • Lower max_round.
    • Remove one agent from the group.
    • Shorten prompts and truncate history.
    • If the error goes away after reducing rounds or context size, you found the pressure point.
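
For steps 2 and 3, here is a rough sketch of what that instrumentation can look like, assuming the safe_chat() helper from the fixed example above. The counter, logger name, and chat_history inspection are illustrative, and result attributes may differ across pyautogen versions.

# Count roughly how many LLM turns one user action triggers, then run the
# batch serially to rule out concurrency. Names here are illustrative.
import asyncio
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_calls")

async def traced_chat(message: str):
    result = await safe_chat(message)
    # Each assistant turn in the transcript roughly maps to one completion.
    turns = len(result.chat_history) if hasattr(result, "chat_history") else "unknown"
    log.info("1 user action -> %s chat turns (%r)", turns, message[:60])
    return result

async def run_all_serial():
    # Step 3: no gather(), one chat at a time. If the 429s stop here,
    # concurrency was the problem.
    for _ in range(20):
        await traced_chat("Summarize this customer claim.")

asyncio.run(run_all_serial())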

Prevention

  • Put a semaphore or job queue in front of every LLM call path.
  • Add exponential backoff for 429, not just generic exceptions.
  • Cap chat rounds and trim conversation history before sending it back to the model.
  • Use caching for repeated evaluation runs and deterministic workflows.
  • Treat provider quota as an application dependency, not an afterthought.

If you are seeing this in AutoGen specifically, start by searching for uncontrolled parallelism in your orchestration code. In most production cases I’ve seen, that is the real bug—not AutoGen itself.

