How to Fix 'rate limit exceeded' in AutoGen (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: rate-limit-exceeded, autogen, python

What the error means

rate limit exceeded in AutoGen usually means your app is sending too many LLM requests in a short window, or you’re using a model/org key that has stricter quota than you expected. In practice, it shows up when agents loop too aggressively, multiple agents fire at once, or retries keep hammering the API after a failure.

The actual exception often looks like this:

openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit exceeded', 'type': 'rate_limit_exceeded', ...}}
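
If you want to handle it explicitly rather than let it bubble up, you can catch it at the call site. A minimal sketch, assuming the openai v1 Python client (which raises openai.RateLimitError) and an already-configured assistant agent with a messages list:

import openai

try:
    reply = assistant.generate_reply(messages)
except openai.RateLimitError as e:
    # 429 from the provider: slow down, back off, or reduce concurrency
    print(f"Rate limited: {e}")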

The Most Common Cause

The #1 cause is an agent conversation that keeps auto-replying without a hard stop. In AutoGen, AssistantAgent and UserProxyAgent can easily create a loop if your termination condition is weak or missing.

Here’s the broken pattern:

Broken                                             | Fixed
Infinite or long-running auto-reply loop           | Explicit max turns / termination condition
No backoff between calls                           | Controlled request pacing
Multiple agents responding to each other endlessly | One agent initiates, one agent terminates

# BROKEN
from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={"config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}]},
)

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
)

# This can keep going until the API rate limit is hit
user_proxy.initiate_chat(
    assistant,
    message="Draft a policy summary and keep improving it until it's perfect."
)

# FIXED
from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={
        "config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}],
        "max_tokens": 800,
    },
)

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
)

# Hard stop: limit turns and give the model a concrete end condition
user_proxy.initiate_chat(
    assistant,
    message=(
        "Draft a policy summary in 5 bullets. "
        "Stop after one response."
    ),
    max_turns=2,
)

If you’re using GroupChat, the same problem gets worse because multiple agents may respond in sequence. Set max_round, and where appropriate constrain speaker selection so only one agent speaks per turn, as in the sketch below.
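
For example, with the classic pyautogen API (a sketch reusing the agents defined above; max_round and the speaker selection are illustrative values):

from autogen import GroupChat, GroupChatManager

groupchat = GroupChat(
    agents=[user_proxy, assistant],
    messages=[],
    max_round=6,                             # hard cap on total rounds
    speaker_selection_method="round_robin",  # deterministic turn order instead of model-picked speakers
)
manager = GroupChatManager(groupchat=groupchat, llm_config=assistant.llm_config)

user_proxy.initiate_chat(manager, message="Draft a policy summary in 5 bullets.")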

Other Possible Causes

1) Too many parallel requests

If you fan out tasks with asyncio.gather() or multiprocessing, you can exceed per-minute limits very quickly.

# Too aggressive
results = await asyncio.gather(*[
    agent.a_generate_reply(messages) for _ in range(20)
])

Use a semaphore or queue:

import asyncio

sem = asyncio.Semaphore(3)  # at most 3 LLM requests in flight at once

async def limited_call(task):
    async with sem:
        return await agent.a_generate_reply(task)

results = await asyncio.gather(*[limited_call(t) for t in tasks])

2) Retries without backoff

A failed request retried immediately just compounds the problem.

# Bad retry behavior: instant retry loop
def call_with_retries(messages):
    for _ in range(5):
        try:
            return assistant.generate_reply(messages)
        except Exception:
            pass  # swallow the error and immediately hammer the API again

Use exponential backoff (adding random jitter also helps when several workers retry at the same time):

import time

import openai

def call_with_backoff(messages, retries=5):
    delay = 1
    for attempt in range(retries):
        try:
            return assistant.generate_reply(messages)
        except openai.RateLimitError:  # retry only rate-limit errors; let other failures surface
            if attempt == retries - 1:
                raise  # out of retries
            time.sleep(delay)
            delay *= 2  # 1s, 2s, 4s, 8s between attempts

3) Token-heavy prompts causing repeated truncation/retries

Huge chat histories can push requests into failures that look like rate limiting when the client retries behind the scenes.

llm_config = {
    "config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}],
}
# history grows forever unless trimmed

Cap output size in the config, then trim history and summarize state:

llm_config = {
    "config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}],
    "max_tokens": 600,
}

Then periodically reset or summarize conversation state instead of passing every prior message; a simple trimming helper is sketched below.
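
A minimal sketch, assuming you build the messages list yourself before calling generate_reply; MAX_MESSAGES and trimmed are our names, not AutoGen API:

MAX_MESSAGES = 12  # assumption: the most recent turns carry most of the useful context

def trimmed(messages):
    # Keep the system message (if any) plus only the most recent turns
    system = [m for m in messages if m.get("role") == "system"][:1]
    recent = [m for m in messages if m.get("role") != "system"][-MAX_MESSAGES:]
    return system + recent

reply = assistant.generate_reply(trimmed(messages))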

4) Wrong key, wrong org, or low quota

Sometimes the issue is not traffic volume but account limits. A dev key, free-tier org, or stale environment variable can trigger 429 responses fast.

export OPENAI_API_KEY=sk-...
export OPENAI_ORG=org-old-value   # stale org can route you to the wrong quota bucket

Check what your process actually sees:

import os

key = os.getenv("OPENAI_API_KEY")
print(key[:8] if key else "OPENAI_API_KEY is not set")  # avoid crashing if the variable is missing
print(os.getenv("OPENAI_ORG"))

How to Debug It

  1. Confirm where the 429 comes from

    • Look at the stack trace.
    • If it points to openai.RateLimitError, it’s an API quota/rate issue.
    • If it happens inside AutoGen agent loops, suspect repeated chat turns.
  2. Count actual LLM calls

    • Add logging around every generate_reply() / a_generate_reply() call.
    • If one user action triggers 10+ calls, your agent orchestration is too chatty (a call-counting wrapper is sketched after this list).
  3. Reduce concurrency to 1

    • Temporarily remove asyncio.gather(), threads, and parallel workers.
    • If the error disappears, you’ve found a throughput problem.
  4. Inspect config values

    • Check model name, API key, org/project settings, and max turn settings.
    • Verify whether you’re using AssistantAgent, UserProxyAgent, or GroupChatManager with unbounded rounds.
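
A quick way to count calls is to wrap the agent's reply method yourself. A minimal sketch; counted and call_count are our names, and it assumes the synchronous generate_reply entry point:

import functools

call_count = 0

def counted(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        global call_count
        call_count += 1
        print(f"LLM call #{call_count}")
        return fn(*args, **kwargs)
    return wrapper

# Wrap the bound method on the instance so every reply is logged
assistant.generate_reply = counted(assistant.generate_reply)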

Prevention

  • Set hard limits everywhere:

    • max_turns
    • max_round
    • request concurrency caps
    • retry ceilings with exponential backoff
  • Summarize long conversations instead of carrying full history forever.

  • Put rate-aware wrappers around AutoGen calls so one bad workflow cannot flood your provider quota (see the pacing sketch below).
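
A minimal single-process pacing sketch (RateLimiter is our name, not an AutoGen API; tune min_interval to your provider's per-minute limits):

import time

class RateLimiter:
    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval  # seconds between consecutive calls
        self._last = 0.0

    def wait(self):
        now = time.monotonic()
        sleep_for = self._last + self.min_interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()

limiter = RateLimiter(min_interval=1.0)

limiter.wait()  # call before every LLM request
reply = assistant.generate_reply(messages)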

If you want one rule of thumb: when AutoGen starts behaving like a loop machine, assume the problem is orchestration first and provider limits second.

