How to Fix 'rate limit exceeded in production' in AutoGen (Python)

By Cyprian Aarons · Updated 2026-04-21

What the error means

A “rate limit exceeded” error in production usually means your AutoGen agents are sending too many LLM requests in a short window, or they’re retrying in a loop after failures. In practice, this shows up when you run multi-agent conversations, recursive tool calls, or parallel agent tasks against OpenAI or Azure OpenAI.

The failure is rarely “just the API is down.” It usually means your agent orchestration is generating more calls than your quota, RPM/TPM limits, or retry policy can handle.

The Most Common Cause

The #1 cause is uncontrolled agent chatter: agents keep replying to each other without a hard stop, and every turn triggers another model call. In AutoGen, this often happens with AssistantAgent + UserProxyAgent when max_consecutive_auto_reply is too high, or when the conversation loop has no termination condition.

Here’s the broken pattern:

Broken code:

```python
import os

from autogen import AssistantAgent, UserProxyAgent

llm_config = {"model": "gpt-4o-mini", "api_key": os.environ["OPENAI_API_KEY"]}

assistant = AssistantAgent(name="assistant", llm_config=llm_config)

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=20,  # up to 20 auto-replies per chat
)

user_proxy.initiate_chat(assistant, message="Review this contract and suggest changes.")
```

Fixed code:

```python
import os

from autogen import AssistantAgent, UserProxyAgent

llm_config = {"model": "gpt-4o-mini", "api_key": os.environ["OPENAI_API_KEY"]}

assistant = AssistantAgent(name="assistant", llm_config=llm_config)

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=3,  # hard cap on back-and-forth turns
)

user_proxy.initiate_chat(assistant, message="Review this contract and suggest changes.")
```


The fix is not just lowering a number. You need to make the conversation terminate predictably.

```python
def is_termination_msg(msg):
    content = msg.get("content") or ""  # content can be None (e.g., tool calls)
    return "DONE" in content or "TERMINATE" in content

assistant = AssistantAgent(
    name="assistant",
    llm_config=llm_config,
)

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=3,
    is_termination_msg=is_termination_msg,
)
```

If your agents never emit a stop signal, AutoGen will keep calling the model until you hit RPM/TPM limits.
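
The stop signal has to come from somewhere. A minimal sketch, assuming you keep the DONE/TERMINATE convention above: instruct the assistant through its `system_message` to emit the token when the task is finished.

```python
# Instruct the assistant to emit the stop token that is_termination_msg checks for
assistant = AssistantAgent(
    name="assistant",
    system_message=(
        "You are a contract reviewer. When your review is complete, "
        "end your final message with the word TERMINATE."
    ),
    llm_config=llm_config,
)
```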

Other Possible Causes

1. Parallel tasks are firing too many requests at once

If you use multiple agents or spawn concurrent chats, you can blow through request limits fast.

```python
# Risky: many chats at once
await asyncio.gather(*[
    run_case(case) for case in cases
])
```

Use a semaphore or batch work.

```python
import asyncio

# At most two cases in flight at any time
sem = asyncio.Semaphore(2)

async def limited_run(case):
    async with sem:
        return await run_case(case)

await asyncio.gather(*(limited_run(case) for case in cases))
```
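
With `asyncio.Semaphore(2)`, at most two chats run concurrently; size the limit to your RPM headroom divided by the average number of model calls a single chat makes.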

2. Retry logic is multiplying the traffic

Some production stacks wrap AutoGen calls with retries on top of SDK retries. That turns one failed request into three or four immediate retries.

```python
from tenacity import retry, stop_after_attempt, wait_fixed

# Bad: retry storm -- five back-to-back attempts with zero wait,
# stacked on top of the OpenAI SDK's own retries
@retry(stop=stop_after_attempt(5), wait=wait_fixed(0))
def call_agent():
    return user_proxy.initiate_chat(assistant, message="Process claim")

Use exponential backoff and cap attempts.

```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def call_agent():
    return user_proxy.initiate_chat(assistant, message="Process claim")
```
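
You can tighten this further by retrying only on rate-limit errors, so genuine failures surface immediately instead of burning attempts. A sketch using tenacity's `retry_if_exception_type` with the OpenAI SDK's `RateLimitError`:

```python
import openai
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Retry only on 429s; any other exception propagates on the first attempt
@retry(
    retry=retry_if_exception_type(openai.RateLimitError),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
)
def call_agent():
    return user_proxy.initiate_chat(assistant, message="Process claim")
```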

3. Your prompt is causing long outputs and extra turns

Long responses increase token usage and can trigger TPM limits even if request count looks fine.

```python
assistant = AssistantAgent(
    name="assistant",
    llm_config={
        "model": "gpt-4o-mini",
        "api_key": os.environ["OPENAI_API_KEY"],
        "max_tokens": 4000,
        "temperature": 0.7,
    },
)
```

Reduce output size and force concise answers.

```python
assistant = AssistantAgent(
    name="assistant",
    llm_config={
        "model": "gpt-4o-mini",
        "api_key": os.environ["OPENAI_API_KEY"],
        "max_tokens": 800,
        "temperature": 0.2,
        "cache_seed": 42,
    },
)
```
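
The `cache_seed` setting also turns on AutoGen's built-in response cache, so repeated identical requests are served from the local cache instead of hitting the API again.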

4. Multiple workers are sharing one quota unexpectedly

A common production mistake is deploying several replicas with no coordination. Each worker thinks it has its own budget.

```yaml
# Example: 6 pods all using the same API key
replicas: 6
env:
  - name: OPENAI_API_KEY
```
If your org rate limit is low, add a global queue or reduce concurrency per pod.
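
One rough way to reduce per-pod concurrency is to give each replica an equal slice of a global budget. A sketch; `REPLICAS` and `GLOBAL_CONCURRENCY` are hypothetical environment variables you would set in the deployment manifest:

```python
import asyncio
import os

# Hypothetical env vars -- set both in the deployment manifest
replicas = int(os.environ.get("REPLICAS", "1"))
global_budget = int(os.environ.get("GLOBAL_CONCURRENCY", "12"))

# Each pod takes an equal share of the org-wide concurrency budget
per_pod_limit = max(1, global_budget // replicas)
request_gate = asyncio.Semaphore(per_pod_limit)
```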

How to Debug It

  1. Inspect the exact exception. Look for messages like:

    • openai.RateLimitError: Error code: 429
    • Rate limit reached for gpt-4o-mini
    • Too Many Requests

    If you see 429, don’t assume an AutoGen bug first. It’s a capacity problem or a retry storm.

  2. Count model calls per chat. Log every LLM request from your AutoGen wrapper (see the sketch after this list).

    print("Calling assistant...")
    result = user_proxy.initiate_chat(assistant, message=msg)
    

    If one user action triggers dozens of calls, your termination logic is wrong.

  3. Check concurrency. Search for:

    • asyncio.gather
    • thread pools
    • multiple Celery workers
    • parallel job runners

    If requests spike at deployment time but not locally, concurrency is usually the reason.

  4. Disable retries temporarily. Turn off app-level retries and observe the raw failure rate.

    If errors disappear when retries are removed, you had retry amplification instead of a true capacity issue.
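
For step 2, a minimal way to count calls, assuming pyautogen's `ChatResult` return type (`initiate_chat` returns an object with a `chat_history` list and a `cost` summary):

```python
# Count how many turns one user action actually produced
result = user_proxy.initiate_chat(assistant, message="Process claim")
print(f"turns in this chat: {len(result.chat_history)}")
print(f"cost summary: {result.cost}")
```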

Prevention

  • Set hard limits on agent loops:

    • max_consecutive_auto_reply
    • explicit termination messages
    • bounded task plans
  • Control traffic at the orchestration layer:

    • semaphore-based concurrency caps
    • queue-based job dispatch
    • per-tenant rate limiting
  • Keep production prompts short and deterministic:

    • lower max_tokens
    • reduce temperature
    • avoid open-ended “keep going until done” instructions

If you’re running AutoGen in production and seeing repeated 429 errors, assume the system is over-calling the model before you assume the provider is broken. Fix the conversation boundaries first; then tune concurrency and retries.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

