AutoGen Tutorial (Python): rate limiting API calls for advanced developers

By Cyprian Aarons. Updated 2026-04-21.

This tutorial shows you how to rate limit AutoGen-driven API calls in Python using a production-friendly wrapper around your model client. You need this when multiple agents, retries, or tool calls can easily exceed provider quotas and start throwing 429s.

What You'll Need

  • Python 3.10+
  • autogen-agentchat
  • autogen-ext
  • openai
  • An OpenAI API key set as OPENAI_API_KEY
  • Basic familiarity with AutoGen agents and model clients
  • A terminal with pip available

Step-by-Step

  1. Install the packages and confirm the OpenAI client is available.
    We’ll use AutoGen’s async OpenAI chat client and wrap it with a rate limiter before handing it to the agent.
pip install -U autogen-agentchat "autogen-ext[openai]"
  2. Create a small sliding-window rate limiter.
    This version caps the number of requests in a rolling time window and uses an async lock so concurrent agent calls don’t race each other.
import asyncio
import time


class RateLimiter:
    def __init__(self, max_calls: int, period_seconds: float):
        self.max_calls = max_calls
        self.period_seconds = period_seconds
        self.calls = []
        self.lock = asyncio.Lock()

    async def acquire(self):
        while True:
            async with self.lock:
                now = time.monotonic()
                cutoff = now - self.period_seconds
                # Drop timestamps that have aged out of the window.
                self.calls = [t for t in self.calls if t > cutoff]

                if len(self.calls) < self.max_calls:
                    self.calls.append(now)
                    return

                # Over the limit: wait until the oldest call leaves the
                # window. Sleep outside the lock, because asyncio.Lock is
                # not reentrant and holding it while sleeping would stall
                # every other caller.
                sleep_for = self.period_seconds - (now - self.calls[0])
            await asyncio.sleep(max(sleep_for, 0))
  3. Wrap the AutoGen model client with the limiter.
    The wrapper forces every request through acquire() before delegating to the real OpenAI client, and passes everything else (model_info, streaming, token counting) straight through so AutoGen can still introspect it.
from autogen_ext.models.openai import OpenAIChatCompletionClient


class RateLimitedModelClient:
    def __init__(self, client: OpenAIChatCompletionClient, limiter: RateLimiter):
        self._client = client
        self._limiter = limiter

    async def create(self, *args, **kwargs):
        # Every completion request waits for a slot before hitting the API.
        await self._limiter.acquire()
        return await self._client.create(*args, **kwargs)

    async def close(self):
        await self._client.close()

    def __getattr__(self, name):
        # Delegate everything else (model_info, create_stream, count_tokens,
        # and so on) to the wrapped client so AssistantAgent can use it.
        return getattr(self._client, name)
  4. Wire the client into an AutoGen assistant agent and run a few calls.
    The example below uses a single assistant agent; in real systems you’d apply the same wrapper to any shared model client used by multiple agents.
import asyncio

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import TextMessage
from autogen_core import CancellationToken


async def main():
    # The client picks up OPENAI_API_KEY from the environment by default.
    base_client = OpenAIChatCompletionClient(model="gpt-4o-mini")
    limiter = RateLimiter(max_calls=2, period_seconds=10)
    client = RateLimitedModelClient(base_client, limiter)

    agent = AssistantAgent(
        name="assistant",
        model_client=client,
        system_message="You are a concise assistant.",
    )

    for i in range(4):
        response = await agent.on_messages(
            [TextMessage(content=f"Say hello {i}", source="user")],
            cancellation_token=CancellationToken(),
        )
        print(response.chat_message.content)

    await client.close()


if __name__ == "__main__":
    asyncio.run(main())
  5. Add backoff for provider-side 429s.
    Local throttling helps, but you still want retry logic because other processes may share your quota or the provider may enforce burst limits.
import random


async def call_with_retry(agent, message, attempts=5):
    delay = 1.0
    for attempt in range(attempts):
        try:
            return await agent.on_messages(
                [message], cancellation_token=CancellationToken()
            )
        except Exception as e:
            # Crude but portable: matching on "429" in the message catches
            # rate-limit errors regardless of which exception type the
            # provider SDK raises. Re-raise anything else, and re-raise on
            # the final attempt.
            if "429" not in str(e) or attempt == attempts - 1:
                raise
            # Exponential backoff with jitter so parallel workers don't
            # retry in lockstep.
            await asyncio.sleep(delay + random.uniform(0, 0.25))
            delay *= 2

Testing It

Run the script and watch the timestamps between responses if you add logging inside acquire(). With max_calls=2 and period_seconds=10, the third and fourth requests should pause instead of firing immediately.
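You can verify that pause without touching the agent at all. The snippet below is a hypothetical harness (demo_limiter_timing is not an AutoGen helper); it assumes the RateLimiter class from step 2 is already defined and drives only the limiter, with no API calls.

import asyncio
import time


async def demo_limiter_timing():
    limiter = RateLimiter(max_calls=2, period_seconds=10)
    start = time.monotonic()
    for i in range(4):
        await limiter.acquire()
        # Calls 0 and 1 should print near t=0; calls 2 and 3 near t=10.
        print(f"call {i} granted at t={time.monotonic() - start:.1f}s")


asyncio.run(demo_limiter_timing())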

To verify concurrency control, start two tasks that call the same shared client at once; only two total requests should pass through per 10-second window. If you still see 429s, lower your local limit further or add retry handling around every top-level workflow entrypoint.
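To script the concurrency check, something like the following works, assuming the agent from step 4 is in scope (concurrent_smoke_test is a name invented here, not an AutoGen API):

import asyncio

from autogen_agentchat.messages import TextMessage
from autogen_core import CancellationToken


async def concurrent_smoke_test(agent):
    async def one(i):
        return await agent.on_messages(
            [TextMessage(content=f"ping {i}", source="user")],
            cancellation_token=CancellationToken(),
        )

    # All four tasks start together, but the shared limiter should let
    # only two requests through per 10-second window.
    results = await asyncio.gather(*(one(i) for i in range(4)))
    for r in results:
        print(r.chat_message.content)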

A good production check is to simulate burst traffic from multiple agents sharing one client instance. If your limiter is working correctly, request spikes should flatten without changing your agent code.

Next Steps

  • Add per-model and per-provider limiters so GPT-4o and embeddings don’t compete for the same budget (see the first sketch after this list).
  • Replace the simple list-based limiter with Redis when you need distributed rate limiting across workers (see the second sketch after this list).
  • Combine this with circuit breakers and structured retries for cleaner failure handling under sustained quota pressure.
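A starting point for per-model budgets is a small registry keyed by model name. This is a sketch built on the RateLimiter from step 2; the class name, model names, and limits are illustrative, not provider-published quotas.

class LimiterRegistry:
    def __init__(self):
        self._limiters = {
            "gpt-4o": RateLimiter(max_calls=60, period_seconds=60),
            "gpt-4o-mini": RateLimiter(max_calls=200, period_seconds=60),
            "text-embedding-3-small": RateLimiter(max_calls=500, period_seconds=60),
        }

    def for_model(self, model: str) -> RateLimiter:
        # Unknown models fall back to a conservative default bucket.
        return self._limiters.setdefault(
            model, RateLimiter(max_calls=30, period_seconds=60)
        )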
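For the distributed case, one common pattern is a fixed-window counter in Redis built from INCR and EXPIRE. The sketch below uses the redis-py asyncio client; treat it as a rough starting point (RedisFixedWindowLimiter is a made-up name, and a production version would also handle the race where a worker dies between INCR and EXPIRE).

import asyncio

import redis.asyncio as aioredis  # pip install redis


class RedisFixedWindowLimiter:
    def __init__(self, redis_url: str, key: str, max_calls: int, period_seconds: int):
        self.redis = aioredis.from_url(redis_url)
        self.key = key
        self.max_calls = max_calls
        self.period_seconds = period_seconds

    async def acquire(self):
        while True:
            count = await self.redis.incr(self.key)
            if count == 1:
                # First call in a fresh window starts the window timer.
                await self.redis.expire(self.key, self.period_seconds)
            if count <= self.max_calls:
                return
            # Over budget: wait for the current window to expire, then retry.
            ttl = await self.redis.ttl(self.key)
            await asyncio.sleep(max(ttl, 1))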

By Cyprian Aarons, AI Consultant at Topiax.