AutoGen Tutorial (Python): rate limiting API calls for advanced developers
This tutorial shows you how to rate limit AutoGen-driven API calls in Python using a production-friendly wrapper around your model client. You need this when multiple agents, retries, or tool calls can easily exceed provider quotas and start throwing 429s.
What You'll Need
- Python 3.10+
- `autogen-agentchat`
- `autogen-ext`
- `openai`
- An OpenAI API key set as `OPENAI_API_KEY`
- Basic familiarity with AutoGen agents and model clients
- A terminal with `pip` available
Step-by-Step
1. Install the packages and confirm the OpenAI client is available.

We’ll use AutoGen’s async OpenAI chat client and wrap it with a rate limiter before handing it to the agent.

```bash
pip install autogen-agentchat autogen-ext openai
```
2. Create a small sliding-window limiter.

This version limits requests per window and uses an async lock so concurrent agent calls don’t race each other. Note that the loop releases the lock before sleeping: `asyncio.Lock` is not reentrant, so re-acquiring it from inside the locked block (for example via a recursive call) would deadlock.

```python
import asyncio
import time


class RateLimiter:
    def __init__(self, max_calls: int, period_seconds: float):
        self.max_calls = max_calls
        self.period_seconds = period_seconds
        self.calls: list[float] = []
        self.lock = asyncio.Lock()

    async def acquire(self) -> None:
        while True:
            async with self.lock:
                now = time.monotonic()
                cutoff = now - self.period_seconds
                # Drop timestamps that have aged out of the window.
                self.calls = [t for t in self.calls if t > cutoff]
                if len(self.calls) < self.max_calls:
                    self.calls.append(now)
                    return
                # The oldest call determines when a slot frees up.
                sleep_for = self.period_seconds - (now - self.calls[0])
            # Sleep outside the lock so other waiters aren't blocked.
            await asyncio.sleep(max(sleep_for, 0.01))
```
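Before wiring the limiter into an agent, it is worth seeing the windowing behavior in isolation. The sketch below re-creates a minimal, self-contained variant of the limiter (repeated here so the snippet runs on its own, with short illustrative numbers instead of real quotas) and times five acquisitions against a 2-calls-per-0.4-second window:

```python
import asyncio
import time


class RateLimiter:
    # Minimal self-contained copy of the sliding-window limiter.
    def __init__(self, max_calls: int, period_seconds: float):
        self.max_calls = max_calls
        self.period_seconds = period_seconds
        self.calls: list[float] = []
        self.lock = asyncio.Lock()

    async def acquire(self) -> None:
        while True:
            async with self.lock:
                now = time.monotonic()
                self.calls = [t for t in self.calls if t > now - self.period_seconds]
                if len(self.calls) < self.max_calls:
                    self.calls.append(now)
                    return
                sleep_for = self.period_seconds - (now - self.calls[0])
            await asyncio.sleep(max(sleep_for, 0.01))


async def demo() -> float:
    limiter = RateLimiter(max_calls=2, period_seconds=0.4)
    start = time.monotonic()
    for _ in range(5):
        await limiter.acquire()
    return time.monotonic() - start


elapsed = asyncio.run(demo())
# Calls 3–4 wait for the first window to expire, call 5 for the second.
print(f"5 acquisitions took {elapsed:.2f}s")
```

Unthrottled, five acquisitions would finish in microseconds; with the window in place the loop takes roughly two window lengths, which is the behavior you want to see before trusting the limiter with real traffic.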
3. Wrap the AutoGen model client with the limiter.

The wrapper delegates to the real OpenAI client but forces every request through `acquire()` first. Agents may also read attributes such as `model_info` directly from the client, so we forward anything we don’t override via `__getattr__`.

```python
from autogen_ext.models.openai import OpenAIChatCompletionClient


class RateLimitedModelClient:
    def __init__(self, client: OpenAIChatCompletionClient, limiter: RateLimiter):
        self.client = client
        self.limiter = limiter

    async def create(self, *args, **kwargs):
        await self.limiter.acquire()
        return await self.client.create(*args, **kwargs)

    async def close(self):
        await self.client.close()

    def __getattr__(self, name):
        # Forward everything else (model_info, capabilities, ...) to the real client.
        return getattr(self.client, name)
```
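The delegation pattern itself is easy to verify without any network access. The following sketch is illustrative only: `FakeClient` and `CountingLimiter` are invented stand-ins (not part of AutoGen) that let you confirm every `create()` call passes through the limiter and that unknown attributes fall through to the wrapped client:

```python
import asyncio


class FakeClient:
    # Invented stand-in for OpenAIChatCompletionClient.
    def __init__(self):
        self.requests = 0
        self.model_info = {"function_calling": True}

    async def create(self, *args, **kwargs):
        self.requests += 1
        return f"response {self.requests}"


class CountingLimiter:
    # Invented stand-in limiter that just counts acquisitions.
    def __init__(self):
        self.acquired = 0

    async def acquire(self):
        self.acquired += 1


class RateLimitedModelClient:
    def __init__(self, client, limiter):
        self.client = client
        self.limiter = limiter

    async def create(self, *args, **kwargs):
        await self.limiter.acquire()
        return await self.client.create(*args, **kwargs)

    def __getattr__(self, name):
        # Anything we don't override falls through to the wrapped client.
        return getattr(self.client, name)


async def demo():
    fake, limiter = FakeClient(), CountingLimiter()
    wrapped = RateLimitedModelClient(fake, limiter)
    results = [await wrapped.create() for _ in range(3)]
    return results, limiter.acquired, wrapped.model_info


results, acquired, info = asyncio.run(demo())
print(results, acquired, info)
```

Three calls through the wrapper produce three limiter acquisitions, and `wrapped.model_info` resolves against the inner client — exactly the two properties the real wrapper needs.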
4. Wire the client into an AutoGen assistant agent and run a few calls.

The example below uses a single assistant agent; in real systems you’d apply the same wrapper to any shared model client used by multiple agents. Note that `on_messages` expects a real `CancellationToken`, and the OpenAI client reads your key from the `OPENAI_API_KEY` environment variable when you don’t pass one explicitly.

```python
import asyncio

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import TextMessage
from autogen_core import CancellationToken


async def main():
    # Picks up the API key from the OPENAI_API_KEY environment variable.
    base_client = OpenAIChatCompletionClient(model="gpt-4o-mini")
    limiter = RateLimiter(max_calls=2, period_seconds=10)
    client = RateLimitedModelClient(base_client, limiter)

    agent = AssistantAgent(
        name="assistant",
        model_client=client,
        system_message="You are a concise assistant.",
    )

    for i in range(4):
        response = await agent.on_messages(
            [TextMessage(content=f"Say hello {i}", source="user")],
            cancellation_token=CancellationToken(),
        )
        print(response.chat_message.content)

    await client.close()


if __name__ == "__main__":
    asyncio.run(main())
```
5. Add backoff for provider-side 429s.

Local throttling helps, but you still want retry logic because other processes may share your quota or the provider may enforce burst limits.

```python
import random

from autogen_core import CancellationToken


async def call_with_retry(agent, message, attempts=5):
    delay = 1.0
    for attempt in range(attempts):
        try:
            return await agent.on_messages(
                [message], cancellation_token=CancellationToken()
            )
        except Exception as e:
            # Only retry rate-limit errors, and give up on the last attempt.
            if "429" not in str(e) or attempt == attempts - 1:
                raise
            # Exponential backoff with jitter to avoid thundering herds.
            await asyncio.sleep(delay + random.uniform(0, 0.25))
            delay *= 2
```
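You can check the retry shape without touching a provider. This sketch drives the same loop against an invented `FlakyAgent` stub that raises a 429-style error for its first two calls (the stub and the shortened base delay are illustrative, not part of AutoGen):

```python
import asyncio
import random


class FlakyAgent:
    # Invented stub: fails with a 429-style error for the first `failures` calls.
    def __init__(self, failures=2):
        self.failures = failures
        self.calls = 0

    async def on_messages(self, messages, cancellation_token=None):
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError("Error code: 429 - rate limit exceeded")
        return "ok"


async def call_with_retry(agent, message, attempts=5):
    delay = 0.01  # shortened from 1.0 so the demo finishes quickly
    for attempt in range(attempts):
        try:
            return await agent.on_messages([message], cancellation_token=None)
        except Exception as e:
            if "429" not in str(e) or attempt == attempts - 1:
                raise
            await asyncio.sleep(delay + random.uniform(0, 0.005))
            delay *= 2


agent = FlakyAgent(failures=2)
result = asyncio.run(call_with_retry(agent, "hello"))
print(result, agent.calls)  # succeeds on the third call
```

Two simulated 429s are absorbed by the backoff and the third call succeeds; a non-429 error or exhausted attempts would re-raise, which is the behavior you want at a workflow boundary.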
Testing It
Run the script and watch the gaps between responses (add logging inside acquire() to see exact timestamps). With max_calls=2 and period_seconds=10, the third and fourth requests should pause instead of firing immediately.
To verify concurrency control, start two tasks that call the same shared client at once; only two total requests should pass through per 10-second window. If you still see 429s, lower your local limit further or add retry handling around every top-level workflow entrypoint.
A good production check is to simulate burst traffic from multiple agents sharing one client instance. If your limiter is working correctly, request spikes should flatten without changing your agent code.
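One way to script that burst check, again with the limiter repeated inline so it runs standalone and with short illustrative numbers: fire nine concurrent tasks through a 3-calls-per-0.3-second limiter and count how many acquisitions land in each window.

```python
import asyncio
import time


class RateLimiter:
    # Self-contained copy of the sliding-window limiter for this check.
    def __init__(self, max_calls: int, period_seconds: float):
        self.max_calls = max_calls
        self.period_seconds = period_seconds
        self.calls: list[float] = []
        self.lock = asyncio.Lock()

    async def acquire(self) -> None:
        while True:
            async with self.lock:
                now = time.monotonic()
                self.calls = [t for t in self.calls if t > now - self.period_seconds]
                if len(self.calls) < self.max_calls:
                    self.calls.append(now)
                    return
                sleep_for = self.period_seconds - (now - self.calls[0])
            await asyncio.sleep(max(sleep_for, 0.01))


async def worker(limiter, timestamps):
    await limiter.acquire()
    timestamps.append(time.monotonic())


async def burst_check():
    limiter = RateLimiter(max_calls=3, period_seconds=0.3)
    timestamps: list[float] = []
    start = time.monotonic()
    # Nine concurrent tasks all share one limiter instance.
    await asyncio.gather(*(worker(limiter, timestamps) for _ in range(9)))
    # Count acquisitions per 0.3-second window.
    buckets: dict[int, int] = {}
    for t in timestamps:
        window = int((t - start) / 0.3)
        buckets[window] = buckets.get(window, 0) + 1
    return buckets


buckets = asyncio.run(burst_check())
print(buckets)  # no window should exceed 3
```

All nine requests complete, but no window ever holds more than three: the spike flattens into batches without the worker code knowing anything about rate limits.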
Next Steps
- Add per-model and per-provider limiters so GPT-4o and embeddings don’t compete for the same budget.
- Replace the simple list-based limiter with Redis when you need distributed rate limiting across workers.
- Combine this with circuit breakers and structured retries for cleaner failure handling under sustained quota pressure.
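The per-model idea from the first bullet can be sketched as a small registry keyed by (provider, model). Everything here is illustrative: the model names, the limits, and the placeholder limiter (swap in the sliding-window limiter from this tutorial) are assumptions, not provider-published quotas.

```python
class RateLimiter:
    # Placeholder; substitute the async sliding-window limiter from this tutorial.
    def __init__(self, max_calls: int, period_seconds: float):
        self.max_calls = max_calls
        self.period_seconds = period_seconds


class LimiterRegistry:
    # One limiter per (provider, model), created lazily from a config table.
    LIMITS = {
        ("openai", "gpt-4o-mini"): (60, 60.0),
        ("openai", "text-embedding-3-small"): (300, 60.0),
    }
    DEFAULT = (30, 60.0)  # conservative fallback for unlisted models

    def __init__(self):
        self._limiters: dict[tuple[str, str], RateLimiter] = {}

    def get(self, provider: str, model: str) -> RateLimiter:
        key = (provider, model)
        if key not in self._limiters:
            max_calls, period = self.LIMITS.get(key, self.DEFAULT)
            self._limiters[key] = RateLimiter(max_calls, period)
        return self._limiters[key]


registry = LimiterRegistry()
chat = registry.get("openai", "gpt-4o-mini")
emb = registry.get("openai", "text-embedding-3-small")
print(chat is registry.get("openai", "gpt-4o-mini"))  # same instance reused
print(chat.max_calls, emb.max_calls)  # separate budgets
```

Because the registry hands back the same limiter instance for repeated lookups, every client wrapper for a given model shares one budget, while chat and embedding traffic stop competing with each other.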
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.