AutoGen Tutorial (Python): rate limiting API calls for beginners

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to rate limit API calls in an AutoGen-based Python agent workflow using a simple, production-friendly wrapper. You need this when your agent can trigger too many requests too quickly and you want to avoid 429 errors, wasted retries, or getting your API key throttled.

What You'll Need

  • Python 3.10+
  • pyautogen installed
  • A valid LLM API key, set as an environment variable
  • Basic AutoGen knowledge: AssistantAgent, UserProxyAgent, and initiate_chat
  • An API you want to protect with rate limiting
  • Optional: python-dotenv if you prefer loading secrets from a .env file

Install the package:

pip install pyautogen

Step-by-Step

  1. Start with a minimal AutoGen setup.
    We’ll use one assistant agent and one user proxy agent. The important part is that the tool call will go through a rate-limited wrapper instead of calling the API directly.
import os
import time
from collections import deque

from autogen import AssistantAgent, UserProxyAgent

llm_config = {
    "model": "gpt-4o-mini",
    "api_key": os.environ["OPENAI_API_KEY"],
}

assistant = AssistantAgent(
    name="assistant",
    llm_config=llm_config,
)

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
)
  2. Build a small rate limiter around your API call.
    This version allows at most 3 calls every 10 seconds. It is simple, deterministic, and easy to reason about in production logs.
class RateLimiter:
    def __init__(self, max_calls: int, period_seconds: int):
        self.max_calls = max_calls
        self.period_seconds = period_seconds
        self.calls = deque()

    def wait(self):
        now = time.time()
        while self.calls and now - self.calls[0] >= self.period_seconds:
            self.calls.popleft()

        if len(self.calls) >= self.max_calls:
            sleep_for = self.period_seconds - (now - self.calls[0])
            time.sleep(max(0, sleep_for))
            now = time.time()
            while self.calls and now - self.calls[0] >= self.period_seconds:
                self.calls.popleft()

        self.calls.append(now)
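If you want to convince yourself the sliding-window logic is right without waiting real seconds, you can reimplement the same algorithm with an injectable clock. The `FakeClock` harness below is a hypothetical test helper, not part of the tutorial's runtime code:

```python
from collections import deque

class FakeClockLimiter:
    """Same sliding-window logic as RateLimiter, but the clock is injected
    so the behavior can be checked deterministically, with no real delays."""

    def __init__(self, max_calls: int, period_seconds: float, clock):
        self.max_calls = max_calls
        self.period_seconds = period_seconds
        self.calls = deque()
        self.clock = clock

    def wait(self):
        now = self.clock.now()
        while self.calls and now - self.calls[0] >= self.period_seconds:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            self.clock.sleep(max(0.0, self.period_seconds - (now - self.calls[0])))
            now = self.clock.now()
            while self.calls and now - self.calls[0] >= self.period_seconds:
                self.calls.popleft()
        self.calls.append(now)

class FakeClock:
    """Advances simulated time instead of actually sleeping."""

    def __init__(self):
        self.t = 0.0
        self.slept = []

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.slept.append(seconds)
        self.t += max(0.0, seconds)

clock = FakeClock()
limiter = FakeClockLimiter(max_calls=3, period_seconds=10, clock=clock)
for _ in range(4):
    limiter.wait()
# The first three calls pass immediately; only the fourth has to sleep.
print(clock.slept)  # -> [10.0]
```

This is exactly the same pruning-and-sleeping flow as the class above, which is why it is a useful place to catch off-by-one mistakes in the window arithmetic before they show up as slow tests.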
  3. Wrap the external API call with the limiter.
    In real projects this function would call Stripe, Salesforce, your internal policy service, or another LLM endpoint. For beginners, this example just simulates an outbound request so you can see the pacing clearly.
limiter = RateLimiter(max_calls=3, period_seconds=10)

def limited_api_call(payload: str) -> str:
    limiter.wait()
    print(f"Calling API with payload: {payload}")
    return f"processed:{payload}"
  4. Expose the wrapper as an AutoGen tool function.
    AutoGen can call normal Python functions directly when you register them on the user proxy. This keeps rate limit enforcement outside the model and inside your application code, where it belongs. Note that the model only emits tool calls it knows about, so depending on your pyautogen version you may also need to advertise the function to the assistant (for example via a `functions` entry in `llm_config`, or with `autogen.register_function`).
def fetch_customer_record(customer_id: str) -> str:
    return limited_api_call(f"customer_id={customer_id}")

user_proxy.register_function(
    function_map={
        "fetch_customer_record": fetch_customer_record,
    }
)
  5. Ask the assistant to use the tool multiple times.
    The assistant will generate tool calls, and each one goes through your limiter before hitting the underlying API logic. If calls happen too fast, the wrapper sleeps until the window opens again.
task = """
Fetch customer records for IDs 101, 102, 103, 104, and 105.
Use fetch_customer_record for each one.
Return a short summary of what you got back.
"""

user_proxy.initiate_chat(
    assistant,
    message=task,
)
  6. Add throttle logging if you want something closer to production behavior.
    A plain sleep is fine for learning, but in real systems you usually want structured logs so you can see when throttling happens and why latency increased.
def limited_api_call(payload: str) -> str:
    # Drop-in replacement for the earlier limited_api_call.
    before = time.time()
    limiter.wait()
    waited = time.time() - before

    # Ignore sub-10ms scheduling jitter so the log only shows real throttling.
    if waited > 0.01:
        print(f"[rate-limit] waited {waited:.2f}s before calling API")

    print(f"[api] payload={payload}")
    return f"processed:{payload}"
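One more production concern worth knowing about: the `deque`-based limiter above is not thread-safe. If several worker threads ever share one limiter instance, a lock keeps the window accounting consistent. Here is a sketch of that variant (the class name is my own; the window logic is unchanged):

```python
import threading
import time
from collections import deque

class ThreadSafeRateLimiter:
    """Sliding-window limiter guarded by a lock, for the case where
    multiple worker threads share one limiter instance."""

    def __init__(self, max_calls: int, period_seconds: float):
        self.max_calls = max_calls
        self.period_seconds = period_seconds
        self.calls = deque()
        self._lock = threading.Lock()

    def wait(self):
        with self._lock:
            now = time.time()
            while self.calls and now - self.calls[0] >= self.period_seconds:
                self.calls.popleft()
            if len(self.calls) >= self.max_calls:
                # Sleeping while holding the lock deliberately serializes
                # callers: no thread can slip past the window meanwhile.
                time.sleep(max(0.0, self.period_seconds - (now - self.calls[0])))
                now = time.time()
                while self.calls and now - self.calls[0] >= self.period_seconds:
                    self.calls.popleft()
            self.calls.append(now)
```

Sleeping inside the lock is a deliberate trade-off: it is the simplest way to guarantee the global limit, at the cost of making waiting threads queue up behind the sleeper.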

Testing It

Run the script and watch the console output. You should see at most 3 immediate calls, then a pause before the next ones continue.

If you want to verify the behavior more aggressively, lower period_seconds to 5 and set max_calls to 2. That makes throttling obvious without a long wait.

The key thing to check is that your application never sends more than the allowed number of requests inside the time window. If you replace limited_api_call() with a real vendor SDK call later, keep this same wrapper in place.
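One way to automate that check is to inline the limiter with a shrunken window and assert on elapsed wall-clock time. The thresholds below are illustrative, with a little slack for scheduler jitter:

```python
import time
from collections import deque

# Same RateLimiter as in the tutorial, inlined so this check runs standalone.
class RateLimiter:
    def __init__(self, max_calls: int, period_seconds: float):
        self.max_calls = max_calls
        self.period_seconds = period_seconds
        self.calls = deque()

    def wait(self):
        now = time.time()
        while self.calls and now - self.calls[0] >= self.period_seconds:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            time.sleep(max(0, self.period_seconds - (now - self.calls[0])))
            now = time.time()
            while self.calls and now - self.calls[0] >= self.period_seconds:
                self.calls.popleft()
        self.calls.append(now)

limiter = RateLimiter(max_calls=2, period_seconds=0.2)
start = time.time()
for _ in range(4):
    limiter.wait()
elapsed = time.time() - start

# Four calls at two-per-0.2s must span at least one full window.
assert elapsed >= 0.15, f"limiter did not throttle (took {elapsed:.3f}s)"
print(f"ok: 4 calls took {elapsed:.2f}s")
```

A check like this makes a good regression test: if someone later "optimizes" the limiter and breaks the window math, the assertion fails immediately.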

Next Steps

  • Add exponential backoff for vendor-side 429 Too Many Requests responses.
  • Move from in-memory limiting to Redis if you run multiple worker processes.
  • Add per-user or per-tenant limits so one chat session cannot starve others.
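The first of those next steps can be sketched in a few lines. The `RateLimitError` type below is a hypothetical stand-in for whatever 429 exception your vendor SDK raises, and the injectable `sleep` parameter exists so the retry schedule can be tested without real delays:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a vendor SDK's '429 Too Many Requests' exception."""

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff.

    The delay doubles each attempt, plus a little jitter so a fleet of
    workers does not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller decide what to do
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay / 10)
            sleep(delay)
```

Backoff composes naturally with the limiter from this tutorial: the limiter paces your outgoing requests, and backoff handles the cases where the vendor throttles you anyway.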

By Cyprian Aarons, AI Consultant at Topiax.
