AutoGen Tutorial (Python): rate limiting API calls for intermediate developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows how to add rate limiting to AutoGen-based Python agents so your app stops hammering external APIs and starts behaving like a production service. You need this when your agent can trigger bursts of tool calls, or when you’re paying for APIs that enforce strict per-minute quotas.

What You'll Need

  • Python 3.10+
  • pyautogen installed
  • requests installed
  • An OpenAI-compatible API key if you want to run an LLM-backed AutoGen agent
  • A target API endpoint to call from a tool function
  • Basic familiarity with AssistantAgent, UserProxyAgent, and tool registration in AutoGen

Install the packages:

pip install pyautogen requests

Step-by-Step

  1. Start by defining a small rate limiter that tracks timestamps in memory. This example uses a sliding window, which is simple and good enough for most single-process agent apps.
import time
from collections import deque


class RateLimiter:
    """Sliding-window limiter: allow at most max_calls per period_seconds."""

    def __init__(self, max_calls: int, period_seconds: int):
        self.max_calls = max_calls
        self.period_seconds = period_seconds
        self.calls = deque()  # timestamps of recent calls, oldest first

    def wait(self) -> None:
        now = time.time()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] > self.period_seconds:
            self.calls.popleft()

        # Window is full: sleep until the oldest call expires, then free its slot.
        if len(self.calls) >= self.max_calls:
            sleep_for = self.period_seconds - (now - self.calls[0])
            time.sleep(max(0, sleep_for))
            self.calls.popleft()

        self.calls.append(time.time())
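Before wiring the limiter into an agent, you can sanity-check it in isolation. The tiny limits below are throwaway test values, not production settings:

test_limiter = RateLimiter(max_calls=3, period_seconds=2)

for i in range(5):
    test_limiter.wait()
    # The first three calls print immediately; the fourth stalls about 2 seconds.
    print(f"call {i} at {time.time():.2f}")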
  2. Wrap every external API call behind a tool function. AutoGen will call this function when the assistant decides it needs the data, so the limiter belongs here rather than inside the agent itself.
import requests

limiter = RateLimiter(max_calls=3, period_seconds=10)


def fetch_json(url: str) -> dict:
    limiter.wait()
    response = requests.get(url, timeout=15)
    response.raise_for_status()
    return response.json()
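If you wrap several tool functions, you can factor the limiter call into a small decorator instead of repeating it in each body. This is plain Python rather than an AutoGen feature, and fetch_text is a hypothetical second tool shown only for illustration:

import functools


def rate_limited(limiter: RateLimiter):
    def decorator(func):
        @functools.wraps(func)  # preserves the name and annotations AutoGen inspects
        def wrapper(*args, **kwargs):
            limiter.wait()  # throttle before every call
            return func(*args, **kwargs)
        return wrapper
    return decorator


@rate_limited(limiter)
def fetch_text(url: str) -> str:
    response = requests.get(url, timeout=15)
    response.raise_for_status()
    return response.text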
  3. Register the tool with an AutoGen assistant and a user proxy. This gives the model access to the function while keeping execution on the Python side where you can control rate limits and retries.
from autogen import AssistantAgent, UserProxyAgent

llm_config = {
    "config_list": [
        {
            "model": "gpt-4o-mini",
            "api_key": "YOUR_OPENAI_API_KEY",
        }
    ]
}

assistant = AssistantAgent(
    name="assistant",
    llm_config=llm_config,
)

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
)

user_proxy.register_for_execution(name="fetch_json")(fetch_json)
assistant.register_for_llm(name="fetch_json", description="Fetch JSON from a URL")(fetch_json)
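pyautogen also supports stacking these registrations as decorators where the tool is defined, which keeps the name, description, and limiter logic in one place. This is an alternative to the two explicit calls above, so use one style or the other for a given function:

@user_proxy.register_for_execution()
@assistant.register_for_llm(description="Fetch JSON from a URL")
def fetch_json(url: str) -> dict:
    limiter.wait()
    response = requests.get(url, timeout=15)
    response.raise_for_status()
    return response.json()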
  4. Give the assistant a task that forces it to use the tool multiple times. If the rate limiter is working, you should see pauses once the call threshold is reached instead of a burst of immediate requests.
message = """
Use fetch_json on these URLs one by one and summarize each response:
https://jsonplaceholder.typicode.com/todos/1
https://jsonplaceholder.typicode.com/todos/2
https://jsonplaceholder.typicode.com/todos/3
https://jsonplaceholder.typicode.com/todos/4
"""

user_proxy.initiate_chat(assistant, message=message)
  5. If you want stronger production behavior, add retry handling for transient failures and keep rate limiting separate from retries. That keeps your limiter honest: it controls throughput, while retries handle network noise.
from requests import RequestException


def fetch_json_with_retry(url: str, retries: int = 3) -> dict:
    last_error = None

    for attempt in range(retries):
        try:
            limiter.wait()  # every attempt counts against the rate limit
            response = requests.get(url, timeout=15)
            response.raise_for_status()
            return response.json()
        except RequestException as exc:
            last_error = exc
            if attempt < retries - 1:
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...

    raise last_error
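
Many rate-limited APIs also return HTTP 429 with a Retry-After header, and honoring that hint is friendlier than blind backoff. A helper like the one below could replace the fixed 2 ** attempt delay; it is a sketch, and it ignores the HTTP-date form of Retry-After that some servers send:

def backoff_seconds(response: requests.Response, attempt: int) -> float:
    retry_after = response.headers.get("Retry-After")
    if retry_after is not None:
        try:
            return float(retry_after)  # the server said exactly how long to wait
        except ValueError:
            pass  # Retry-After was an HTTP date; fall through to exponential backoff
    return float(2 ** attempt)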

Testing It

Run the chat once and watch how long it takes to complete four calls with a limit of three calls per 10 seconds. You should see the fourth request delayed until the window opens again. Add simple logging inside RateLimiter.wait() if you want proof that sleeping is happening at the right time.
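
One low-touch way to get that logging is a subclass that times each call, so you don't have to edit wait() itself. LoggingRateLimiter is a sketch, not part of the tutorial code above:

import logging

logging.basicConfig(level=logging.INFO)


class LoggingRateLimiter(RateLimiter):
    def wait(self) -> None:
        start = time.time()
        super().wait()
        waited = time.time() - start
        if waited > 0.01:  # skip the no-sleep fast path
            logging.info("rate limiter slept %.2fs", waited)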

For a more realistic test, lower max_calls to 1 and use two or three URLs. That makes the delay obvious and helps confirm that all outbound traffic is going through your wrapped tool function instead of bypassing it.

Next Steps

  • Move the limiter state into Redis if you run multiple agent workers or multiple processes; a rough sketch follows this list.
  • Add per-tool limits so expensive APIs get stricter caps than cheap internal services.
  • Combine this with circuit breakers and structured retries for cleaner failure handling in production agent workflows.
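
For the Redis option, a sorted set of timestamps gives you a shared sliding window across processes. This is a rough sketch assuming the redis package (pip install redis) and a reachable Redis server; the check-then-add sequence below is not atomic, so a production version would wrap it in a Lua script:

import time
import uuid

import redis

r = redis.Redis()


def acquire(key: str, max_calls: int, period_seconds: int) -> None:
    while True:
        now = time.time()
        pipe = r.pipeline()
        pipe.zremrangebyscore(key, 0, now - period_seconds)  # drop expired timestamps
        pipe.zcard(key)  # count calls still inside the window
        _, current = pipe.execute()
        if current < max_calls:
            r.zadd(key, {str(uuid.uuid4()): now})  # unique member, timestamp as score
            return
        time.sleep(0.1)  # crude polling; fine for a sketch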

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
