AutoGen Tutorial (Python): implementing retry logic for beginners

By Cyprian Aarons, updated 2026-04-21

This tutorial shows you how to add retry logic to an AutoGen Python agent workflow so transient failures do not break the whole conversation. You need this when model calls fail intermittently, a tool endpoint times out, or you want your agent loop to recover cleanly instead of crashing on the first error.

What You'll Need

  • Python 3.10+
  • autogen-agentchat
  • autogen-ext
  • An OpenAI API key set as OPENAI_API_KEY
  • Basic familiarity with AutoGen agents and model clients
  • A terminal and a virtual environment

Install the packages first:

pip install autogen-agentchat autogen-ext openai

Step-by-Step

  1. Start with a minimal assistant agent and a model client.
    This gives you the agent whose calls you will wrap in retries when something fails.
import os
from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.openai import OpenAIChatCompletionClient

model_client = OpenAIChatCompletionClient(
    model="gpt-4o-mini",
    api_key=os.environ["OPENAI_API_KEY"],
)

agent = AssistantAgent(
    name="assistant",
    model_client=model_client,
)
  2. Add a small retry helper around the async call.
    The pattern here is simple: try the agent call, catch transient exceptions, wait, then try again with backoff.
import asyncio

async def run_with_retry(agent, task: str, max_attempts: int = 3) -> str:
    delay = 1.0

    for attempt in range(1, max_attempts + 1):
        try:
            result = await agent.run(task=task)
            return result.messages[-1].content
        except Exception as e:
            if attempt == max_attempts:
                raise
            print(f"Attempt {attempt} failed: {e}. Retrying in {delay}s...")
            await asyncio.sleep(delay)
            delay *= 2
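To get a feel for how long a doubling backoff like this can block before giving up, the sleep schedule can be worked out directly. The sketch below is illustrative only: `backoff_schedule` is a hypothetical helper name, not part of AutoGen, and it assumes the same starting delay of 1.0 seconds and the same doubling factor as the helper in this step.

```python
def backoff_schedule(
    max_attempts: int = 3,
    initial_delay: float = 1.0,
    factor: float = 2.0,
) -> list[float]:
    """Return the sleep durations between attempts.

    There is no sleep after the final attempt, so the list has
    max_attempts - 1 entries.
    """
    delays: list[float] = []
    delay = initial_delay
    for _ in range(max_attempts - 1):
        delays.append(delay)
        delay *= factor
    return delays


schedule = backoff_schedule(max_attempts=3)
print(schedule, sum(schedule))  # [1.0, 2.0] 3.0
```

With three attempts the worst case adds 3 seconds of waiting on top of the time spent in the calls themselves; with five attempts it grows to 15 seconds, which is worth keeping in mind before raising `max_attempts`.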
  3. Make the retry logic more selective.
    In production, you should not retry every exception. Retry timeouts, rate limits, and network errors; fail fast on bad inputs or coding mistakes.
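import httpx
import openai

def is_retryable_exception(exc: Exception) -> bool:
    # The OpenAI SDK raises its own exception types for transport failures and
    # rate limits, so they are listed next to the httpx and built-in ones.
    # This assumes the SDK exception reaches the caller unchanged.
    # OSError is deliberately left out: it also covers FileNotFoundError and
    # PermissionError, which are not transient.
    retryable_types: tuple[type[Exception], ...] = (
        openai.APIConnectionError,  # also covers openai.APITimeoutError
        openai.RateLimitError,
        openai.InternalServerError,
        httpx.TimeoutException,
        httpx.NetworkError,
        ConnectionError,
        TimeoutError,
    )
    return isinstance(exc, retryable_types)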

async def run_with_selective_retry(agent, task: str, max_attempts: int = 3) -> str:
    delay = 1.0

    for attempt in range(1, max_attempts + 1):
        try:
            result = await agent.run(task=task)
            return result.messages[-1].content
        except Exception as e:
            if not is_retryable_exception(e) or attempt == max_attempts:
                raise
            print(f"Retryable failure on attempt {attempt}: {e}")
            await asyncio.sleep(delay)
            delay *= 2
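When choosing which built-in exception types count as retryable, it helps to check how they relate to each other. The short sketch below uses only the standard library to show that `OSError` is a broad base class: it is the parent of `TimeoutError` and `ConnectionError`, but also of file-system errors that will not go away on a retry. `is_transient_builtin` is a hypothetical name used here for illustration.

```python
# Built-in exception hierarchy checks (standard library only).
print(issubclass(TimeoutError, OSError))       # True
print(issubclass(ConnectionError, OSError))    # True
print(issubclass(FileNotFoundError, OSError))  # True
print(issubclass(PermissionError, OSError))    # True


def is_transient_builtin(exc: BaseException) -> bool:
    """Hypothetical narrower check: only network-style built-ins are transient."""
    return isinstance(exc, (ConnectionError, TimeoutError))


print(is_transient_builtin(FileNotFoundError("config.json")))  # False
print(is_transient_builtin(ConnectionResetError()))            # True
```

Listing `ConnectionError` and `TimeoutError` explicitly, rather than their common parent, keeps a missing file or a permissions problem from being retried as if it were a network blip.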
  4. Wrap the whole thing in a runnable script entry point.
    This makes it easy to test from the command line and keeps your retry behavior isolated from the rest of your app.
import asyncio

async def main() -> None:
    answer = await run_with_selective_retry(
        agent,
        "Write one sentence explaining what retry logic does in an AI agent.",
        max_attempts=3,
    )
    print("\nFinal answer:")
    print(answer)

if __name__ == "__main__":
    asyncio.run(main())
  5. If you are calling tools inside AutoGen, put retries at the boundary that fails most often.
    For API calls, wrap the tool function itself so the agent can keep going even if one external dependency flakes out.
import asyncio
import random

def unreliable_tool() -> str:
    # Fails roughly 70% of the time to simulate a flaky dependency.
    if random.random() < 0.7:
        raise TimeoutError("Simulated timeout")
    return "Tool succeeded"

async def tool_with_retry(max_attempts: int = 3) -> str:
    delay = 0.5

    for attempt in range(1, max_attempts + 1):
        try:
            return unreliable_tool()
        except TimeoutError as e:
            if attempt == max_attempts:
                raise
            print(f"Tool failed: {e}. Retrying...")
            # The sleep must be awaited; a bare asyncio.sleep() call only
            # creates a coroutine and never pauses.
            await asyncio.sleep(delay)
            delay *= 2
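If several tools need the same treatment, the retry loop can be factored into a decorator so each tool function stays focused on its own work. This is a minimal sketch under stated assumptions, not an AutoGen feature: `with_retry` and `flaky_lookup` are hypothetical names, the tools are assumed to be async functions, and only built-in `TimeoutError` and `ConnectionError` are treated as retryable by default.

```python
import asyncio
import functools
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


def with_retry(
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: tuple[type[BaseException], ...] = (TimeoutError, ConnectionError),
):
    """Hypothetical decorator: retry an async function with doubling backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def decorator(func: Callable[..., Awaitable[T]]) -> Callable[..., Awaitable[T]]:
        @functools.wraps(func)  # keep the original name and docstring
        async def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except retry_on:
                    # Out of attempts: let the last error propagate.
                    if attempt == max_attempts:
                        raise
                    await asyncio.sleep(delay)
                    delay *= 2

        return wrapper

    return decorator


calls = {"count": 0}


@with_retry(max_attempts=3, base_delay=0.01)
async def flaky_lookup(city: str) -> str:
    """Simulated tool that times out on its first two calls."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("Simulated timeout")
    return f"Weather for {city}: sunny"


print(asyncio.run(flaky_lookup("Oslo")))  # Weather for Oslo: sunny
```

Exceptions outside `retry_on` are not caught at all, so a bug inside the tool still surfaces on the first call.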

Testing It

Run the script with a valid OPENAI_API_KEY and confirm you get a final assistant response instead of an immediate crash. Then temporarily break connectivity and check that your retry logic prints each attempt before giving up. An invalid model name is a useful counter-check: it is not a transient failure, so it should fail immediately without any retry attempts.

If you added tool retries, force the tool to fail a few times and verify it eventually succeeds or exits after max_attempts. The important check is that only retryable failures are retried; programming errors should still surface immediately.
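These checks can also be run without any network access by swapping in a stub agent. The sketch below is an assumption-laden stand-in rather than real AutoGen code: `FlakyAgent`, `FakeResult`, and `FakeMessage` are hypothetical classes that only mimic the `run(task=...)` call and the `messages[-1].content` shape used in this tutorial, and `selective_retry` is a trimmed copy of the selective helper that recognises built-in exception types only.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeMessage:
    content: str


@dataclass
class FakeResult:
    messages: list[FakeMessage]


@dataclass
class FlakyAgent:
    """Hypothetical stub: raises `error` for the first `failures` calls."""
    failures: int
    error: Exception
    calls: int = 0

    async def run(self, task: str) -> FakeResult:
        self.calls += 1
        if self.calls <= self.failures:
            raise self.error
        return FakeResult(messages=[FakeMessage(content=f"done: {task}")])


async def selective_retry(agent, task: str, max_attempts: int = 3, delay: float = 0.01) -> str:
    # Trimmed copy of the tutorial helper, with a short delay for fast tests.
    for attempt in range(1, max_attempts + 1):
        try:
            result = await agent.run(task=task)
            return result.messages[-1].content
        except Exception as e:
            retryable = isinstance(e, (TimeoutError, ConnectionError))
            if not retryable or attempt == max_attempts:
                raise
            await asyncio.sleep(delay)
            delay *= 2


async def check() -> dict:
    outcome: dict = {}

    # Transient failures: two timeouts, then success on the third call.
    transient = FlakyAgent(failures=2, error=TimeoutError("simulated"))
    outcome["answer"] = await selective_retry(transient, "ping")
    outcome["transient_calls"] = transient.calls

    # Non-retryable failure: should surface after a single call.
    buggy = FlakyAgent(failures=1, error=ValueError("bad input"))
    try:
        await selective_retry(buggy, "ping")
    except ValueError:
        outcome["buggy_calls"] = buggy.calls
    return outcome


outcome = asyncio.run(check())
print(outcome)  # {'answer': 'done: ping', 'transient_calls': 3, 'buggy_calls': 1}
```

Three calls for the transient case and exactly one for the `ValueError` case is the behavior to look for: retryable failures are retried, everything else fails fast.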

Next Steps

  • Add structured logging so each retry includes request IDs, exception type, and elapsed time.
  • Add jitter to the exponential backoff so that many clients do not retry in lockstep under production traffic.
  • Learn how to combine retries with circuit breakers so repeated upstream failures do not hammer a broken dependency.

By Cyprian Aarons, AI Consultant at Topiax.
