AutoGen Tutorial (Python): implementing retry logic for intermediate developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows how to add retry logic to an AutoGen Python agent workflow so transient failures don’t kill the conversation. You need this when model calls, tool calls, or network requests fail intermittently and you want controlled retries instead of brittle one-shot execution.

What You'll Need

  • Python 3.10+
  • autogen-agentchat
  • autogen-ext with the OpenAI extension (autogen-ext[openai])
  • An OpenAI API key set as OPENAI_API_KEY
  • Basic familiarity with AutoGen agents, messages, and async Python
  • A terminal for running the example

Install the packages:

pip install -U "autogen-agentchat" "autogen-ext[openai]"

Step-by-Step

  1. Start with a minimal agent setup using the real AutoGen APIs. The key idea is to keep the agent stateless enough that you can safely rerun a failed turn.
import asyncio
import os

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import TextMessage
from autogen_core import CancellationToken
from autogen_ext.models.openai import OpenAIChatCompletionClient

async def main() -> None:
    model_client = OpenAIChatCompletionClient(
        model="gpt-4o-mini",
        api_key=os.environ["OPENAI_API_KEY"],
    )

    agent = AssistantAgent(
        name="support_agent",
        model_client=model_client,
        system_message="You are a concise support assistant.",
    )

    result = await agent.on_messages(
        [TextMessage(content="Summarize retry logic in one sentence.", source="user")],
        cancellation_token=CancellationToken(),
    )
    print(result.chat_message.content)

if __name__ == "__main__":
    asyncio.run(main())
  2. Add a retry wrapper around the agent call. For production use, retry the entire turn only on transient exceptions, and keep the backoff bounded.
import asyncio
import random

from autogen_core import CancellationToken

# Exceptions that are safe to retry; extend this tuple for your environment.
RETRYABLE_ERRORS = (TimeoutError, ConnectionError)

async def call_with_retry(agent, messages, max_attempts: int = 3):
    for attempt in range(1, max_attempts + 1):
        try:
            # Use a fresh cancellation token for each attempt.
            return await agent.on_messages(messages, cancellation_token=CancellationToken())
        except RETRYABLE_ERRORS as e:
            if attempt == max_attempts:
                raise
            # Exponential backoff capped at 8s, plus jitter to avoid retry storms.
            delay = min(2 ** attempt, 8) + random.uniform(0, 0.5)
            print(f"Attempt {attempt} failed: {e}. Retrying in {delay:.2f}s")
            await asyncio.sleep(delay)

    # Only reachable if max_attempts < 1; fail loudly instead of silently returning.
    raise ValueError("max_attempts must be at least 1")
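TimeoutError and ConnectionError cover plain network hiccups, but model calls usually surface provider-specific exceptions. If you are on the openai Python SDK v1+, a broader tuple might look like this (check the exact names against your installed SDK version):
import openai

# Rate limits, connection drops, timeouts, and 5xx responses are usually transient.
RETRYABLE_ERRORS = (
    TimeoutError,
    ConnectionError,
    openai.APIConnectionError,
    openai.APITimeoutError,
    openai.RateLimitError,
    openai.InternalServerError,
)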
  3. Wire the retry wrapper into your main flow. This keeps your business logic clean and makes retries explicit at the boundary where failures happen.
import asyncio
import os

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import TextMessage
from autogen_ext.models.openai import OpenAIChatCompletionClient

async def main() -> None:
    model_client = OpenAIChatCompletionClient(
        model="gpt-4o-mini",
        api_key=os.environ["OPENAI_API_KEY"],
    )

    agent = AssistantAgent(
        name="support_agent",
        model_client=model_client,
        system_message="You are a concise support assistant.",
    )

    messages = [TextMessage(content="Explain exponential backoff.", source="user")]
    result = await call_with_retry(agent, messages, max_attempts=3)
    print(result.chat_message.content)

if __name__ == "__main__":
    asyncio.run(main())
  4. If your agent uses tools, wrap tool execution too. Tool failures are often more common than model failures, especially when calling internal services or databases.
async def safe_tool_call(tool_func, *args, max_attempts: int = 3, **kwargs):
    # Generic retry wrapper for any async tool function.
    for attempt in range(1, max_attempts + 1):
        try:
            return await tool_func(*args, **kwargs)
        except RETRYABLE_ERRORS as e:
            if attempt == max_attempts:
                raise
            delay = min(2 ** attempt, 8)
            print(f"Tool attempt {attempt} failed: {e}. Retrying in {delay}s")
            await asyncio.sleep(delay)

async def flaky_lookup(customer_id: str) -> str:
    # Simulated backend call: customer "123" always hits a transient error.
    if customer_id == "123":
        raise ConnectionError("Temporary backend issue")
    return f"Customer {customer_id} is active"

async def demo_tool_retry() -> None:
    # "456" succeeds immediately; try "123" to watch retries exhaust and re-raise.
    result = await safe_tool_call(flaky_lookup, "456")
    print(result)
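If you expose the lookup to the agent as a tool, one option is to keep the retry inside the tool function so the model only ever sees the final outcome. This sketch assumes AssistantAgent's tools parameter, which accepts plain async functions in recent AutoGen releases; lookup_customer is a name introduced here:
async def lookup_customer(customer_id: str) -> str:
    """Look up a customer's status in the backend."""
    # Retries stay inside the tool, so the agent only sees the final result or error.
    return await safe_tool_call(flaky_lookup, customer_id)

# Inside main(), register the wrapped function when constructing the agent.
agent = AssistantAgent(
    name="support_agent",
    model_client=model_client,
    system_message="You are a concise support assistant.",
    tools=[lookup_customer],
)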
  5. Keep retries idempotent by storing request context outside the retry loop. If you mutate shared state inside a failed turn, a retry can duplicate side effects like tickets created or records written.
from dataclasses import dataclass

@dataclass(frozen=True)
class RequestContext:
    request_id: str
    user_id: str

async def handle_support_request(agent, ctx: RequestContext, question: str):
    messages = [
        TextMessage(
            content=f"[request_id={ctx.request_id}] [user_id={ctx.user_id}] {question}",
            source="user",
        )
    ]
    return await call_with_retry(agent, messages)

# Use stable IDs from your app layer before calling the agent.
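For example, a caller might mint the ID once per incoming request and reuse it across every retried attempt. uuid4 here is just a convenient stand-in for whatever stable ID your app layer already has:
import uuid

async def demo_idempotent_request(agent) -> None:
    # The ID is created once, outside the retry loop, so every retried attempt reuses it.
    ctx = RequestContext(request_id=str(uuid.uuid4()), user_id="user-42")
    result = await handle_support_request(agent, ctx, "Why was my invoice charged twice?")
    print(result.chat_message.content)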

Testing It

Run the script with a valid OPENAI_API_KEY and confirm the assistant returns a normal response on the first try. Then simulate failures by temporarily raising ConnectionError inside flaky_lookup (or just before the on_messages call in call_with_retry) and verify that retries happen with increasing delays.

Also test the failure path by setting max_attempts=1 and confirming exceptions propagate immediately. In production, log the attempt count and exception type so you can distinguish transient issues from real outages.
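
To exercise the retry path without spending API calls, you can also point call_with_retry at a stand-in object that fails twice and then succeeds. FlakyAgent below is invented purely for this test:
import asyncio
from types import SimpleNamespace

class FlakyAgent:
    """Stand-in for AssistantAgent: fails twice, then returns a canned reply."""

    def __init__(self) -> None:
        self.calls = 0

    async def on_messages(self, messages, cancellation_token=None):
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError(f"simulated outage on call {self.calls}")
        return SimpleNamespace(chat_message=SimpleNamespace(content="ok"))

async def test_retry() -> None:
    agent = FlakyAgent()
    result = await call_with_retry(agent, [], max_attempts=3)
    assert result.chat_message.content == "ok"
    print(f"Succeeded after {agent.calls} attempts")

asyncio.run(test_retry())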

Next Steps

  • Add structured logging with request IDs and attempt counters.
  • Classify errors more precisely using provider-specific exception types.
  • Move retry policy into a shared utility so all agents and tools use the same backoff rules; a minimal sketch follows below.
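
As a starting point for that shared utility, one possible shape is a small policy object that both call_with_retry and safe_tool_call could delegate to. RetryPolicy and its fields are illustrative names, not AutoGen APIs:
import asyncio
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 3
    base_delay: float = 1.0
    max_delay: float = 8.0
    jitter: float = 0.5

    async def run(self, func, *args, **kwargs):
        # Generic retry loop shared by agent turns and tool calls alike.
        for attempt in range(1, self.max_attempts + 1):
            try:
                return await func(*args, **kwargs)
            except RETRYABLE_ERRORS:
                if attempt == self.max_attempts:
                    raise
                delay = min(self.base_delay * 2 ** attempt, self.max_delay)
                await asyncio.sleep(delay + random.uniform(0, self.jitter))
        raise ValueError("max_attempts must be at least 1")

DEFAULT_POLICY = RetryPolicy()
# Example: result = await DEFAULT_POLICY.run(agent.on_messages, messages, cancellation_token=CancellationToken())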

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.
