AutoGen Tutorial (Python): implementing retry logic for advanced developers
This tutorial shows how to add retry logic to an AutoGen agent workflow in Python without turning your orchestration code into a mess. You need this when model calls fail transiently, tools time out, or a downstream API returns a rate-limit error and you want the conversation to recover cleanly.
What You'll Need
- Python 3.10+
- `autogen-agentchat`
- `autogen-ext`
- An OpenAI API key
- Basic familiarity with `AssistantAgent`, `UserProxyAgent`, and tool registration in AutoGen
- A shell environment where you can set `OPENAI_API_KEY`
Step-by-Step
- Start by installing the packages and setting up your environment. For this pattern, use AutoGen’s agentchat and OpenAI model client packages so the retry wrapper sits outside the agent internals.

```shell
pip install "autogen-agentchat" "autogen-ext[openai]"
export OPENAI_API_KEY="your-key-here"
```
- Build a small helper that retries failed agent runs with exponential backoff. The key idea is to catch transient exceptions around `run()` and re-execute the same request with a short delay, instead of trying to make AutoGen itself responsible for recovery.
```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def run_with_retry(
    fn: Callable[[], Awaitable[T]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
) -> T:
    last_error: Exception | None = None
    for attempt in range(1, max_attempts + 1):
        try:
            return await fn()
        except Exception as e:
            last_error = e
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2x, 4x, ... before the next attempt.
            await asyncio.sleep(base_delay * (2 ** (attempt - 1)))
    raise last_error  # unreachable; kept to satisfy type checkers
```
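The delay schedule above grows as `base_delay * 2 ** (attempt - 1)`. In production you would typically cap the delay and add jitter so many clients do not retry in lockstep; here is a minimal sketch of the "full jitter" variant (an assumed extension, not part of the helper above):

```python
import random


def backoff_delay(attempt: int, base_delay: float = 1.0, cap: float = 30.0) -> float:
    # Exponential backoff capped at `cap`, with full jitter: pick a uniform
    # delay between 0 and the exponential ceiling for this attempt.
    ceiling = min(cap, base_delay * (2 ** (attempt - 1)))
    return random.uniform(0.0, ceiling)


# The deterministic schedule run_with_retry uses for base_delay=1.0:
schedule = [1.0 * (2 ** (a - 1)) for a in range(1, 5)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0]
```

Full jitter trades predictable delays for better load spreading, which matters when many workers hit the same rate-limited endpoint.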
- Create an assistant agent using the modern AutoGen imports. This example uses the OpenAI chat completion client directly, which keeps the setup explicit and makes it easier to control retries around the conversation boundary.
```python
import asyncio

from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.openai import OpenAIChatCompletionClient

client = OpenAIChatCompletionClient(
    model="gpt-4o-mini",
)

agent = AssistantAgent(
    name="support_agent",
    model_client=client,
    system_message=(
        "You are a bank operations assistant. "
        "Answer concisely and ask for missing details when needed."
    ),
)
```
- Wrap the actual task execution in your retry helper. In production, this is where you catch network failures, 429s, or upstream service hiccups while keeping your application logic clean.
```python
async def ask_agent(prompt: str) -> str:
    result = await agent.run(task=prompt)
    return result.messages[-1].content


async def main() -> None:
    answer = await run_with_retry(
        lambda: ask_agent("Summarize the steps required to freeze a debit card."),
        max_attempts=3,
        base_delay=1.5,
    )
    print(answer)


if __name__ == "__main__":
    asyncio.run(main())
```
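You can see the recovery behavior without calling a real model by pointing the helper at a flaky async stub. This is a self-contained demo (the helper is repeated in compact form, and `flaky_agent_call` is a hypothetical stand-in for `ask_agent`):

```python
import asyncio


async def run_with_retry(fn, max_attempts=4, base_delay=0.01):
    # Compact version of the helper above, repeated so this demo runs standalone.
    for attempt in range(1, max_attempts + 1):
        try:
            return await fn()
        except Exception:
            if attempt == max_attempts:
                raise
            await asyncio.sleep(base_delay * (2 ** (attempt - 1)))

calls = {"count": 0}


async def flaky_agent_call() -> str:
    # Stand-in for ask_agent(): fails twice, then answers.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "card frozen: steps summarized"

answer = asyncio.run(run_with_retry(flaky_agent_call))
print(answer)          # card frozen: steps summarized
print(calls["count"])  # 3
```

The stub fails exactly twice, so a `max_attempts` of 4 recovers it; dropping `max_attempts` to 2 would re-raise the `ConnectionError` instead.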
- If your workflow includes tools, apply retry at the tool boundary too. That gives you finer control, because some failures should be retried at the API-call level while others should bubble up immediately.
```python
import random


async def lookup_policy_status(policy_id: str) -> str:
    # Simulate a flaky downstream service that fails about half the time.
    if random.random() < 0.5:
        raise RuntimeError("Transient policy service failure")
    return f"Policy {policy_id} is active"


async def call_tool_with_retry(policy_id: str) -> str:
    return await run_with_retry(
        lambda: lookup_policy_status(policy_id),
        max_attempts=5,
        base_delay=0.5,
    )
```
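To exercise the tool-boundary retry deterministically, seed the random generator so the simulated service fails on the first call and succeeds on the second. Everything below is redefined so the snippet runs standalone (the policy ID `"P-1001"` is made up for the demo, and `base_delay` is zeroed to keep it fast):

```python
import asyncio
import random


async def lookup_policy_status(policy_id: str) -> str:
    # Same simulated flaky dependency as the step above.
    if random.random() < 0.5:
        raise RuntimeError("Transient policy service failure")
    return f"Policy {policy_id} is active"


async def run_with_retry(fn, max_attempts=5, base_delay=0.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return await fn()
        except Exception:
            if attempt == max_attempts:
                raise
            await asyncio.sleep(base_delay * (2 ** (attempt - 1)))

random.seed(1)  # with this seed the first call fails and the second succeeds
result = asyncio.run(run_with_retry(lambda: lookup_policy_status("P-1001")))
print(result)  # Policy P-1001 is active
```

The wrapped lookup survives the simulated outage, while the raw `lookup_policy_status` would have surfaced the `RuntimeError` to the agent.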
- Add failure classification before retrying. Do not blindly retry everything; authentication errors, invalid prompts, and schema problems should fail fast, because retrying them just burns tokens and hides defects.
```python
def is_retryable(exc: Exception) -> bool:
    # Heuristic: treat common transient-failure phrases as retryable.
    message = str(exc).lower()
    retryable_markers = [
        "timeout",
        "rate limit",
        "429",
        "temporarily unavailable",
        "connection reset",
    ]
    return any(marker in message for marker in retryable_markers)


async def run_with_filtered_retry(fn, max_attempts: int = 4):
    for attempt in range(1, max_attempts + 1):
        try:
            return await fn()
        except Exception as e:
            if attempt == max_attempts or not is_retryable(e):
                raise
            await asyncio.sleep(2 ** (attempt - 1))
```
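A quick sanity check of the classifier (redefined here so the snippet is standalone). Keep in mind that string matching is a heuristic; provider-specific exception classes are more reliable where your model client exposes them:

```python
def is_retryable(exc: Exception) -> bool:
    message = str(exc).lower()
    retryable_markers = [
        "timeout",
        "rate limit",
        "429",
        "temporarily unavailable",
        "connection reset",
    ]
    return any(marker in message for marker in retryable_markers)


print(is_retryable(RuntimeError("Rate limit exceeded, retry later")))  # True
print(is_retryable(RuntimeError("HTTP 429 Too Many Requests")))        # True
print(is_retryable(ValueError("Invalid API key")))                     # False
```

The last case is the important one: an invalid key will never succeed on retry, so it should fail on the first attempt.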
Testing It
Run the script against a prompt that is stable and confirm you get a normal response from the assistant on the first try. Then simulate failures by forcing your tool function to raise an exception or by temporarily disconnecting network access so you can see the backoff behavior kick in.
Watch for two things: retries should stop after `max_attempts`, and non-retryable errors should fail immediately instead of looping. If you’re logging in production, emit the attempt number, exception type, and delay so ops can tell whether failures are transient or systemic.
Next Steps
- Add structured logging with request IDs so retries can be traced across services.
- Replace generic exception handling with provider-specific error classes from your model client.
- Extend this pattern into a circuit breaker so repeated failures stop traffic before they cascade.
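As a starting point for that last item, a minimal circuit breaker might look like this. It is an illustrative sketch, with the thresholds, timing, and half-open probing simplified:

```python
import time


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures;
    allows a trial call again after `reset_timeout` seconds."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            # Reset window elapsed: close the breaker and allow a trial call.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
breaker.record_failure()
breaker.record_failure()
print(breaker.allow())  # False: the breaker is open, traffic is stopped
```

Check `breaker.allow()` before calling `run_with_retry`, and feed each outcome back via `record_success`/`record_failure`, so repeated failures short-circuit instead of burning the full retry budget every time.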
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.