CrewAI Tutorial (Python): implementing retry logic for beginners

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to add retry logic to a CrewAI Python workflow so failed agent calls can be retried automatically instead of breaking the whole run. You need this when an LLM request times out, a tool call fails transiently, or an external API returns a temporary error.

What You'll Need

  • Python 3.10+
  • crewai
  • python-dotenv
  • An OpenAI API key in your environment
  • Basic familiarity with Agent, Task, and Crew in CrewAI
  • A terminal and a virtual environment

Install the packages:

pip install crewai python-dotenv

Set your API key:

export OPENAI_API_KEY="your-key-here"

Step-by-Step

  1. Start with a minimal CrewAI setup. The retry logic will wrap the crew execution, so keep the agent and task simple first.
from dotenv import load_dotenv
from crewai import Agent, Task, Crew, Process

load_dotenv()

researcher = Agent(
    role="Researcher",
    goal="Answer questions accurately",
    backstory="You are careful and concise.",
    verbose=True,
)

task = Task(
    description="Explain what retry logic is in one paragraph.",
    expected_output="A short explanation of retry logic.",
    agent=researcher,
)

crew = Crew(
    agents=[researcher],
    tasks=[task],
    process=Process.sequential,
)
  2. Add a retry wrapper around crew.kickoff(). This is the simplest production pattern: catch transient failures, wait briefly, then try again with exponential backoff.
import time

def run_with_retries(crew: Crew, max_attempts: int = 3, base_delay: float = 2.0):
    last_error = None

    for attempt in range(1, max_attempts + 1):
        try:
            print(f"Attempt {attempt}/{max_attempts}")
            return crew.kickoff()
        except Exception as exc:
            last_error = exc
            if attempt == max_attempts:
                break

            delay = base_delay * (2 ** (attempt - 1))
            print(f"Run failed: {exc}")
            print(f"Retrying in {delay:.1f} seconds...")
            time.sleep(delay)

    raise last_error
  3. Run the crew through the wrapper instead of calling kickoff() directly. Keep the output handling simple so you can see whether the retry path was used.
if __name__ == "__main__":
    result = run_with_retries(crew, max_attempts=3, base_delay=1.5)
    print("\nFinal result:")
    print(result)
  4. If your failure is coming from a tool rather than the LLM call itself, wrap the tool too. This keeps retries local to the failing dependency instead of rerunning everything blindly.
from crewai.tools import BaseTool

class FlakyLookupTool(BaseTool):
    name: str = "flaky_lookup"
    description: str = "Simulates an unreliable external lookup"

    def _run(self, query: str) -> str:
        if "fail" in query.lower():
            raise RuntimeError("Temporary upstream error")
        return f"Result for: {query}"
  5. Attach the tool to an agent and let the same retry wrapper handle temporary failures. In real projects, you would use this pattern for HTTP calls, database reads, or vendor APIs that occasionally fail.
tool_agent = Agent(
    role="Support Analyst",
    goal="Use tools safely and answer clearly",
    backstory="You verify data before responding.",
    tools=[FlakyLookupTool()],
    verbose=True,
)

tool_task = Task(
    description="Use flaky_lookup on the query 'please fail once'.",
    expected_output="A response based on tool output.",
    agent=tool_agent,
)

tool_crew = Crew(
    agents=[tool_agent],
    tasks=[tool_task],
    process=Process.sequential,
)

Testing It

Run the script once with a normal prompt and confirm it completes on the first attempt. Then change the task or tool input so it triggers a failure and watch the console show multiple attempts with increasing delays.

If you want to test the wrapper itself without waiting on real API errors, temporarily raise an exception inside _run() or inside run_with_retries(). You should see the function retry until it reaches max_attempts, then raise the final exception.
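You can also exercise the wrapper with no API calls at all. Below is a minimal, self-contained sketch: FlakyStubCrew is hypothetical test scaffolding that mimics Crew.kickoff() by failing twice before succeeding, paired with the same run_with_retries logic from above (with a tiny base delay so the test runs fast).

```python
import time

class FlakyStubCrew:
    """Stub standing in for Crew: kickoff() fails `failures` times, then succeeds."""

    def __init__(self, failures: int = 2):
        self.failures = failures
        self.calls = 0

    def kickoff(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError("simulated transient error")
        return "ok"

def run_with_retries(crew, max_attempts: int = 3, base_delay: float = 0.01):
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return crew.kickoff()
        except Exception as exc:
            last_error = exc
            if attempt == max_attempts:
                break
            # Exponential backoff: delay doubles with each failed attempt
            time.sleep(base_delay * (2 ** (attempt - 1)))
    raise last_error

stub = FlakyStubCrew(failures=2)
result = run_with_retries(stub, max_attempts=3)
print(result, stub.calls)  # prints: ok 3
```

Two failures followed by a success means the wrapper returns on the third call, which you can confirm from stub.calls.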

For production use, log each attempt with structured fields like attempt, delay, task_name, and error_type. That makes it much easier to debug rate limits versus genuine application bugs.
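One way to sketch that logging, using Python's standard logging module and the same wrapper shape as above (the crew_retries logger name and the task_name parameter are arbitrary choices for illustration):

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("crew_retries")

def run_with_retries(crew, max_attempts: int = 3, base_delay: float = 2.0,
                     task_name: str = "crew_run"):
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return crew.kickoff()
        except Exception as exc:
            last_error = exc
            delay = base_delay * (2 ** (attempt - 1))
            # Structured fields: attempt, delay, task_name, error_type
            logger.warning(
                "retry attempt=%d/%d delay=%.1fs task_name=%s error_type=%s",
                attempt, max_attempts, delay, task_name, type(exc).__name__,
            )
            if attempt == max_attempts:
                break
            time.sleep(delay)
    raise last_error
```

Logging error_type rather than the full message makes it easy to aggregate: a spike of RateLimitError entries points at throttling, while ValueError or KeyError points at a bug in your own code.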

Next Steps

  • Add retry filtering so you only retry transient errors like timeouts and rate limits
  • Move retry settings into environment variables or config files
  • Add idempotency checks if your crew triggers side effects like sending emails or writing records
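The first bullet can be sketched by narrowing the except clause so non-transient errors propagate immediately. This is an illustrative variant of run_with_retries: TRANSIENT_ERRORS is a name introduced here for the sketch, and in practice you would add your LLM provider's rate-limit and timeout exception types to it.

```python
import time

# Assumption for this sketch: transient failures surface as one of these types.
# Extend the tuple with your provider's rate-limit/timeout exceptions.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)

def run_with_retries(crew, max_attempts: int = 3, base_delay: float = 2.0):
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return crew.kickoff()
        except TRANSIENT_ERRORS as exc:
            # Only transient errors reach here; anything else (e.g. a bug
            # raising ValueError) propagates on the first attempt.
            last_error = exc
            if attempt == max_attempts:
                break
            time.sleep(base_delay * (2 ** (attempt - 1)))
    raise last_error
```

With this filter, a genuine application bug fails fast instead of burning three slow attempts on an error that retrying can never fix.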

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

