CrewAI Tutorial (Python): testing agents locally for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to run CrewAI agents locally, swap live LLM calls for deterministic test doubles, and verify agent behavior without burning API credits. You need this when you want fast feedback on prompts, tools, and task wiring before pushing anything into a shared environment.

What You'll Need

  • Python 3.10+
  • crewai
  • pytest
  • python-dotenv
  • An OpenAI-compatible API key if you want to run the same crew against a real model later
  • A local project folder with write access
  • Basic familiarity with Agent, Task, and Crew

Install the dependencies:

pip install crewai pytest python-dotenv

Step-by-Step

  1. Create a small project layout and keep your agent code isolated from your tests. That makes it easy to swap real model calls for local fakes without touching production code.
crew-local-test/
├── app.py
├── test_app.py
└── .env
  2. Define your crew in a way that accepts an injected LLM. For local testing, use a stubbed LLM that returns predictable output; for real runs, replace it with your provider config.
# app.py
from crewai import Agent, Task, Crew, Process

def build_crew(llm):
    analyst = Agent(
        role="Claims Analyst",
        goal="Summarize claim notes clearly",
        backstory="You review insurance claims and produce concise summaries.",
        llm=llm,
        verbose=False,
    )

    task = Task(
        description="Summarize the following claim note in 2 bullet points: Customer reported water damage.",
        expected_output="Two bullet points summarizing the note.",
        agent=analyst,
    )

    return Crew(
        agents=[analyst],
        tasks=[task],
        process=Process.sequential,
        verbose=False,
    )
  3. Build a local fake LLM for deterministic tests. CrewAI expects the injected LLM to expose a callable interface (the exact method name varies by version), so a small test double that returns fixed text is enough to validate the downstream behavior.
# test_app.py
from app import build_crew

class FakeLLM:
    def call(self, prompt, **kwargs):
        return "- Water damage was reported by the customer.\n- The claim requires assessment for cause and scope."

    # Some CrewAI versions invoke the LLM object directly; keep both paths working.
    __call__ = call

def test_crew_runs_locally():
    crew = build_crew(FakeLLM())
    result = crew.kickoff()
    text = str(result)

    assert "Water damage" in text
    assert "assessment" in text
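When you also need to assert on what the crew sent to the model, a recording fake is a small extension of the same idea. This is a sketch, assuming the injected LLM receives the prompt as its first argument; the class and attribute names are illustrative:

```python
class RecordingFakeLLM:
    """Test double that returns canned replies and records every prompt it receives."""

    def __init__(self, replies):
        self.replies = list(replies)
        self.prompts = []  # every prompt the crew sent, in order

    def call(self, prompt, **kwargs):
        self.prompts.append(prompt)
        # Return canned replies in order; repeat the last one if the crew asks again.
        return self.replies.pop(0) if len(self.replies) > 1 else self.replies[0]

    __call__ = call  # some CrewAI versions invoke the LLM object directly


# Standalone usage: two canned replies, two calls
fake = RecordingFakeLLM(["- First bullet.\n- Second bullet.", "done"])
first = fake.call("Summarize the claim note.")
second = fake.call("Refine the summary.")
```

After `crew.kickoff()`, asserting on `fake.prompts` lets you catch prompt regressions (for example, a task description that silently stopped being interpolated) without any network call.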
  4. Add a real-model path for manual local verification. This lets you compare the fake output against an actual provider when you want to test prompt quality or tool behavior.
# app.py
import os

from dotenv import load_dotenv

load_dotenv()

if __name__ == "__main__":
    from crewai import LLM

    # Real-model path: requires OPENAI_API_KEY in .env or the environment.
    llm = LLM(
        model="gpt-4o-mini",
        api_key=os.getenv("OPENAI_API_KEY"),
    )

    crew = build_crew(llm)
    result = crew.kickoff()
    print(result)
  5. Run the tests locally and keep them fast. The point is to catch broken task definitions, prompt regressions, and bad assumptions before you hit the network.
pytest -q
  6. If you want stronger checks, assert on structure instead of just substrings. For advanced teams, that usually means validating formatting, length limits, or JSON-shaped output from tasks.
# test_app.py
from app import build_crew

class FakeLLM:
    def call(self, prompt, **kwargs):
        return "- Water damage was reported.\n- Needs adjuster review."

    __call__ = call  # some CrewAI versions invoke the LLM object directly

def test_output_has_two_bullets():
    crew = build_crew(FakeLLM())
    result = str(crew.kickoff())
    lines = [line for line in result.splitlines() if line.strip()]
    assert len(lines) == 2
    assert all(line.startswith("- ") for line in lines)
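For tasks that should emit JSON, the same pattern works with a parse-then-assert helper. A minimal sketch, assuming a hypothetical `{"bullets": [...]}` output shape; adapt the schema to whatever your task's expected_output promises:

```python
import json


def assert_summary_json(raw: str) -> dict:
    """Parse JSON-shaped agent output and check its structure, not its wording."""
    data = json.loads(raw)  # raises ValueError on malformed output
    bullets = data["bullets"]
    assert isinstance(bullets, list) and len(bullets) == 2
    assert all(isinstance(b, str) and b.strip() for b in bullets)
    return data


# Canned reply a JSON-oriented FakeLLM might return
fake_reply = '{"bullets": ["Water damage was reported.", "Needs adjuster review."]}'
parsed = assert_summary_json(fake_reply)
```

Parsing first means a malformed reply fails loudly at `json.loads` instead of slipping past a substring check.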

Testing It

Start with pytest -q and make sure the fake LLM test passes consistently. Then run python app.py with a valid OPENAI_API_KEY in .env to confirm the same crew works against a live model.

If the test passes but the live run fails, the issue is usually in provider config or prompt expectations, not your Python wiring. If both pass but output quality is poor, tighten the task description and add assertions around format or content.

For local debugging, print repr(result) or convert it with str(result) so you can inspect exactly what CrewAI returned. That matters because many failures are not exceptions; they're just low-quality outputs that still complete without error.

Next Steps

  • Add tool testing with mocked HTTP clients so your agents can call internal services without hitting real endpoints.
  • Move from string assertions to schema validation using Pydantic or JSON parsing for structured outputs.
  • Split crews into reusable fixtures so multiple tests can reuse the same agent setup with different fake responses.
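The first bullet above can be sketched with a dependency-injected client and unittest.mock. The service URL, tool function, and response shape here are all hypothetical stand-ins for your internal API:

```python
from unittest.mock import Mock


def fetch_claim_status(claim_id: str, client) -> str:
    """Hypothetical agent tool: look up a claim via an internal HTTP service."""
    resp = client.get(f"https://claims.internal/api/claims/{claim_id}")
    return resp.json()["status"]


def test_fetch_claim_status_never_hits_the_network():
    client = Mock()
    client.get.return_value.json.return_value = {"status": "open"}

    assert fetch_claim_status("C-123", client) == "open"
    client.get.assert_called_once_with("https://claims.internal/api/claims/C-123")


test_fetch_claim_status_never_hits_the_network()
```

Passing the client in as a parameter (rather than importing it inside the tool) is what makes the mock swap trivial; the same injection style used for the LLM above applies to every external dependency.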


By Cyprian Aarons, AI Consultant at Topiax.
