CrewAI Tutorial (Python): adding observability for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows how to wire observability into a CrewAI project so you can trace agent runs, inspect tool calls, and debug failures without guessing. You need this when your crew moves past toy demos and starts handling real workflows where latency, retries, and bad tool outputs matter.

What You'll Need

  • Python 3.10+
  • A CrewAI project with crewai installed
  • An OpenAI API key for the LLM
  • Optional but recommended: a Langfuse account for tracing
  • Environment variables set locally or in your deployment platform
  • Basic familiarity with Agent, Task, and Crew in CrewAI

Step-by-Step

  1. Install the dependencies and keep observability libraries pinned. For production work, don’t install tracing packages ad hoc inside notebooks; make them part of your lockfile so traces are reproducible across environments.
pip install crewai langfuse python-dotenv
  2. Create a .env file with your model and observability settings. CrewAI will read the LLM key, and Langfuse will pick up its own credentials from environment variables.
OPENAI_API_KEY=your_openai_api_key
LANGFUSE_PUBLIC_KEY=your_langfuse_public_key
LANGFUSE_SECRET_KEY=your_langfuse_secret_key
LANGFUSE_HOST=https://cloud.langfuse.com
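Misconfigured credentials are the most common reason traces silently never arrive, so it helps to fail fast at startup. A minimal sketch, assuming the variable names from the .env file above (the `missing_vars` helper is illustrative, not part of CrewAI or Langfuse):

```python
import os

# The variables this tutorial's .env file defines.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_HOST",
)

def missing_vars(env=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

# With an empty mapping, every required name is reported missing.
print(missing_vars({}))
```

Call `missing_vars()` once after `load_dotenv()` and abort with a clear message if the list is non-empty.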
  3. Initialize Langfuse before you create agents or crews. The key point is ordering: instrumentation needs to be active before the first model call; otherwise the earliest spans are never captured.
from dotenv import load_dotenv
load_dotenv()

from langfuse import Langfuse

langfuse = Langfuse()
langfuse.auth_check()

print("Langfuse connected")
  4. Build a small crew with one tool-backed agent so you can see both LLM activity and tool execution in traces. This example uses a deterministic Python function as a tool, which is easier to verify than a network call.
from crewai import Agent, Task, Crew, Process
from crewai.tools import tool

@tool("calculate_risk_score")
def calculate_risk_score(amount: int) -> str:
    """Return a simple risk label for an amount."""
    if amount >= 100000:
        return "high"
    if amount >= 25000:
        return "medium"
    return "low"

analyst = Agent(
    role="Risk Analyst",
    goal="Assess transaction risk clearly",
    backstory="You review transactions for operational risk.",
    tools=[calculate_risk_score],
    verbose=True,
)

task = Task(
    description="Assess the risk of a 50000 USD transfer and explain why.",
    expected_output="A concise risk assessment with the label and reasoning.",
    agent=analyst,
)

crew = Crew(
    agents=[analyst],
    tasks=[task],
    process=Process.sequential,
    verbose=True,
)
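Because the tool is deterministic, its thresholds are easy to verify before an agent ever calls it. Since the @tool decorator wraps the function in CrewAI's tool object, the sketch below duplicates the same logic as a plain function (`risk_label` is a stand-in name) so it can be called directly:

```python
def risk_label(amount: int) -> str:
    """Mirror of calculate_risk_score's thresholds, callable directly."""
    if amount >= 100_000:
        return "high"
    if amount >= 25_000:
        return "medium"
    return "low"

# Boundary cases: one unit below and exactly at each threshold.
for amount in (24_999, 25_000, 99_999, 100_000):
    print(amount, risk_label(amount))
```

Checking boundaries like this catches off-by-one mistakes in tool logic before they show up as confusing agent reasoning in traces.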
  5. Add explicit trace metadata around the run so you can correlate one crew execution with logs, alerts, or user sessions. In real systems, this is where you attach request IDs, customer IDs, or workflow names.
import uuid

run_id = str(uuid.uuid4())

result = crew.kickoff(inputs={
    "run_id": run_id,
})

print("\n=== Crew Result ===")
print(result)
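The correlation fields described above are easiest to keep consistent if you build them once and reuse the same dict everywhere. A sketch, with `tenant_id` and `workflow` as hypothetical field names for illustration:

```python
import uuid

def build_run_context(tenant_id: str, workflow: str) -> dict:
    """Bundle correlation fields once so every log line and span agrees."""
    return {
        "run_id": str(uuid.uuid4()),
        "tenant_id": tenant_id,
        "workflow": workflow,
    }

ctx = build_run_context("tenant-7", "risk-review")
print(sorted(ctx))  # the three correlation keys, in sorted order
```

Pass the same `ctx` to your logger, your tracer, and kickoff inputs so one identifier threads through all three.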
  6. If you want stronger observability than passive tracing, wrap the kickoff in your own application span and log the important inputs and outputs. This gives you one place to attach business context even when CrewAI internals change.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("crew_observability")

# Note: "extra" fields are only visible if your log formatter includes
# them; in production, pair this with a structured (e.g. JSON) formatter.
logger.info("starting_crew_run", extra={"run_id": run_id})
result = crew.kickoff(inputs={"run_id": run_id})
logger.info("finished_crew_run", extra={"run_id": run_id, "output": str(result)})
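The two log calls above can also be folded into one application span that times the run and records failures automatically. A minimal sketch using only the standard library (the span name and run_id value are illustrative; the commented-out kickoff is where the real call would go):

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("crew_observability")

@contextmanager
def crew_span(name: str, run_id: str):
    """Time a block, log its outcome with run_id, and re-raise failures."""
    start = time.perf_counter()
    logger.info("%s started run_id=%s", name, run_id)
    try:
        yield
        logger.info("%s succeeded run_id=%s elapsed=%.2fs",
                    name, run_id, time.perf_counter() - start)
    except Exception:
        logger.exception("%s failed run_id=%s elapsed=%.2fs",
                         name, run_id, time.perf_counter() - start)
        raise

with crew_span("risk_assessment", "run-123"):
    pass  # result = crew.kickoff(inputs={"run_id": "run-123"})
```

Re-raising inside the span matters: the span records the failure, but callers still see the original exception instead of a silently swallowed error.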

Testing It

Run the script once from your terminal and confirm two things: the console output shows the final crew response, and Langfuse receives a new trace for that execution. If the trace is missing, check that .env loaded correctly before Langfuse() was instantiated.

Then force an error on purpose, for example by passing a malformed tool input or breaking the API key. A good observability setup should make the failure mode obvious: no silent hangs, no empty traces, and enough context to identify whether the issue came from the prompt, tool logic, or model access.

If you’re using verbose mode correctly, you should also see agent-level reasoning steps in stdout while Langfuse captures structured telemetry separately. That split matters because stdout is useful locally, but traces are what you use when debugging production incidents.

Next Steps

  • Add custom tags and metadata per tenant or workflow so traces are searchable in production.
  • Instrument tool functions with their own spans when they call databases, queues, or internal APIs.
  • Export traces into your incident pipeline so failed crew runs page on-call with full context.
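The second bullet can be sketched as a decorator that wraps any tool function in a timing span. This version logs with the standard library; in production you would swap the logger calls for a Langfuse or OpenTelemetry span. `lookup_account` and the logger name are hypothetical examples:

```python
import functools
import logging
import time

logger = logging.getLogger("tool_spans")

def traced_tool(fn):
    """Wrap a tool function so every call emits a timing log entry."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            logger.info("tool=%s elapsed=%.3fs",
                        fn.__name__, time.perf_counter() - start)
    return wrapper

@traced_tool
def lookup_account(account_id: str) -> dict:
    # Hypothetical tool body; a real one would query a database or API.
    return {"account_id": account_id, "status": "active"}
```

Using functools.wraps preserves the function's name and docstring, which matters because CrewAI surfaces both to the LLM when describing available tools.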


By Cyprian Aarons, AI Consultant at Topiax.
