CrewAI Tutorial (Python): filtering toxic output for intermediate developers
This tutorial shows you how to add a toxicity filter to a CrewAI workflow in Python so unsafe or abusive agent output gets blocked before it reaches users. You need this when an agent is generating customer-facing text, support replies, or internal recommendations and you want a deterministic guardrail between generation and delivery.
What You'll Need
- Python 3.10+
- crewai
- python-dotenv
- An LLM API key for the model you want CrewAI to use, such as OPENAI_API_KEY
- Basic familiarity with Agent, Task, Crew, and Process
- A terminal and a virtual environment
Install the packages:
pip install crewai python-dotenv
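Put your key in a .env file in the project directory. A quick sanity check before building the crew might look like the sketch below; it assumes an OpenAI-backed model, so the variable checked is OPENAI_API_KEY.
import os
from dotenv import load_dotenv

load_dotenv()  # reads variables from a .env file in the current directory
if not os.getenv("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set; add it to your .env file")
print("API key found.")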
Step-by-Step
- Start by loading your API key and defining two agents: one to generate content and one to review it for toxic language. The second agent is not there to “be smart”; it exists to make a binary decision about whether the text is safe enough to pass through.
from dotenv import load_dotenv
from crewai import Agent, Task, Crew, Process
load_dotenv()
writer = Agent(
    role="Customer Support Writer",
    goal="Draft concise customer-facing responses",
    backstory="You write clear, professional support replies.",
    verbose=True,
)
toxicity_reviewer = Agent(
    role="Toxicity Reviewer",
    goal="Detect toxic, abusive, insulting, or hostile language",
    backstory="You are strict and only allow safe text through.",
    verbose=True,
)
- Create a writing task and a review task. The key pattern here is that the reviewer receives the writer’s output and returns a structured verdict that your application can enforce.
write_task = Task(
    description=(
        "Write a short response to an upset customer asking for a refund. "
        "Keep it professional and under 80 words."
    ),
    expected_output="A customer support reply in plain English.",
    agent=writer,
)
review_task = Task(
    description=(
        "Review the draft for toxic, insulting, rude, aggressive, or abusive language. "
        "Return exactly one of these values: SAFE or UNSAFE."
    ),
    expected_output="SAFE or UNSAFE only.",
    agent=toxicity_reviewer,
    context=[write_task],
)
- Run the crew sequentially so the reviewer sees the writer’s output. For production use, I prefer sequential flow here because it keeps the control path simple and makes post-processing predictable.
crew = Crew(
    agents=[writer, toxicity_reviewer],
    tasks=[write_task, review_task],
    process=Process.sequential,
    verbose=True,
)
result = crew.kickoff()
print(result)
- Add a hard gate in your application code. CrewAI can generate the review result, but your app should be the one deciding whether to return content or block it.
review_text = str(result).strip().upper()
# Check UNSAFE first: "SAFE" is a substring of "UNSAFE", so the order matters.
if "UNSAFE" in review_text:
    final_output = "Blocked: toxic output detected."
elif "SAFE" in review_text:
    final_output = "Approved: output passed toxicity review."
else:
    final_output = f"Blocked: unexpected reviewer response -> {review_text}"
print(final_output)
- If you want stricter behavior, force structured output from the reviewer and validate it with plain Python before releasing any text. This reduces ambiguity when the model tries to be helpful instead of following instructions exactly.
def is_safe_review(text: str) -> bool:
    normalized = text.strip().upper()
    return normalized == "SAFE"

reviewer_response = str(result).strip()
if is_safe_review(reviewer_response):
    print("Approved for delivery.")
else:
    print("Blocked for manual review.")
Testing It
Run the script once with a normal prompt and confirm you get SAFE from the reviewer and an approved final decision. Then change the writing task to request hostile wording like “respond angrily” or “insult the customer” and verify that your gate blocks delivery.
Also check edge cases where the reviewer returns extra text instead of just SAFE or UNSAFE. Your enforcement code should treat anything outside those exact values as blocked.
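One way to cover those edge cases without re-running the whole crew is to pull the gate into a small function and exercise it directly. The helper below is my own illustration, not part of CrewAI, and it mirrors the strict exact-match behavior from the last step.
def enforce_toxicity_gate(reviewer_text: str) -> bool:
    # Only the exact verdict SAFE (after trimming and uppercasing) is approved.
    return reviewer_text.strip().upper() == "SAFE"

# Anything other than the exact value falls through to blocked.
assert enforce_toxicity_gate("SAFE") is True
assert enforce_toxicity_gate("  safe ") is True
assert enforce_toxicity_gate("UNSAFE") is False
assert enforce_toxicity_gate("The draft is SAFE overall.") is False
assert enforce_toxicity_gate("") is False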
Next Steps
- Add a second moderation layer using an external safety API before sending text to users.
- Store blocked outputs in logs with request IDs so you can audit false positives (a minimal sketch follows this list).
- Extend this pattern to classify hate speech, harassment, self-harm, and policy violations separately instead of using one generic toxicity label.
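Here is a minimal sketch of the audit-log idea; the logger name and JSON fields are placeholders you would adapt to your own infrastructure.
import json
import logging
import uuid
from datetime import datetime, timezone

logger = logging.getLogger("toxicity_gate")

def log_blocked_output(draft: str, reviewer_verdict: str) -> str:
    # Attach a request ID so blocked drafts can be audited for false positives later.
    request_id = str(uuid.uuid4())
    logger.warning(json.dumps({
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "reviewer_verdict": reviewer_verdict,
        "draft": draft,
    }))
    return request_id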
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.