CrewAI Tutorial (Python): filtering toxic output for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows how to add a toxicity filter to a CrewAI workflow in Python so unsafe agent output gets blocked before it reaches a user, ticket, or downstream system. You need this when you’re building regulated or customer-facing agents and cannot afford profanity, harassment, threats, or other policy-violating text slipping through.

What You'll Need

  • Python 3.10+
  • crewai
  • openai
  • An OpenAI API key set as OPENAI_API_KEY
  • Optional but useful:
    • python-dotenv for local env loading
    • pytest for test coverage
  • A working CrewAI setup with at least one agent and one task

Step-by-Step

  1. Start with a minimal CrewAI project and install the dependencies.
    Keep the model choice simple while you build the safety layer first.
pip install crewai openai python-dotenv
export OPENAI_API_KEY="your-key-here"
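    If you installed python-dotenv, you can keep the key in a local .env file and load it at startup instead of exporting it in your shell. A minimal sketch, assuming a .env file in the project root that contains OPENAI_API_KEY:
# Optional: load OPENAI_API_KEY from a local .env file (requires python-dotenv).
from dotenv import load_dotenv

load_dotenv()  # reads values from .env into os.environ without overriding existing variables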
  2. Create a toxicity classifier function before the crew runs.
    This example uses OpenAI’s moderation endpoint directly, which is better than trying to guess toxicity with regexes or prompt instructions.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def is_toxic(text: str) -> bool:
    """Return True if the moderation endpoint flags the text in any category."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    flagged = result.results[0].flagged
    return bool(flagged)
  3. Define your CrewAI agent and task normally, then wrap execution with a guard.
    The key pattern is: run the crew, inspect the output, and reject or redact anything flagged as toxic.
from crewai import Agent, Task, Crew, Process

writer = Agent(
    role="Support Response Writer",
    goal="Write concise customer support replies",
    backstory="You write professional responses for banking support.",
    verbose=False,
)

task = Task(
    description="Draft a response to a frustrated customer asking for a refund.",
    expected_output="A polite customer support reply.",
    agent=writer,
)

crew = Crew(
    agents=[writer],
    tasks=[task],
    process=Process.sequential,
)

result = crew.kickoff()
output_text = str(result)
  4. Add a post-generation filter and fail closed.
    If the model output is toxic, do not forward it; replace it with a safe fallback message or route it for human review.
def safe_fallback() -> str:
    return "I’m unable to provide that response automatically. A human agent will review this request."

if is_toxic(output_text):
    final_output = safe_fallback()
else:
    final_output = output_text

print(final_output)
  5. If you want stronger control, add an input filter too.
    Filtering only the output is not enough in production because toxic prompts can still steer the agent into bad behavior.
user_input = "Write something insulting about the customer."

if is_toxic(user_input):
    raise ValueError("Blocked toxic user input")

task.description = f"Draft a response to this customer message: {user_input}"
result = crew.kickoff()
output_text = str(result)

if is_toxic(output_text):
    output_text = safe_fallback()

print(output_text)
  6. For production systems, isolate the safety check behind a small service boundary.
    That makes it easy to swap moderation providers later and keeps your CrewAI code focused on orchestration.
class ToxicityGate:
    def __init__(self):
        self.client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    def allow(self, text: str) -> bool:
        result = self.client.moderations.create(
            model="omni-moderation-latest",
            input=text,
        )
        return not result.results[0].flagged

gate = ToxicityGate()

if gate.allow(output_text):
    print(output_text)
else:
    print(safe_fallback())
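    If you expect to swap providers, one option is to inject the moderation backend into the gate instead of hard-coding the OpenAI client. The sketch below is illustrative rather than a CrewAI or OpenAI API: it reuses the os and OpenAI imports from step 2, and ModerationProvider, OpenAIModeration, and ProviderAwareToxicityGate are names invented for this example.
# A provider-agnostic variant of the gate. The class names here are
# illustrative, not library APIs; only the OpenAI moderation call is real.
from typing import Protocol

class ModerationProvider(Protocol):
    def is_flagged(self, text: str) -> bool: ...

class OpenAIModeration:
    def __init__(self) -> None:
        self.client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    def is_flagged(self, text: str) -> bool:
        result = self.client.moderations.create(
            model="omni-moderation-latest",
            input=text,
        )
        return bool(result.results[0].flagged)

class ProviderAwareToxicityGate:
    def __init__(self, provider: ModerationProvider) -> None:
        self.provider = provider

    def allow(self, text: str) -> bool:
        return not self.provider.is_flagged(text)

gate = ProviderAwareToxicityGate(OpenAIModeration())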

Testing It

Run the script with normal customer-support text first and confirm that clean responses pass through unchanged. Then change the task description or user input to include profanity, threats, or harassment and verify that the fallback path triggers instead of printing the raw output.

If you want more confidence, write tests around is_toxic() by mocking the moderation client and returning both flagged and unflagged results. In production, log every blocked request with request IDs so you can audit false positives without exposing unsafe content to users.
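
Here is a minimal pytest sketch of that approach. It assumes the filter code above lives in a module named toxicity_filter.py; the module name and the fake_moderation helper are illustrative, not part of any library.
# test_toxicity_filter.py: a minimal pytest sketch. The module name
# toxicity_filter is an assumption about how you organised the code above.
from unittest.mock import MagicMock

import toxicity_filter

def fake_moderation(flagged: bool) -> MagicMock:
    # Mimic the response shape used above: result.results[0].flagged
    response = MagicMock()
    response.results = [MagicMock(flagged=flagged)]
    return response

def test_is_toxic_when_flagged(monkeypatch):
    monkeypatch.setattr(
        toxicity_filter.client.moderations,
        "create",
        lambda **kwargs: fake_moderation(True),
    )
    assert toxicity_filter.is_toxic("hostile text") is True

def test_is_toxic_when_clean(monkeypatch):
    monkeypatch.setattr(
        toxicity_filter.client.moderations,
        "create",
        lambda **kwargs: fake_moderation(False),
    )
    assert toxicity_filter.is_toxic("a polite refund reply") is False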

Next Steps

  • Add structured logging around moderation decisions so you can trace blocked outputs by agent, task, and tenant (a minimal sketch follows this list).
  • Move from post-generation filtering to multi-stage guardrails: input screening, tool-call screening, and output screening.
  • Add human-in-the-loop review for borderline cases instead of hard-blocking everything flagged by policy thresholds.
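
As a starting point for the logging idea above, here is a minimal sketch. The field names and the log_decision helper are assumptions to adapt to your own stack, and it deliberately records metadata about the decision rather than the unsafe text itself.
# A sketch of structured logging for moderation decisions (field names are assumptions).
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("moderation")

def log_decision(allowed: bool, agent_role: str, task_description: str, text: str) -> None:
    # Record the decision plus enough metadata to audit it, without storing the text itself.
    logger.info(json.dumps({
        "request_id": str(uuid.uuid4()),
        "agent": agent_role,
        "task": task_description[:80],
        "allowed": allowed,
        "output_chars": len(text),
    }))

allowed = gate.allow(output_text)
log_decision(allowed, writer.role, task.description, output_text)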

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
