AutoGen Tutorial (Python): filtering toxic output for beginners

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to add a toxic-output filter to an AutoGen Python agent pipeline. You need this when your agent generates customer-facing text, internal chat replies, or support responses, and you want a hard stop before harmful language reaches anyone.

What You'll Need

  • Python 3.10+
  • pyautogen installed
  • An OpenAI API key set in your environment
  • Basic familiarity with AutoGen agents and ConversableAgent
  • A place to run Python scripts locally

Install the package:

pip install pyautogen

Set your API key:

export OPENAI_API_KEY="your-key-here"

Step-by-Step

  1. Start with a normal AutoGen assistant agent. This gives you a baseline agent that can generate replies before you add any safety layer (a quick sanity check follows the code).
import os
from autogen import AssistantAgent

llm_config = {
    "model": "gpt-4o-mini",
    "api_key": os.environ["OPENAI_API_KEY"],
}

assistant = AssistantAgent(
    name="assistant",
    llm_config=llm_config,
)
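
To confirm the baseline works before adding any filtering, you can ask the agent for a reply directly. This is plain generate_reply usage, the same call the guard wraps later:

reply = assistant.generate_reply(
    messages=[{"role": "user", "content": "Say hello in one sentence."}]
)
print(reply)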
  2. Add a simple toxicity detector as a separate function. For beginner projects, a keyword-based filter is enough to block obvious abuse, threats, and slurs before the response leaves your app (a word-boundary variant is sketched after this step's code).
TOXIC_PATTERNS = [
    "kill yourself",
    "stupid",
    "idiot",
    "hate you",
    "trash",
]

def is_toxic(text: str) -> bool:
    lowered = text.lower()
    return any(pattern in lowered for pattern in TOXIC_PATTERNS)
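
One caveat with plain substring matching: it also fires on harmless words that merely contain a pattern ("trash" inside "trashcan"). Here is a minimal word-boundary variant, reusing the same TOXIC_PATTERNS list (is_toxic_strict is just an illustrative name):

import re

# One compiled regex per phrase, anchored on word boundaries so
# "trash" matches but "trashcan" does not.
TOXIC_REGEXES = [
    re.compile(r"\b" + re.escape(pattern) + r"\b", re.IGNORECASE)
    for pattern in TOXIC_PATTERNS
]

def is_toxic_strict(text: str) -> bool:
    return any(regex.search(text) for regex in TOXIC_REGEXES)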
  3. Wrap the agent reply in a guard function. The key idea is to inspect the generated text first, then either return it or replace it with a safe fallback.
def safe_generate_reply(agent, message: str) -> str:
    reply = agent.generate_reply(messages=[{"role": "user", "content": message}])

    if isinstance(reply, dict):
        content = reply.get("content", "")
    else:
        content = str(reply)

    if is_toxic(content):
        return "I can't help with that request."
    
    return content
  4. Put the guard behind a small runner script. This makes it easy to test from the command line and later move into an API endpoint, webhook handler, or chat service.
def main():
    user_message = "Write a rude response to the customer."
    response = safe_generate_reply(assistant, user_message)
    print("Assistant:", response)

if __name__ == "__main__":
    main()
  5. If you want stronger filtering, add a second pass using another model call or a moderation API. In production, I usually combine cheap keyword checks with model-based moderation for fewer false negatives (a concrete moderation-endpoint sketch follows this step's code).
def moderate_output(text: str) -> bool:
    # Placeholder for a real moderation service.
    # Return True if the text is allowed.
    return not is_toxic(text)

def safe_generate_reply_with_moderation(agent, message: str) -> str:
    reply = agent.generate_reply(messages=[{"role": "user", "content": message}])

    if isinstance(reply, dict):
        content = reply.get("content", "")
    else:
        content = str(reply)

    if not moderate_output(content):
        return "I can't help with that request."

    return content
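
If you want a concrete implementation of moderate_output, here is a minimal sketch against OpenAI's moderation endpoint. It assumes the openai Python SDK (v1+) is installed and OPENAI_API_KEY is set; treat the model name as an assumption and check the current docs:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def moderate_output(text: str) -> bool:
    # Return True if the text is allowed (i.e., not flagged).
    result = client.moderations.create(
        model="omni-moderation-latest",  # assumed model name; verify before use
        input=text,
    )
    return not result.results[0].flagged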

Testing It

Run the script once with a harmless prompt and once with an obviously toxic prompt. You should see normal output for safe prompts and the fallback message for toxic ones.

Try changing TOXIC_PATTERNS to match language relevant to your domain, especially if you're filtering customer support or employee chat. If you move this into production, log blocked outputs so you can review false positives and improve the rules.
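
A minimal sketch of that logging step, extending the guard from step 3 with the standard library's logging module (the logger name is illustrative):

import logging

logger = logging.getLogger("toxicity_filter")

def safe_generate_reply(agent, message: str) -> str:
    reply = agent.generate_reply(messages=[{"role": "user", "content": message}])
    content = reply.get("content", "") if isinstance(reply, dict) else str(reply)

    if is_toxic(content):
        # Keep the blocked text out of the user-facing path, but record it
        # so you can review false positives and tune TOXIC_PATTERNS.
        logger.warning("Blocked toxic output for prompt %r: %r", message, content)
        return "I can't help with that request."

    return content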

A good test case is to ask the assistant for “a professional apology email” versus “write an insulting reply.” The first should pass through; the second should get blocked by your filter.
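
You can also exercise the filter directly, without any API calls, using a couple of quick assertions against the keyword list from step 2:

# The apology text should pass; the insult matches "idiot" and "hate you".
assert not is_toxic("Here is a draft of a professional apology email.")
assert is_toxic("You are an idiot and I hate you.")
print("filter checks passed")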

Next Steps

  • Replace the keyword filter with OpenAI moderation or another policy engine.
  • Add allowlists for approved response templates in high-risk workflows.
  • Move the filter into an AutoGen multi-agent setup so one agent generates and another validates before delivery (a minimal two-agent sketch follows this list).
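
A minimal sketch of that last idea, with one agent generating and a second agent voting on the draft. The reviewer's name and system message are assumptions, not a fixed AutoGen pattern:

from autogen import AssistantAgent

reviewer = AssistantAgent(
    name="reviewer",
    system_message=(
        "You review a draft reply. Respond with exactly APPROVE if it is "
        "professional and non-toxic, otherwise respond with exactly REJECT."
    ),
    llm_config=llm_config,
)

def reviewed_reply(message: str) -> str:
    draft = safe_generate_reply(assistant, message)
    verdict = reviewer.generate_reply(
        messages=[{"role": "user", "content": f"Draft reply:\n{draft}"}]
    )
    text = verdict.get("content", "") if isinstance(verdict, dict) else str(verdict)
    # Only deliver the draft if the reviewer explicitly approved it.
    return draft if "APPROVE" in text else "I can't help with that request."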
