AutoGen Tutorial (Python): filtering toxic output for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows how to intercept AutoGen agent output, classify it for toxicity, and block or redact unsafe messages before they reach a user or downstream system. You need this when you’re building agent workflows for regulated environments where a model can generate harassment, hate, sexual content, or other policy-violating text that must never be displayed raw.

What You'll Need

  • Python 3.10+
  • pyautogen installed
  • An OpenAI-compatible API key
  • A moderation provider:
    • OpenAI Moderation API, or
    • your own classifier endpoint
  • Basic familiarity with:
    • AssistantAgent
    • UserProxyAgent
    • GroupChat / GroupChatManager
  • Optional but useful:
    • python-dotenv for local secrets
    • logging configured in your app

Step-by-Step

  1. Install the packages and set your API key.
    This example uses AutoGen plus the OpenAI Python SDK for moderation calls.
pip install pyautogen openai python-dotenv
export OPENAI_API_KEY="your-api-key"
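
If you'd rather keep the key in a local .env file than export it in your shell, python-dotenv (installed above) can load it for you. A minimal sketch, assuming a .env file next to your script containing OPENAI_API_KEY=...:

import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into os.environ

if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or export it")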
  2. Build a small moderation wrapper around the OpenAI Moderation API.
    Keep this separate from your agent logic so you can swap providers later without touching orchestration code.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def is_toxic(text: str) -> tuple[bool, dict]:
    """Classify text with the OpenAI Moderation API; return (flagged, categories)."""
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    result = response.results[0]
    flagged = bool(result.flagged)
    # by_alias=True keeps the API-style category names (e.g. "self-harm",
    # "hate/threatening") that the policy checks later in this tutorial match against.
    if hasattr(result.categories, "model_dump"):
        categories = result.categories.model_dump(by_alias=True)
    else:
        categories = dict(result.categories)
    return flagged, categories

sample = "I will destroy you."
flagged, categories = is_toxic(sample)
print({"flagged": flagged, "categories": categories})
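
Because the wrapper is isolated, swapping in your own classifier endpoint only means replacing is_toxic. The sketch below shows one hypothetical shape for that: it assumes an internal HTTP service at MODERATION_URL that accepts {"text": ...} and returns {"flagged": bool, "categories": {...}} (pip install httpx, or use requests). Adapt it to whatever contract your service actually exposes.

import os

import httpx

# Hypothetical internal endpoint; point this at your own classifier service.
MODERATION_URL = os.environ.get("MODERATION_URL", "http://localhost:8080/moderate")

def is_toxic_custom(text: str) -> tuple[bool, dict]:
    # Assumed request/response shape -- adjust to your API contract.
    response = httpx.post(MODERATION_URL, json={"text": text}, timeout=10.0)
    response.raise_for_status()
    payload = response.json()
    return bool(payload.get("flagged", False)), dict(payload.get("categories", {}))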
  3. Create an AutoGen assistant and a message filter that screens every assistant reply before it is returned.
    The important pattern here is not “trust the agent”; it’s “inspect the generated text before any handoff.”
import autogen

config_list = [{
    "model": "gpt-4o-mini",
    "api_key": os.environ["OPENAI_API_KEY"],
}]

llm_config = {
    "config_list": config_list,
    "temperature": 0,
}

assistant = autogen.AssistantAgent(
    name="assistant",
    llm_config=llm_config,
)

user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    # End the chat after the assistant's first reply and skip code execution,
    # so the proxy doesn't keep auto-replying with empty messages.
    max_consecutive_auto_reply=0,
    code_execution_config=False,
)

def sanitize_reply(reply: str) -> str:
    flagged, categories = is_toxic(reply)
    if flagged:
        return "[REDACTED: unsafe model output]"
    return reply
  4. Wrap the conversation loop so every assistant message is filtered before display or persistence.
    In production, this is where you’d also emit metrics and store the original text only in restricted audit logs.
def run_safe_chat(task: str) -> None:
    user_proxy.initiate_chat(assistant, message=task)

    last_message = assistant.last_message(user_proxy)
    if not last_message:
        print("No assistant output.")
        return

    content = last_message.get("content", "")
    safe_content = sanitize_reply(content)

    print("Raw output:")
    print(content)
    print("\nSafe output:")
    print(safe_content)

run_safe_chat("Write a short response insulting a difficult customer.")
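
The loop above only screens the final reply. If you need the "every assistant message" guarantee for persistence, recent pyautogen releases return a ChatResult from initiate_chat whose chat_history you can screen in full; a sketch, assuming that attribute is available in your version:

def run_safe_chat_full(task: str) -> list[str]:
    # initiate_chat returns a ChatResult in recent pyautogen releases;
    # chat_history is the ordered list of message dicts from the run.
    chat_result = user_proxy.initiate_chat(assistant, message=task)

    safe_history = []
    for message in chat_result.chat_history:
        content = message.get("content") or ""
        # Screen every message, not just the last one, before it is
        # displayed or persisted anywhere outside the restricted audit log.
        safe_history.append(sanitize_reply(content) if content else content)
    return safe_history

for line in run_safe_chat_full("Summarize our refund policy for an angry customer."):
    print(line)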
  5. Add a stricter policy layer for high-risk categories instead of a single binary flag.
    For bank and insurance workflows, you usually want different actions for harassment, self-harm, sexual content, and threats.
BLOCKED_CATEGORIES = {
    "harassment": True,
    "hate": True,
    "self-harm": True,
    "sexual": True,
}

def policy_action(text: str) -> str:
    flagged, categories = is_toxic(text)
    if not flagged:
        return text

    for category, enabled in BLOCKED_CATEGORIES.items():
        if enabled and categories.get(category):
            return f"[BLOCKED: {category}]"

    return "[REVIEW REQUIRED]"

test_cases = [
    "Hello, how can I help you today?",
    "You are worthless and stupid.",
]

for item in test_cases:
    print(policy_action(item))
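
If "block" versus "route to human review" needs to differ by category (the different actions mentioned above), one option is a category-to-action map instead of a plain block list. A sketch, assuming the API-style hyphenated category names returned by is_toxic:

# Map moderation categories to the action your policy requires.
CATEGORY_ACTIONS = {
    "harassment": "block",
    "harassment/threatening": "block",
    "hate": "block",
    "self-harm": "escalate",   # route to a human reviewer, never auto-reply
    "sexual": "block",
    "violence": "escalate",
}

def policy_decision(text: str) -> tuple[str, str]:
    """Return (action, payload) where action is 'allow', 'block', or 'escalate'."""
    flagged, categories = is_toxic(text)
    if not flagged:
        return "allow", text
    for category, action in CATEGORY_ACTIONS.items():
        if categories.get(category):
            return action, f"[{action.upper()}: {category}]"
    return "escalate", "[REVIEW REQUIRED]"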
  6. If you use group chat or multi-agent routing, filter at the manager boundary too.
    That prevents one bad agent from poisoning another agent’s context window or leaking toxic content into summaries.
groupchat = autogen.GroupChat(
    agents=[assistant, user_proxy],
    messages=[],
    max_round=2,
)

manager = autogen.GroupChatManager(
    groupchat=groupchat,
    llm_config=llm_config,
)

user_proxy.initiate_chat(
    manager,
    message="Draft a customer support reply."
)
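
The group chat above wires the agents together but does not yet filter anything. Recent pyautogen releases expose a process_message_before_send hook on ConversableAgent; if your version has it, registering the sanitizer on each agent screens messages before they reach the manager or another agent's context. A sketch under that assumption:

def moderation_hook(sender, message, recipient, silent):
    # Messages can be plain strings or dicts with a "content" key.
    if isinstance(message, dict):
        content = message.get("content")
        if content:
            return {**message, "content": policy_action(content)}
        return message
    return policy_action(message)

# Register the hook on every agent that can emit text into the group chat,
# so one misbehaving agent cannot poison another agent's context window.
for agent in (assistant, user_proxy, manager):
    agent.register_hook(
        hookable_method="process_message_before_send",
        hook=moderation_hook,
    )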

Testing It

Run the script with both safe and unsafe prompts. A safe prompt should pass through unchanged, while an unsafe prompt should come back as [REDACTED: unsafe model output], [BLOCKED: ...], or [REVIEW REQUIRED] depending on your policy.

Test edge cases too. Try insults with mild profanity, explicit threats, and borderline content so you can see whether your moderation thresholds are too permissive or too aggressive.

If you’re using this in production, log three things for every response: the raw model output, the moderation decision, and the final delivered text. That gives you traceability when compliance asks why something was blocked.
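
A minimal sketch of that audit trail with the standard logging module, assuming you generate a request ID per response; where the restricted store lives (separate logger, database, SIEM) depends on your environment:

import json
import logging
import uuid

logging.basicConfig(level=logging.INFO)
audit_logger = logging.getLogger("moderation.audit")

def deliver_with_audit(raw_text: str) -> str:
    request_id = str(uuid.uuid4())
    flagged, categories = is_toxic(raw_text)
    final_text = "[REDACTED: unsafe model output]" if flagged else raw_text

    # Log the three things compliance will ask about: raw output,
    # moderation decision, and what was actually delivered.
    audit_logger.info(json.dumps({
        "request_id": request_id,
        "raw_output": raw_text,
        "flagged": flagged,
        "categories": {k: v for k, v in categories.items() if v},
        "delivered": final_text,
    }))
    return final_text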

Next Steps

  • Add structured logging with request IDs so you can trace toxic generations across multi-agent runs.
  • Replace the binary moderation check with category-specific routing rules and escalation workflows.
  • Put the filter behind a reusable middleware layer so every agent call in your codebase gets screened consistently (see the sketch below).
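
One lightweight way to build that middleware layer is a decorator applied to any function that returns model text. The sketch below assumes policy_action from step 5 and that the wrapped function returns a plain string; draft_support_reply is a hypothetical example, not part of AutoGen.

import functools
from typing import Callable

def moderated(func: Callable[..., str]) -> Callable[..., str]:
    """Decorator: run the moderation policy over whatever text func returns."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs) -> str:
        return policy_action(func(*args, **kwargs))
    return wrapper

@moderated
def draft_support_reply(customer_message: str) -> str:
    # Hypothetical example: any function that produces model text gets
    # screened identically, whether it wraps an AutoGen chat or a raw API call.
    user_proxy.initiate_chat(assistant, message=customer_message)
    return user_proxy.last_message(assistant).get("content") or ""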

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

