AutoGen Tutorial (Python): filtering toxic output for intermediate developers
This tutorial shows you how to put a toxicity filter in front of an AutoGen agent workflow so unsafe or abusive model output never reaches your users. You need this when you’re building assistants for regulated environments, customer support, internal tools, or any system where a single bad response can become a compliance or trust problem.
What You'll Need
- Python 3.10+
- pyautogen installed
- An OpenAI API key with access to a chat model
- python-dotenv if you want to load secrets from .env
- A basic AutoGen setup with at least one assistant agent and one user proxy agent
Install the packages:
pip install pyautogen python-dotenv openai
Set your API key:
export OPENAI_API_KEY="your-key-here"
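If you would rather keep the key in a .env file and load it with python-dotenv (listed in the prerequisites), here is a minimal sketch; it assumes a .env file in your working directory containing OPENAI_API_KEY:

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory into os.environ
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("OPENAI_API_KEY is not set")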
Step-by-Step
- Start by creating a normal AutoGen assistant and user proxy. The key idea is that we do not trust the assistant output directly; we inspect it before showing it to the user or passing it downstream.
import os

from autogen import AssistantAgent, UserProxyAgent

llm_config = {
    "model": "gpt-4o-mini",
    "temperature": 0,
    "api_key": os.environ["OPENAI_API_KEY"],
}

assistant = AssistantAgent(
    name="assistant",
    llm_config=llm_config,
)

user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    code_execution_config=False,
)
- Add a lightweight toxicity classifier. For production, you would likely use a moderation endpoint or an internal policy service, but this example keeps everything in Python so you can run it end-to-end. The function returns True when content should be blocked.
import os
import re

TOXIC_PATTERNS = [
    r"\bidiot\b",
    r"\bstupid\b",
    r"\bkill yourself\b",
    r"\bhate you\b",
]

def is_toxic(text: str) -> bool:
    lowered = text.lower()
    for pattern in TOXIC_PATTERNS:
        if re.search(pattern, lowered):
            return True
    return False
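A quick sanity check before wiring the matcher into the agent (the sample strings are just illustrations):

print(is_toxic("You are STUPID!"))             # True: lowercasing handles mixed case
print(is_toxic("Help me reset my password."))  # False: no pattern matches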
- Wrap the assistant call so every response gets filtered before use. This is the part most teams skip: they generate the answer and immediately print it. Instead, capture the text, run policy checks, and only then release it.
def safe_generate_reply(prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    # generate_reply may return a plain string or a message dict
    reply = assistant.generate_reply(messages=messages)
    if isinstance(reply, dict):
        content = reply.get("content") or ""
    else:
        content = str(reply)
    if is_toxic(content):
        return "[BLOCKED] Response contained disallowed language."
    return content

prompt = "Write a short reply to a frustrated customer asking for help."
print(safe_generate_reply(prompt))
- If you want stronger control, filter both incoming prompts and outgoing responses. That matters in real systems because toxicity can come from either side: user input can be abusive, and model output can echo or amplify it.
def guarded_chat(prompt: str) -> str:
    # Check the user's input first, then the model's output.
    if is_toxic(prompt):
        return "[BLOCKED] User input contained disallowed language."
    return safe_generate_reply(prompt)

tests = [
    "Help me reset my password.",
    "You are stupid and useless.",
]

for t in tests:
    print("INPUT:", t)
    print("OUTPUT:", guarded_chat(t))
- For better operational visibility, log blocked events with enough context to debug them later. In production, send these records to your app logs or SIEM instead of printing them; a sketch of that follows the code below.
from datetime import datetime

def log_block(event_type: str, text: str) -> None:
    print({
        "timestamp": datetime.utcnow().isoformat(),
        "event_type": event_type,
        "sample": text[:120],
        "blocked": True,
    })

def guarded_chat_with_logging(prompt: str) -> str:
    if is_toxic(prompt):
        log_block("input_blocked", prompt)
        return "[BLOCKED] User input contained disallowed language."
    response = safe_generate_reply(prompt)
    if response.startswith("[BLOCKED]"):
        log_block("output_blocked", response)
        return response
    return response
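If you want these records in your normal application logs rather than stdout, here is a minimal sketch using Python's standard logging module; the logger name and JSON formatting are assumptions, not part of the original example:

import json
import logging
from datetime import datetime

logger = logging.getLogger("toxicity_guard")  # logger name is an assumption
logging.basicConfig(level=logging.INFO)

def log_block_structured(event_type: str, text: str) -> None:
    # Same fields as log_block, emitted as a JSON log line instead of print().
    logger.info(json.dumps({
        "timestamp": datetime.utcnow().isoformat(),
        "event_type": event_type,
        "sample": text[:120],
        "blocked": True,
    }))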
Testing It
Run the script with one clean prompt and one toxic prompt. The clean prompt should pass through unchanged, while the toxic one should return the blocked message instead of model output.
Also test edge cases like capitalization, punctuation, and mixed wording such as "You are STUPID!" or "I hate you." If those still slip through, tighten your matcher or replace it with a moderation API that scores severity instead of relying on regex alone.
In a real AutoGen app, place this guard at every boundary where text leaves the model layer. That includes chat responses, tool outputs that get summarized by an LLM, and any agent-to-agent handoff that could expose unsafe language.
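One way to do that without copying the filter into every call site is to wrap each agent's generate_reply with the same check. The sketch below monkey-patches the agent instance; it is an illustration of the pattern, not an official AutoGen hook:

def guard_agent(agent):
    # Wrap the agent's generate_reply so every outgoing reply is checked.
    original_generate_reply = agent.generate_reply

    def guarded_generate_reply(*args, **kwargs):
        reply = original_generate_reply(*args, **kwargs)
        content = reply.get("content") or "" if isinstance(reply, dict) else str(reply)
        if content and is_toxic(content):
            return "[BLOCKED] Response contained disallowed language."
        return reply

    agent.generate_reply = guarded_generate_reply
    return agent

assistant = guard_agent(assistant)  # apply the same policy to any other agents too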
Next Steps
- Replace the regex filter with a moderation endpoint from your model provider (see the sketch after this list).
- Add severity thresholds and category-based blocking for harassment, self-harm, and sexual content.
- Put the guard behind a reusable middleware function so every agent in your AutoGen stack uses the same policy layer.
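As a starting point for the first two items, here is a minimal sketch against the OpenAI moderation endpoint; adapt the category handling and thresholds to your own policy, and check your provider's docs for current options:

from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

def is_toxic_via_moderation(text: str) -> bool:
    # result.flagged is True when any moderation category (harassment,
    # self-harm, sexual content, etc.) crosses the provider's threshold.
    result = client.moderations.create(input=text).results[0]
    return result.flagged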
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.