LangChain Tutorial (Python): filtering toxic output for advanced developers
This tutorial shows you how to put a toxicity filter in front of LangChain model output in Python, then route unsafe generations into a safe fallback. You need this when your app serves user-facing text and you cannot afford profanity, harassment, self-harm encouragement, or policy-violating content slipping through.
What You'll Need
- Python 3.10+
- A working OpenAI API key
- langchain
- langchain-openai
- langchain-community
- transformers
- torch
- Optional: python-dotenv for local env loading
Install the packages:
```bash
pip install langchain langchain-openai langchain-community transformers torch python-dotenv
```
Set your API key:
```bash
export OPENAI_API_KEY="your-key-here"
```
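If you would rather keep the key out of your shell profile, the optional python-dotenv dependency from the list above can load it at startup. A minimal sketch, assuming a `.env` file in your working directory containing `OPENAI_API_KEY=...`:

```python
# Optional: load OPENAI_API_KEY from a local .env file instead of exporting it.
# Assumes a .env file exists in the current working directory.
from dotenv import load_dotenv

load_dotenv()  # populates os.environ with any variables defined in .env
```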
Step-by-Step
- Start with a normal LangChain chat model. The key idea is to keep generation and moderation separate so you can swap the filter later without touching prompt logic.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise assistant."),
    ("user", "{question}"),
])

chain = prompt | llm
```
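Before adding any filtering, it helps to confirm what the chain actually returns, because the filter in step 3 branches on the message type. A quick sanity check (the question string here is just an example):

```python
# The chain returns an AIMessage; .content holds the generated text.
reply = chain.invoke({"question": "What does the | operator do in LangChain?"})
print(type(reply).__name__)  # AIMessage
print(reply.content)
```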
- Add a toxicity classifier. For production, use a dedicated moderation service; for a self-contained Python tutorial, a Hugging Face classifier gives you a real scoring layer you can gate on.

```python
from transformers import pipeline

# unitary/toxic-bert is a multi-label classifier; with top_k=None the pipeline
# returns a score for every label (toxic, insult, threat, and so on).
toxicity_classifier = pipeline(
    task="text-classification",
    model="unitary/toxic-bert",
    top_k=None,
)

def toxicity_score(text: str) -> float:
    # The pipeline wraps a single-string input in an outer list, so [0] selects
    # the list of per-label scores for this text.
    results = toxicity_classifier(text)[0]
    toxic_label = next(
        item for item in results if item["label"].lower() in {"toxic", "toxicity"}
    )
    return float(toxic_label["score"])
```
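During development it is worth eyeballing the raw scores the classifier produces before you trust any threshold. The strings below are illustrative inputs, not a benchmark:

```python
# Print raw toxicity scores for a couple of sample strings.
for sample in ["Have a great day!", "You are an absolute idiot."]:
    print(f"{toxicity_score(sample):.3f}  {sample!r}")
```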
- Wrap the model call with a post-generation filter. This is the simplest reliable pattern: generate once, score the output, then either return it or replace it with a safe fallback.

```python
from langchain_core.messages import AIMessage

SAFE_FALLBACK = "I can't help with that request."

def generate_with_toxicity_filter(question: str) -> str:
    response = chain.invoke({"question": question})
    # Chat models return an AIMessage; fall back to str() for other runnables.
    if isinstance(response, AIMessage):
        text = response.content
    else:
        text = str(response)
    score = toxicity_score(text)
    if score >= 0.70:  # starting threshold; tune it against your own test prompts
        return SAFE_FALLBACK
    return text
```
- Add input filtering too. If your users can submit abusive prompts, block those before they reach the LLM so you do not waste tokens generating unsafe completions from unsafe inputs.

```python
def is_toxic(text: str, threshold: float = 0.70) -> bool:
    return toxicity_score(text) >= threshold

def safe_answer(question: str) -> str:
    if is_toxic(question):
        return "I can't process abusive or toxic input."
    return generate_with_toxicity_filter(question)
```
- Test both clean and toxic paths. You want deterministic behavior here, so keep temperature at zero and inspect the classifier scores during development.

```python
tests = [
    "Explain how to reset my router.",
    "Write an insulting reply to my coworker.",
]

for q in tests:
    answer = safe_answer(q)
    print("Q:", q)
    print("A:", answer)
    print("-" * 40)
```
Testing It
Run the script and verify that benign prompts pass through unchanged while abusive prompts return the fallback string. If you want more visibility, print the raw toxicity score before the threshold check and confirm that borderline outputs are being blocked consistently.
Also test edge cases like quoted abuse, mixed-language insults, and benign prompts that contain risky vocabulary in an educational context. That tells you whether your threshold is too aggressive and whether you need allowlist rules for domains like cybersecurity or medical education.
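One way to run that check is to score a small edge-case suite directly and review the numbers before settling on a threshold. The prompts below are illustrative examples of those categories, not a benchmark:

```python
# Inspect raw scores for edge cases so you can judge whether the 0.70
# threshold is too aggressive for your domain.
edge_cases = [
    'My user wrote: "you are worthless" - how should I respond calmly?',  # quoted abuse
    "Explain what a SQL injection attack is, for a security course.",     # risky vocabulary, benign intent
    "Eres un inútil total.",                                              # non-English insult
]

for text in edge_cases:
    print(f"{toxicity_score(text):.3f}  {text}")
```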
Next Steps
- Move moderation to an async middleware layer so you can filter both streaming and non-streaming responses.
- Replace the local classifier with OpenAI moderation or another hosted policy engine for lower operational risk (see the sketch after this list).
- Add structured logging for prompt text, response text, scores, and final decision so you can audit false positives and false negatives.
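As a sketch of that second item: OpenAI's moderation endpoint returns a flagged boolean plus per-category scores, so you can swap it in behind the same boolean interface as `is_toxic`. This is a minimal example assuming the `openai` Python SDK is installed and `OPENAI_API_KEY` is set; how you map categories and scores to a block decision is up to your policy:

```python
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

def is_toxic_hosted(text: str) -> bool:
    # The moderation endpoint returns per-category flags and scores;
    # here we simply trust the top-level "flagged" decision.
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    return result.results[0].flagged
```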
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.