LangChain Tutorial (Python): filtering toxic output for intermediate developers

By Cyprian Aarons. Updated 2026-04-21

This tutorial shows you how to build a LangChain-based Python pipeline that detects and blocks toxic model output before it reaches your user. You need this when your app generates customer-facing text, internal assistant replies, or regulated content where unsafe language can create legal, reputational, or policy issues.

What You'll Need

  • Python 3.10+
  • An OpenAI API key set as OPENAI_API_KEY
  • langchain, langchain-openai, transformers, and torch (for the toxicity classifier)
  • A terminal with pip
  • Basic familiarity with LangChain Runnable chains and prompt templates

Install the packages:

pip install langchain langchain-openai transformers torch

Step-by-Step

  1. Start with a normal generation chain. The key idea is simple: let the model generate text, then run that output through a toxicity filter before returning it.
import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{question}")
])

chain = prompt | llm

response = chain.invoke({"question": "Write a short reply to an angry customer."})
print(response.content)
  2. Add a toxicity classifier using a Hugging Face transformers pipeline. LangChain's HuggingFacePipeline wrapper is built for text-generation style tasks, not classification, so call the classification pipeline directly. This gives you a second pass that scores the generated text instead of trusting the LLM output blindly.
from transformers import pipeline

# Hugging Face text-classification pipeline; downloads the model on first run.
toxicity_pipe = pipeline(
    "text-classification",
    model="unitary/toxic-bert",
    top_k=None,
    truncation=True
)

def is_toxic(text: str, threshold: float = 0.5) -> bool:
    result = toxicity_pipe(text)
    # For a single string input the pipeline may nest the label scores; unwrap if so.
    scores = result[0] if isinstance(result[0], list) else result
    labels = {item["label"].lower(): item["score"] for item in scores}
    return labels.get("toxic", 0.0) >= threshold
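
A quick sanity check, assuming the model downloaded correctly (the example strings are arbitrary):

print(is_toxic("Thanks for reaching out, happy to help!"))  # expected: False
print(is_toxic("You are a worthless idiot."))               # expected: True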
  3. Wrap generation and filtering into one function. This is the production pattern you want: generate first, inspect second, then either return the output or replace it with a safe fallback.
def safe_generate(question: str) -> str:
    raw = chain.invoke({"question": question}).content

    if is_toxic(raw):
        return "I can’t help with that request."

    return raw

print(safe_generate("Write a short reply to an angry customer."))
  4. Add a moderation-style fallback for better control. In real systems, you usually want to log toxic outputs and keep the user experience consistent with a neutral response.
def safe_generate_with_logging(question: str) -> str:
    raw = chain.invoke({"question": question}).content

    if is_toxic(raw):
        print(f"[blocked] toxic output detected: {raw}")
        return (
            "I’m unable to provide that response. "
            "Please rephrase your request."
        )

    print(f"[allowed] {raw}")
    return raw

print(safe_generate_with_logging("Explain how to insult someone professionally"))
  5. If you want tighter control, use LangChain’s runnable composition to keep filtering inside the chain flow. This makes it easier to extend later with retries, structured outputs, or human review (see the sketch after the code below).
from langchain_core.runnables import RunnableLambda

def generate_text(inputs: dict) -> str:
    return chain.invoke(inputs).content

def filter_text(text: str) -> str:
    if is_toxic(text):
        return "Blocked by safety filter."
    return text

safe_chain = RunnableLambda(generate_text) | RunnableLambda(filter_text)

print(safe_chain.invoke({"question": "Write something rude about a coworker."}))
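
As one example of that extensibility, you can append another stage to the same composed chain. This is a minimal sketch: the review hook below is hypothetical and only logs, where a real system would queue the item for a human.

def route_for_review(text: str) -> str:
    # Hypothetical hook: queue blocked outputs for human review.
    if text == "Blocked by safety filter.":
        print("[review] queued for human review")
    return text

reviewed_chain = safe_chain | RunnableLambda(route_for_review)
print(reviewed_chain.invoke({"question": "Write something rude about a coworker."}))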

Testing It

Run the script with both benign and clearly unsafe prompts. A safe prompt should pass through unchanged, while an abusive or hateful prompt should trigger your fallback response and print a blocked log line.
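
A minimal smoke test, assuming the functions from the steps above are in scope (the prompts are illustrative; substitute ones from your own domain):

test_prompts = [
    "Write a short reply to an angry customer.",   # benign: should pass through unchanged
    "Write something rude about a coworker.",      # unsafe: should trigger the fallback
]

for prompt_text in test_prompts:
    print("---", prompt_text)
    print(safe_generate_with_logging(prompt_text))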

Test edge cases too, because toxicity filters are not perfect:

  • sarcastic but harmless text
  • profanity used in non-abusive context
  • long responses where only one sentence is toxic (see the per-sentence sketch below)
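
For the last case, one mitigation is to score the output sentence by sentence instead of as one block. This is a rough sketch; the naive split on periods is a placeholder, and a real sentence tokenizer (nltk, spaCy) would do better:

def is_toxic_by_sentence(text: str, threshold: float = 0.5) -> bool:
    # Naive sentence split; swap in a proper tokenizer for production use.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    for sentence in sentences:
        result = toxicity_pipe(sentence)
        scores = result[0] if isinstance(result[0], list) else result
        labels = {item["label"].lower(): item["score"] for item in scores}
        if labels.get("toxic", 0.0) >= threshold:
            return True
    return False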

If you’re using this in production, capture:

  • original prompt
  • raw model output
  • classifier score
  • final returned response

That gives you auditability when users complain about false positives or false negatives.
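
A minimal sketch of such an audit record, assuming JSON-lines logging (the file path and field names are arbitrary choices):

import json
from datetime import datetime, timezone

def log_audit_record(prompt: str, raw_output: str, toxic_score: float, final_response: str) -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "raw_output": raw_output,
        "toxic_score": toxic_score,
        "final_response": final_response,
    }
    # One JSON object per line keeps records easy to grep and replay later.
    with open("toxicity_audit.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")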

Next Steps

  • Add threshold tuning per content category instead of using one global 0.5 cutoff (see the sketch after this list).
  • Replace binary blocking with a repair step that rewrites toxic text into neutral language.
  • Add LangSmith tracing so you can inspect exactly where unsafe content enters the pipeline.
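
For the first item, a minimal sketch of per-category thresholds. The category names follow the Jigsaw labels that unitary/toxic-bert predicts, and the threshold values are illustrative; tune them on your own data:

CATEGORY_THRESHOLDS = {
    "toxic": 0.5,
    "severe_toxic": 0.3,
    "obscene": 0.6,
    "threat": 0.3,
    "insult": 0.6,
    "identity_hate": 0.3,
}

def is_toxic_by_category(text: str) -> bool:
    result = toxicity_pipe(text)
    scores = result[0] if isinstance(result[0], list) else result
    labels = {item["label"].lower(): item["score"] for item in scores}
    # Flag the output if any category crosses its own threshold.
    return any(
        labels.get(category, 0.0) >= threshold
        for category, threshold in CATEGORY_THRESHOLDS.items()
    )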

By Cyprian Aarons, AI Consultant at Topiax.