LangGraph Tutorial (Python): filtering toxic output for advanced developers

By Cyprian Aarons · Updated 2026-04-22

This tutorial shows how to build a LangGraph pipeline in Python that detects toxic model output, routes it through a moderation node, and either blocks it or substitutes a safe fallback before it reaches the user. You need this when an LLM agent can generate unsafe language and you want deterministic control over what leaves your system.

What You'll Need

  • Python 3.10+
  • langgraph
  • langchain-openai
  • openai API key
  • python-dotenv for local development
  • A working terminal and virtual environment

Install the packages:

pip install langgraph langchain-openai openai python-dotenv

Set your API key:

export OPENAI_API_KEY="your-key"
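
If you prefer to keep the key in a .env file for local development (which is why python-dotenv is in the list above), a minimal sketch: load it once at startup, before constructing the LLM.

# Contents of .env (keep it out of version control):
# OPENAI_API_KEY=your-key

from dotenv import load_dotenv

# Reads .env and puts OPENAI_API_KEY into the process environment,
# where langchain-openai will pick it up automatically.
load_dotenv()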

Step-by-Step

  1. Start with a minimal graph state that carries the user input, raw model output, toxicity flag, and final response. Keep the state explicit; that makes moderation logic easy to test and audit.
from typing import TypedDict

class GraphState(TypedDict):
    user_input: str
    raw_output: str
    is_toxic: bool
    final_output: str
  2. Build a generation node that calls an LLM. For production filtering, don’t trust the first response; always pass it through a separate moderation step.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def generate_response(state: GraphState) -> dict:
    prompt = (
        "Answer the user's request clearly and concisely.\n"
        f"User: {state['user_input']}"
    )
    response = llm.invoke(prompt)
    return {"raw_output": response.content}
  3. Add a deterministic toxicity check. This example uses a simple keyword-based filter so the tutorial is executable as written without extra dependencies; in production, replace it with a classifier or moderation API (a hosted option is sketched after this step's code).
TOXIC_TERMS = {
    "idiot",
    "stupid",
    "hate you",
    "kill yourself",
    "moron",
}

def moderate_output(state: GraphState) -> dict:
    text = state["raw_output"].lower()
    is_toxic = any(term in text for term in TOXIC_TERMS)
    return {"is_toxic": is_toxic}
  4. Route based on the moderation result. If the output is toxic, block it or replace it with a safe fallback. If it passes, return it unchanged.
def route_after_moderation(state: GraphState) -> str:
    return "block" if state["is_toxic"] else "allow"

def block_toxic_output(state: GraphState) -> dict:
    return {
        "final_output": (
            "I can't provide that response because it violates safety policy."
        )
    }

def allow_output(state: GraphState) -> dict:
    return {"final_output": state["raw_output"]}
  5. Wire everything together with LangGraph. This graph has one generation node, one moderation node, and two terminal branches.
from langgraph.graph import StateGraph, START, END

graph = StateGraph(GraphState)

graph.add_node("generate", generate_response)
graph.add_node("moderate", moderate_output)
graph.add_node("block", block_toxic_output)
graph.add_node("allow", allow_output)

graph.add_edge(START, "generate")
graph.add_edge("generate", "moderate")
graph.add_conditional_edges(
    "moderate",
    route_after_moderation,
    {
        "block": "block",
        "allow": "allow",
    },
)
graph.add_edge("block", END)
graph.add_edge("allow", END)

app = graph.compile()
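
Before running anything, you can sanity-check the wiring by printing the compiled graph. This assumes your langgraph version exposes get_graph() and draw_mermaid() on the compiled app:

# Prints a Mermaid diagram: START -> generate -> moderate -> block/allow -> END.
print(app.get_graph().draw_mermaid())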
  6. Run the graph with a sample input and inspect both the raw and filtered outputs. This gives you an auditable trail for debugging moderation failures (a streaming variant that logs each node's update is sketched after this code).
if __name__ == "__main__":
    result = app.invoke(
        {
            "user_input": "Write a polite reply to an angry customer.",
            "raw_output": "",
            "is_toxic": False,
            "final_output": "",
        }
    )

    print("RAW OUTPUT:")
    print(result["raw_output"])
    print("\nIS TOXIC:")
    print(result["is_toxic"])
    print("\nFINAL OUTPUT:")
    print(result["final_output"])

Testing It

Run the script once with a normal prompt and confirm final_output matches raw_output. Then test with prompts likely to produce abusive language and verify the graph replaces the content with the safe fallback instead of returning it directly.

For deeper validation, log every state transition so you can see exactly where filtering happened. In regulated environments, I also recommend unit tests for route_after_moderation, because routing bugs are easier to catch there than in end-to-end runs.
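
A minimal pytest sketch for the routing function; the module name pipeline is a placeholder for wherever the tutorial code lives:

from pipeline import route_after_moderation  # hypothetical module name

def test_toxic_output_routes_to_block():
    state = {
        "user_input": "x",
        "raw_output": "you are an idiot",
        "is_toxic": True,
        "final_output": "",
    }
    assert route_after_moderation(state) == "block"

def test_clean_output_routes_to_allow():
    state = {
        "user_input": "x",
        "raw_output": "Happy to help with that.",
        "is_toxic": False,
        "final_output": "",
    }
    assert route_after_moderation(state) == "allow"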

A good test matrix includes:

  • benign customer support requests
  • adversarial prompts trying to force insults
  • long outputs where toxic terms appear late in the response

Next Steps

  • Replace the keyword filter with a real moderation model or policy classifier.
  • Add pre-generation prompt filtering so toxic user input never reaches the LLM (see the sketch after this list).
  • Persist moderation decisions in your observability stack for audit and incident review.
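
Here is a sketch of that pre-generation filter, reusing GraphState, TOXIC_TERMS, and the block node from the tutorial. The screen node and function names are illustrative; the key change is wiring START to the screening node instead of directly to generate when you build the graph.

def screen_input(state: GraphState) -> dict:
    # Run the same keyword check on the user's input before any LLM call.
    text = state["user_input"].lower()
    return {"is_toxic": any(term in text for term in TOXIC_TERMS)}

def route_after_screen(state: GraphState) -> str:
    return "block" if state["is_toxic"] else "generate"

graph.add_node("screen", screen_input)

# Replace the direct START -> generate edge with a screening step.
graph.add_edge(START, "screen")
graph.add_conditional_edges(
    "screen",
    route_after_screen,
    {
        "block": "block",
        "generate": "generate",
    },
)

app = graph.compile()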

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
