LlamaIndex Tutorial (Python): filtering toxic output for advanced developers
This tutorial shows you how to add a toxic-output filter to a LlamaIndex-powered Python app, so unsafe generations can be detected before they reach users. You need this when your agent answers customer-facing questions, drafts regulated content, or sits behind a workflow where abusive, hateful, or self-harm content is unacceptable.
What You'll Need
- Python 3.10+
- A virtual environment
- llama-index
- An LLM provider API key (for OpenAI, set OPENAI_API_KEY)
- Basic familiarity with VectorStoreIndex, QueryEngine, and ResponseSynthesizer
- Optional but useful: a local moderation model or a moderation API if you want stronger policy checks
Install the packages:
pip install llama-index openai
Step-by-Step
- First, build a normal LlamaIndex query pipeline. The point is not to change retrieval; it’s to insert a guardrail after generation so you can inspect the final answer before returning it.
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
# Make sure your OpenAI key is set in the environment
assert os.getenv("OPENAI_API_KEY"), "Set OPENAI_API_KEY before running this script"
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("Summarize the policy for customer complaints.")
print(response)
- Next, define a toxicity classifier. For production systems, keep this separate from your main generation path so you can swap in different policies without touching retrieval code.
from dataclasses import dataclass

# Simple keyword list; replace with a real moderation model in production
TOXIC_PATTERNS = [
    "kill yourself",
    "i hate you",
    "stupid",
    "idiot",
    "racial slur",
    "sexist",
]

@dataclass
class ToxicityResult:
    toxic: bool
    matched_pattern: str | None = None

def detect_toxicity(text: str) -> ToxicityResult:
    lowered = text.lower()
    for pattern in TOXIC_PATTERNS:
        if pattern in lowered:
            return ToxicityResult(toxic=True, matched_pattern=pattern)
    return ToxicityResult(toxic=False)
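Before wiring the classifier into the pipeline, it's worth confirming it behaves as expected on its own. This is just a quick sanity check against the function defined above.

# Quick sanity check for the classifier in isolation
print(detect_toxicity("Please summarize the refund process."))
# -> ToxicityResult(toxic=False, matched_pattern=None)
print(detect_toxicity("You are an idiot."))
# -> ToxicityResult(toxic=True, matched_pattern='idiot')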
- Wrap the query engine with a post-generation filter. This keeps your app simple: generate first, inspect second, and replace unsafe output with a safe refusal or fallback.
from typing import Any

SAFE_FALLBACK = (
    "I can't provide that response. "
    "Please rephrase your request or ask for help with a safe alternative."
)

def guarded_query(query_engine: Any, prompt: str) -> str:
    response = query_engine.query(prompt)
    text = str(response)
    result = detect_toxicity(text)
    if result.toxic:
        print(f"Blocked toxic output due to pattern: {result.matched_pattern}")
        return SAFE_FALLBACK
    return text
answer = guarded_query(query_engine, "Write an angry reply to the customer.")
print(answer)
- Add input filtering too. Output filtering is necessary, but it is not enough; if the user prompt is toxic, you should block early and avoid sending that request into your agent stack at all.
def guarded_app(query_engine: Any, prompt: str) -> str:
    # Check the user's prompt first, then fall back to the output guard
    input_check = detect_toxicity(prompt)
    if input_check.toxic:
        return SAFE_FALLBACK
    return guarded_query(query_engine, prompt)
print(guarded_app(query_engine, "Explain our refund policy politely."))
print(guarded_app(query_engine, "You are stupid."))
- If you want stricter control, move from keyword matching to structured moderation decisions. In practice, you’ll often combine a rules layer with an LLM-based classifier or a provider moderation endpoint so borderline content gets reviewed instead of blindly passed through; a sketch of wiring in a provider moderation check follows the example below.
def classify_risk(text: str) -> dict[str, object]:
    result = detect_toxicity(text)
    return {
        "toxic": result.toxic,
        "reason": result.matched_pattern,
        "action": "block" if result.toxic else "allow",
    }
samples = [
    "Please summarize the refund process.",
    "You are an idiot.",
]

for sample in samples:
    print(sample)
    print(classify_risk(sample))
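The rules layer above is intentionally simple. One way to add the provider-side check mentioned earlier is OpenAI's moderation endpoint; the sketch below is a minimal example assuming the openai>=1.0 Python SDK, and the model name is an assumption you should verify against your provider's current docs.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_risk_with_moderation(text: str) -> dict[str, object]:
    # Rules layer first: cheap, deterministic, easy to audit
    rules = detect_toxicity(text)
    if rules.toxic:
        return {"toxic": True, "reason": rules.matched_pattern, "action": "block"}

    # Provider moderation as a second opinion on borderline content
    # (model name is an assumption; check your provider's docs)
    moderation = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    flagged = moderation.results[0].flagged
    return {
        "toxic": flagged,
        "reason": "provider_moderation" if flagged else None,
        "action": "block" if flagged else "allow",
    }

Keeping the keyword rules first means obviously unsafe text never costs an extra API call; the provider check only runs on content the rules layer lets through.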
Testing It
Run three tests: one benign prompt, one toxic user prompt, and one prompt likely to trigger unsafe generation from your downstream model. The benign request should pass through unchanged, while toxic inputs should return your fallback message instead of raw model output.
Also test against real documents in ./data, because retrieved context can influence the model toward unsafe phrasing even when the user prompt is clean. In production, log blocked prompts and blocked generations separately so you can tune thresholds and review false positives.
A good smoke test is to temporarily inject a known toxic string into your document corpus and confirm the post-generation filter catches it before the answer leaves your service.
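If you want those three tests as a repeatable script rather than ad-hoc prompts, a minimal sketch could look like this; the prompts are illustrative placeholders, so adjust them to match your corpus and policy.

# Minimal smoke tests for the guardrail; prompts are illustrative placeholders
TEST_CASES = [
    ("Explain our refund policy politely.", "allow"),   # benign: should pass through
    ("You are stupid.", "block"),                        # toxic input: blocked before generation
    ("Write an angry, insulting reply to the customer.", "depends"),  # depends on what the model generates
]

for prompt, expected in TEST_CASES:
    answer = guarded_app(query_engine, prompt)
    outcome = "block" if answer == SAFE_FALLBACK else "allow"
    print(f"expected={expected:7} got={outcome:5} prompt={prompt!r}")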
Next Steps
- Replace keyword matching with a real moderation model or provider moderation API.
- Add structured logging around blocked prompts, blocked outputs, and user/session IDs.
- Extend the guardrail to inspect retrieved chunks before synthesis, not just final responses; see the sketch below.
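For that last item, the same detect_toxicity check can run over retrieved chunks before they ever reach the synthesizer. The sketch below is a rough starting point that assumes the standard LlamaIndex retriever interface (index.as_retriever and retrieve); verify the exact node accessors against the llama-index version you have installed.

def guarded_retrieval(index, prompt: str):
    # Retrieve chunks without synthesizing an answer yet
    retriever = index.as_retriever(similarity_top_k=3)
    nodes = retriever.retrieve(prompt)

    # Drop any chunk that trips the toxicity check before synthesis
    safe_nodes = []
    for node_with_score in nodes:
        check = detect_toxicity(node_with_score.node.get_content())
        if check.toxic:
            print(f"Dropping retrieved chunk (pattern: {check.matched_pattern})")
            continue
        safe_nodes.append(node_with_score)
    return safe_nodes

clean_chunks = guarded_retrieval(index, "Summarize the policy for customer complaints.")
print(f"{len(clean_chunks)} chunks passed the pre-synthesis check")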
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.