LlamaIndex Tutorial (Python): filtering toxic output for beginners
This tutorial shows you how to add a toxicity filter to a LlamaIndex-powered Python app so unsafe model output gets blocked or rewritten before your user sees it. You need this when you’re building chatbots, internal copilots, or support tools and want a simple guardrail for profanity, harassment, or other toxic language.
What You'll Need
- Python 3.10+
- A virtual environment
- llama-index
- openai if you want to use an OpenAI-backed LLM
- An OpenAI API key set as OPENAI_API_KEY
- Basic familiarity with VectorStoreIndex, QueryEngine, and LlamaIndex callbacks
Install the packages:
pip install llama-index openai
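If you're unsure whether the key from the prerequisites is actually visible to Python, a quick check saves debugging later. This snippet is just a convenience and not part of LlamaIndex:

import os

# Fail fast if OPENAI_API_KEY isn't set in this process's environment.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY before running the examples below.")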
Step-by-Step
1. Start by loading a small document set and building a normal query engine. This gives you a baseline before adding any filtering logic.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the most important points.")
print(response)
2. Next, add a toxicity detector that runs on the final answer text. For beginners, the simplest reliable pattern is a keyword-based filter with a denylist and a fallback message.
TOXIC_PATTERNS = [
    "idiot",
    "stupid",
    "hate you",
    "kill yourself",
    "moron",
]

def is_toxic(text: str) -> bool:
    lowered = text.lower()
    return any(pattern in lowered for pattern in TOXIC_PATTERNS)

def filter_toxic_output(text: str) -> str:
    if is_toxic(text):
        return "I can't provide that response."
    return text
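Before wiring this into LlamaIndex, a quick sanity check in a REPL helps; the example strings below are made up for illustration:

# The denylisted word should trigger the fallback; clean text passes through.
print(filter_toxic_output("What an idiot."))             # "I can't provide that response."
print(filter_toxic_output("Here are the key points."))   # unchanged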
3. Wrap the query engine so every response passes through your filter before returning to the caller. This keeps the guardrail in one place and avoids scattering checks across your app.
from llama_index.core.base.response.schema import Response

class SafeQueryEngine:
    def __init__(self, query_engine):
        self.query_engine = query_engine

    def query(self, prompt: str) -> Response:
        response = self.query_engine.query(prompt)
        safe_text = filter_toxic_output(str(response))
        response.response = safe_text
        return response

safe_query_engine = SafeQueryEngine(query_engine)
result = safe_query_engine.query("Write a rude insult about my coworker.")
print(result.response)
4. If you want better coverage than keywords, add an LLM-based moderation pass. This is useful when toxic content is subtle, but keep it as a second layer because it adds latency and cost.
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

def llm_moderate(text: str) -> bool:
    prompt = (
        "Classify this text as toxic or safe. "
        "Reply with only TOXIC or SAFE.\n\n"
        f"Text: {text}"
    )
    verdict = llm.complete(prompt).text.strip().upper()
    return verdict == "TOXIC"

def filter_with_llm(text: str) -> str:
    if llm_moderate(text):
        return "I can't provide that response."
    return text
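You can spot-check the moderation call on its own before layering it in. Each call makes one LLM request, and the verdict depends on the model, so treat the expected outputs here as likely rather than guaranteed:

print(filter_with_llm("Thanks for the detailed report."))      # should pass through unchanged
print(filter_with_llm("You are a worthless, hateful idiot."))  # should return the fallback message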
5. Combine both layers for production-style behavior: fast keyword screening first, then LLM moderation only when needed. That keeps your common path cheap and makes the system easier to reason about.
def layered_filter(text: str) -> str:
    if is_toxic(text):
        return "I can't provide that response."
    if len(text) > 200 and llm_moderate(text):
        return "I can't provide that response."
    return text

class LayeredSafeQueryEngine:
    def __init__(self, query_engine):
        self.query_engine = query_engine

    def query(self, prompt: str):
        response = self.query_engine.query(prompt)
        response.response = layered_filter(str(response))
        return response

layered_safe_query_engine = LayeredSafeQueryEngine(query_engine)
print(layered_safe_query_engine.query("Give me an insulting reply.").response)
Testing It
Test with three kinds of prompts: normal business questions, obviously toxic prompts, and borderline phrasing that should not be blocked. You want to confirm the safe path returns real answers while toxic output gets replaced with your fallback message.
A quick manual test is enough at first:
tests = [
    "Summarize this document.",
    "Write an insult calling someone stupid.",
    "Explain why rude language is harmful.",
]

for prompt in tests:
    result = layered_safe_query_engine.query(prompt).response
    print(f"\nPROMPT: {prompt}\nRESPONSE: {result}")
Once real user traffic flows through this, log blocked responses and review false positives. That tells you whether your denylist is too aggressive or too lenient for your domain.
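A minimal way to do that, assuming Python's standard logging module and no particular observability stack, is to log whenever the fallback message is about to be returned. The function name here is illustrative:

import logging

logger = logging.getLogger("toxicity_filter")

def layered_filter_with_logging(text: str) -> str:
    # Same layering as before, but record what was blocked so you can
    # review false positives later.
    if is_toxic(text) or (len(text) > 200 and llm_moderate(text)):
        logger.warning("Blocked response: %r", text[:200])
        return "I can't provide that response."
    return text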
Next Steps
- Add input filtering too, so toxic prompts never reach the retriever or LLM (sketched below).
- Replace keyword matching with a proper moderation model or policy classifier (also sketched below).
- Store moderation events in your observability stack so compliance teams can audit them.
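As a sketch of the first two ideas, here is a wrapper that screens the prompt before retrieval and swaps the keyword denylist for the openai SDK's moderation endpoint. The class and function names are illustrative, and the moderation call assumes openai 1.x:

from openai import OpenAI as OpenAIClient  # aliased to avoid clashing with llama_index's OpenAI
from llama_index.core.base.response.schema import Response

client = OpenAIClient()  # reads OPENAI_API_KEY from the environment

def is_flagged(text: str) -> bool:
    # Hosted moderation model instead of the keyword denylist.
    result = client.moderations.create(input=text)
    return result.results[0].flagged

class InputFilteredQueryEngine:
    def __init__(self, query_engine):
        self.query_engine = query_engine

    def query(self, prompt: str):
        # Screen the prompt before it ever reaches the retriever or LLM.
        if is_flagged(prompt):
            return Response(response="I can't help with that request.")
        response = self.query_engine.query(prompt)
        # Reuse the same moderation call on the output side.
        if is_flagged(str(response)):
            response.response = "I can't provide that response."
        return response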
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.