LlamaIndex Tutorial (Python): building a RAG pipeline for advanced developers
This tutorial builds a production-shaped Retrieval-Augmented Generation pipeline with LlamaIndex in Python: ingest documents, index them, retrieve relevant chunks, and answer questions with citations. You’d use this when basic “chat with PDFs” demos are not enough and you need a structure you can extend for reranking, metadata filtering, and more reliable retrieval.
What You'll Need
- Python 3.10+
- llama-index
- llama-index-llms-openai
- llama-index-embeddings-openai
- An OpenAI API key
- A local dataset or a folder of .txt, .md, or .pdf files
- Optional but useful:
  - pydantic
  - chromadb if you want persistent vector storage later
Install the core packages:
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai
Set your API key:
export OPENAI_API_KEY="your-key-here"
Step-by-Step
- Start by loading documents from disk and splitting them into manageable chunks. For RAG, chunking matters more than most people think: too large and retrieval gets noisy, too small and you lose context.
from llama_index.core import SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
documents = SimpleDirectoryReader("./data").load_data()
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)
print(f"Loaded {len(documents)} documents")
- Build a vector index from those documents. This is the retrieval layer: it converts text into embeddings and stores them so the query engine can fetch relevant context fast.
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
# Persist later if needed:
# index.storage_context.persist(persist_dir="./storage")
print("Index built successfully")
- Create a retriever and inspect what it returns before wiring in generation. Advanced RAG work starts here: if retrieval is weak, the answer quality will be weak no matter how good your model is.
retriever = index.as_retriever(similarity_top_k=3)
query = "What does the policy say about document retention?"
nodes = retriever.retrieve(query)
for i, node in enumerate(nodes, start=1):
print(f"\nResult {i}:")
print(node.score)
print(node.node.get_text()[:400])
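If the retriever keeps surfacing marginal chunks, one lightweight guardrail is a similarity cutoff applied as a postprocessor. This is an optional sketch; the 0.7 threshold is an arbitrary example and should be calibrated on your own corpus.
from llama_index.core.postprocessor import SimilarityPostprocessor
# Drop weak matches before they reach the LLM; 0.7 is an example value only
strong_nodes = SimilarityPostprocessor(similarity_cutoff=0.7).postprocess_nodes(nodes)
print(f"{len(strong_nodes)} of {len(nodes)} results passed the cutoff")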
- Add a query engine with source citations. This turns retrieval into an answerable workflow and gives you traceability, which matters when users ask where a claim came from.
query_engine = index.as_query_engine(
    similarity_top_k=3,
    response_mode="compact",
)
response = query_engine.query("Summarize the retention policy in 3 bullets.")
print(response)
if hasattr(response, "source_nodes"):
    for i, source in enumerate(response.source_nodes, start=1):
        print(f"\nSource {i}: score={source.score}")
        print(source.node.metadata)
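If you want citation markers inside the generated answer itself, not just source nodes attached to the response, LlamaIndex also ships a CitationQueryEngine. The snippet below is a sketch of that variant; the citation_chunk_size value is illustrative.
from llama_index.core.query_engine import CitationQueryEngine
# Sketch: inserts [1]-style citation markers into the answer text
citation_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=3,
    citation_chunk_size=512,  # illustrative value; tune for your documents
)
print(citation_engine.query("Summarize the retention policy in 3 bullets."))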
- Add metadata-aware filtering so your RAG pipeline can separate content by source type, department, or document class. This is the first step toward enterprise-grade retrieval instead of one giant pile of text.
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters
filters = MetadataFilters(filters=[
    MetadataFilter(key="file_name", value="policy.md")
])
filtered_retriever = index.as_retriever(
    similarity_top_k=5,
    filters=filters,
)
filtered_nodes = filtered_retriever.retrieve("What are the retention rules?")
for node in filtered_nodes:
    print(node.node.metadata)
    print(node.node.get_text()[:200])
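For filters beyond file names, attach your own metadata when documents are loaded. The sketch below uses SimpleDirectoryReader's file_metadata hook; the department tagging rule is purely illustrative, and you would replace it with whatever classification your sources need.
# Sketch: tag documents with custom metadata at load time so you can filter on it later
def tag_file(file_path: str) -> dict:
    # Illustrative rule: anything with "policy" in its name belongs to "legal"
    department = "legal" if "policy" in file_path.lower() else "general"
    return {"file_path": file_path, "department": department}

documents = SimpleDirectoryReader("./data", file_metadata=tag_file).load_data()
# Rebuild the index, then filter with MetadataFilter(key="department", value="legal")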
Testing It
Run a few queries that should clearly map to specific parts of your source documents. You want to see relevant chunks returned before you even look at the final generated answer.
Check that the response includes source nodes and that those sources actually contain the facts being stated. If the model hallucinates or cites irrelevant chunks, fix chunking, metadata quality, or retrieval settings before touching prompts.
Try one broad question and one narrow question. Broad questions test synthesis; narrow questions test whether retrieval can pull exact facts from the right chunk.
If results are unstable across runs, make sure your model temperature is set to zero and your documents are cleanly parsed. Bad OCR output or messy markdown will poison retrieval quickly.
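A smoke test along those lines can be as small as a loop over a couple of representative queries. The questions below are placeholders; swap in ones that map to your own documents.
# Placeholder queries: one broad (synthesis), one narrow (exact recall)
test_queries = [
    "Summarize the retention policy in 3 bullets.",
    "How long are audit logs retained?",
]
for q in test_queries:
    response = query_engine.query(q)
    print(f"\nQ: {q}\nA: {response}")
    for source in getattr(response, "source_nodes", []):
        print(f"  source score={source.score} file={source.node.metadata.get('file_name')}")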
Next Steps
- Add a reranker such as a cross-encoder or LLM-based reranking layer to improve top-k precision (see the sketch after this list).
- Persist the index with a real vector store like Chroma or PostgreSQL for repeatable deployments.
- Add evaluation with labeled Q&A pairs so you can measure retrieval hit rate and answer faithfulness.
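As a starting point for the reranking item above, here is a rough sketch using LlamaIndex's built-in LLM-based reranker as a node postprocessor; the top_n and choice_batch_size values are examples, not recommendations.
from llama_index.core.postprocessor import LLMRerank
# Sketch: over-retrieve candidates, then let the configured LLM rerank down to top_n
reranking_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[LLMRerank(top_n=3, choice_batch_size=5)],
    response_mode="compact",
)
print(reranking_engine.query("Summarize the retention policy in 3 bullets."))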
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.