LlamaIndex Tutorial (Python): building a RAG pipeline for intermediate developers

By Cyprian Aarons. Updated 2026-04-21

This tutorial builds a working Retrieval-Augmented Generation (RAG) pipeline with LlamaIndex in Python, from document loading to query-time retrieval and answer generation. You’d use this when you need grounded answers over your own documents instead of relying on a model’s general knowledge.

What You'll Need

  • Python 3.10+
  • A virtual environment
  • llama-index
  • An OpenAI API key for both the LLM and the embedding model, exported as OPENAI_API_KEY in your environment
  • A small document set to index, such as PDFs or text files

Install the packages first:

pip install llama-index llama-index-llms-openai llama-index-embeddings-openai pypdf

Set your API key:

export OPENAI_API_KEY="your-key-here"

Step-by-Step

  1. Start by loading your source documents. For a first pass, keep it simple: put a few .txt files in a local folder and let LlamaIndex read them into memory.
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
print(f"Loaded {len(documents)} documents")
print(documents[0].text[:500])
  2. Next, configure the LLM and embedding model explicitly. This makes the pipeline predictable in production and avoids relying on hidden defaults.
import os
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Fail fast if the key is missing rather than erroring mid-pipeline.
if "OPENAI_API_KEY" not in os.environ:
    raise EnvironmentError("Set OPENAI_API_KEY before running this script")

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
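
Chunking has hidden defaults too, and chunk size directly affects retrieval quality. A minimal sketch using the same Settings object (the 512/50 values are illustrative starting points, not recommendations):

from llama_index.core.node_parser import SentenceSplitter

# Pin the chunking strategy explicitly rather than relying on the default
# splitter configuration; tune these placeholder values for your corpus.
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)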
  3. Build the index from your documents. Under the hood, LlamaIndex chunks the text, embeds each chunk, and stores it in a vector index for retrieval.
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./storage")
print("Index built and persisted")
  4. Create a query engine and ask a question. This is where retrieval happens: relevant chunks are fetched first, then passed to the LLM to produce an answer grounded in your data.
query_engine = index.as_query_engine(similarity_top_k=3)

response = query_engine.query(
    "What are the main topics covered in these documents?"
)

print(response)
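
The response object also keeps the chunks that were retrieved to produce the answer, which gives you lightweight traceability without a separate retrieval pass:

# Each source node is a retrieved chunk plus its similarity score,
# so you can see exactly what grounded the answer above.
for source in response.source_nodes:
    print(source.node.get_content()[:200])
    print(f"Score: {source.score}")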
  5. If you want better control over context size and traceability, inspect retrieved nodes before generating the final answer. This is useful when debugging bad retrieval or hallucinated answers.
retriever = index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve("What are the main topics covered in these documents?")

for i, node in enumerate(nodes, start=1):
    print(f"\n--- Match {i} ---")
    print(node.node.get_content()[:400])
    print(f"Score: {node.score:.4f}")
  6. Add a chat-style interface only after retrieval works well. For most RAG systems, query quality matters more than chat memory at the start.
chat_engine = index.as_chat_engine(chat_mode="condense_question", similarity_top_k=3)

print(chat_engine.chat("Summarize the key ideas from the documents"))
print(chat_engine.chat("Which part discusses implementation details?"))

Testing It

Run the script against a small folder of plain text files first. You should see documents load successfully, an index persist to disk, and answers that quote or reflect content from your files rather than generic model output.

If retrieval looks weak, inspect the top matches with retrieve() before blaming generation. In practice, most bad RAG behavior comes from poor chunking, weak embeddings, or irrelevant source content.

A good sanity check is to ask a question that only exists in one document. If the response can point to that specific topic without drifting into unrelated text, your pipeline is working.
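
As a concrete version of that check (the question and file name below are placeholders for your own data), print which file each supporting chunk came from, using the file_name metadata that SimpleDirectoryReader attaches by default:

# "billing-policy.txt" stands in for whichever file uniquely covers the topic.
response = query_engine.query("What does the billing policy say about refunds?")
for source in response.source_nodes:
    print(source.node.metadata.get("file_name"), source.score)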

Next Steps

  • Add metadata filtering so you can scope retrieval by customer, policy type, region, or date (see the sketch after this list).
  • Replace SimpleDirectoryReader with loaders for PDFs, HTML pages, SharePoint exports, or S3 objects.
  • Tune chunk size and overlap, then compare retrieval quality with different embedding models.
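
For the metadata filtering item, a minimal sketch (the "department" key and "claims" value are hypothetical; use whatever metadata your loader attaches or that you add at ingestion):

from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter

# Restrict retrieval to nodes whose metadata matches the filter exactly;
# "department" is an example field you would set when ingesting documents.
filters = MetadataFilters(filters=[ExactMatchFilter(key="department", value="claims")])
scoped_engine = index.as_query_engine(similarity_top_k=3, filters=filters)
print(scoped_engine.query("Summarize the claims documents"))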

By Cyprian Aarons, AI Consultant at Topiax.
