LangChain Tutorial (Python): building a RAG pipeline for advanced developers
This tutorial builds a production-grade Retrieval-Augmented Generation (RAG) pipeline in Python with LangChain, using real document loading, chunking, embeddings, vector search, and answer generation. You need this when a plain LLM is not enough and you want responses grounded in your own documents, policies, or knowledge base.
What You'll Need
- Python 3.10+
- An OpenAI API key set as `OPENAI_API_KEY`
- These packages (install command below):
  - `langchain`
  - `langchain-openai`
  - `langchain-community`
  - `faiss-cpu`
  - `pypdf`
- A few local documents to index, such as PDFs or text files
- Basic familiarity with LangChain chains and retrievers
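Install everything in one shot with `pip install langchain langchain-openai langchain-community faiss-cpu pypdf`. The `langchain-text-splitters` package used below normally ships as a dependency of `langchain`; add it to the install command explicitly if your environment doesn't pull it in.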
Step-by-Step
- Start by loading documents from disk and splitting them into retrieval-friendly chunks. For advanced RAG systems, chunking is not optional; it determines retrieval quality more than most people expect.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Each PDF page becomes one Document with source/page metadata attached.
loader = PyPDFLoader("docs/policy.pdf")
documents = loader.load()

# Split on paragraph, then line, then word boundaries before falling back
# to character-level cuts; the overlap preserves context across chunk edges.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=120,
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_documents(documents)
print(f"Loaded {len(documents)} pages, created {len(chunks)} chunks")
```
- Create embeddings and store the chunks in a vector index. FAISS is fine for local development and internal tools; if you need persistence across restarts, save the index to disk and reload it later, as the next step shows.

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# text-embedding-3-small is cheap and accurate enough for most internal corpora.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(chunks, embeddings)

# Persist the index so you don't re-embed the corpus on every restart.
vectorstore.save_local("faiss_index")
print("Vector index saved to faiss_index/")
```
- Build a retriever that returns the most relevant chunks for a question. For more controlled retrieval, use MMR (maximal marginal relevance) so you get diverse results instead of near-duplicates.

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# allow_dangerous_deserialization is required because FAISS indexes are
# pickled on disk; only load index files you created yourself.
vectorstore = FAISS.load_local(
    "faiss_index",
    embeddings,
    allow_dangerous_deserialization=True,
)

# MMR fetches fetch_k candidates, then keeps the k that balance relevance
# to the query against similarity to chunks already selected.
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 12},
)

query = "What is the policy for data retention?"
docs = retriever.invoke(query)
for i, doc in enumerate(docs, 1):
    print(f"\n--- Chunk {i} ---\n{doc.page_content[:400]}")
```
- Wire the retriever into a prompt-driven generation chain. The key pattern: retrieve context first, then force the model to answer only from that context and to say so when the answer is not present.

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    """You are a careful assistant.
Answer only using the context below.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question:
{question}

Answer:"""
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def format_docs(docs):
    # Join chunks with blank lines so the model sees clear boundaries.
    return "\n\n".join(doc.page_content for doc in docs)

question = "What is the policy for data retention?"
context = format_docs(retriever.invoke(question))
messages = prompt.format_messages(context=context, question=question)
response = llm.invoke(messages)
print(response.content)
```
- Package retrieval and generation into one reusable function. This is the version you actually ship: one call in, grounded answer out, with temperature pinned to 0 for repeatable output and no hidden state.

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    """You are a policy assistant.
Use only the provided context.
If the answer is missing, say: I don't know based on the provided documents.

Context:
{context}

Question:
{question}

Answer in 3-5 sentences max."""
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def rag_answer(question: str) -> str:
    """Retrieve relevant chunks, then answer strictly from them."""
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    messages = prompt.format_messages(context=context, question=question)
    return llm.invoke(messages).content

print(rag_answer("How long do we keep customer records?"))
```
Testing It
Run a few questions that should be answered directly by your source documents, then compare the output against the original text. If your answers are vague or wrong, fix chunk size first before touching prompts or models.
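If you want to make that chunk-size pass systematic, a minimal sketch like the one below rebuilds the index at a few sizes and checks whether a known evidence phrase surfaces in the top hits. The `probes` list is hypothetical; replace it with question/evidence pairs from your own documents. It reuses `documents` and `embeddings` from the earlier steps.

```python
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Hypothetical probes: (question, phrase the right chunk must contain).
probes = [("What is the policy for data retention?", "retention")]

for size in (400, 800, 1600):
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 8)
    candidate = FAISS.from_documents(splitter.split_documents(documents), embeddings)
    for question, phrase in probes:
        hits = candidate.similarity_search(question, k=4)
        found = any(phrase in d.page_content.lower() for d in hits)
        print(f"chunk_size={size}: {'hit' if found else 'miss'} -> {question!r}")
```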
Also test an out-of-scope question like “What is our holiday policy?” if that information is not in the indexed files. The model should refuse cleanly with “I don't know” rather than inventing an answer.
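A quick way to pin that behavior down is an assertion against the refusal phrasing the prompt mandates (a sketch; adjust the expected substring to match your own prompt):

```python
# The prompt above instructs the model to say "I don't know based on the
# provided documents" when the context lacks the answer, so test for its core.
answer = rag_answer("What is our holiday policy?")
assert "don't know" in answer.lower(), f"Expected a refusal, got: {answer}"
```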
For deeper validation, inspect which chunks were retrieved for each query and confirm they contain the evidence you expect. In RAG systems, retrieval quality is usually where failures start.
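One way to do that inspection is a sketch using FAISS's scored search; with a default index the score is an L2 distance, so lower means closer:

```python
# Print score, provenance, and a preview for each retrieved chunk.
query = "What is the policy for data retention?"
for doc, score in vectorstore.similarity_search_with_score(query, k=4):
    source = doc.metadata.get("source", "unknown")
    page = doc.metadata.get("page", "?")
    print(f"score={score:.3f}  source={source}  page={page}")
    print(doc.page_content[:200], "\n")
```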
Next Steps
- Add metadata filtering so different document types can be queried separately (see the sketch after this list)
- Swap FAISS for a persistent vector database like pgvector or Pinecone
- Add reranking before generation to improve precision on ambiguous queries
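For the metadata-filtering item, here is a minimal sketch assuming you tag each chunk with a hypothetical `doc_type` key at indexing time. LangChain's FAISS wrapper accepts a metadata filter dict; pgvector, Pinecone, and other stores each have their own filter syntax. It reuses `chunks` and `embeddings` from the earlier steps.

```python
# Hypothetical tagging step: in practice, derive doc_type from the file
# path or the loader used for each document.
for chunk in chunks:
    chunk.metadata["doc_type"] = "policy"

store = FAISS.from_documents(chunks, embeddings)
policy_retriever = store.as_retriever(
    search_kwargs={"k": 4, "filter": {"doc_type": "policy"}},
)
docs = policy_retriever.invoke("What is the policy for data retention?")
print(f"{len(docs)} policy chunks retrieved")
```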
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.