LangChain Tutorial (Python): handling long documents for beginners
This tutorial shows you how to take a long document, split it into chunks, index it with LangChain, and ask questions against it in Python. You need this when the source material is too large to fit in a model context window, which is the normal case for PDFs, policy docs, contracts, and internal knowledge bases.
What You'll Need
- Python 3.10+
- A working OpenAI API key
- These packages: `langchain`, `langchain-openai`, `langchain-community`, `langchain-text-splitters`, `faiss-cpu`
- A long text file to test with, such as `document.txt`
Install everything with:

```bash
pip install langchain langchain-openai langchain-community langchain-text-splitters faiss-cpu
```
Set your API key:

```bash
export OPENAI_API_KEY="your-key-here"
```
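If Python cannot see the key, the embedding and chat calls later in this tutorial will fail with an authentication error. A quick sanity check you can run first (this is a hypothetical helper for illustration, not part of LangChain):

```python
import os

def has_openai_key(env=None):
    """Return True if OPENAI_API_KEY is present and non-empty in the given mapping."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())

# Fail fast with a clear message instead of a confusing API error later.
if not has_openai_key():
    print("Warning: OPENAI_API_KEY is not set in this shell session.")
```

Running this before the main script saves a round of debugging when the key was exported in a different terminal.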
Step-by-Step
- Start by loading a long document from disk. For beginners, plain text is easier than PDFs because you can focus on the LangChain flow instead of file parsing.

```python
from pathlib import Path

file_path = Path("document.txt")
text = file_path.read_text(encoding="utf-8")

print(f"Loaded {len(text)} characters")
print(text[:500])
```
- Split the document into overlapping chunks. This is the core pattern for long-document handling because it keeps related context together while avoiding oversized inputs.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
)
chunks = splitter.split_text(text)

print(f"Created {len(chunks)} chunks")
print("First chunk preview:")
print(chunks[0][:400])
```
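To build intuition for what `chunk_size` and `chunk_overlap` do, here is a minimal pure-Python sliding-window sketch. It is not LangChain's actual algorithm (`RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences before falling back to raw character counts), just the underlying idea:

```python
def sliding_chunks(text, chunk_size=1000, overlap=150):
    """Split text into fixed-size windows; consecutive windows share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "abcdefghij" * 50  # 500 characters of stand-in text
windows = sliding_chunks(sample, chunk_size=100, overlap=20)
print(len(windows))                         # 7
print(windows[0][-20:] == windows[1][:20])  # True: neighbors share 20 characters
```

The shared tail means a sentence cut at a chunk boundary still appears whole in at least one chunk, which is why overlap improves answer quality.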
- Turn those chunks into embeddings and store them in a vector database. FAISS is a good beginner-friendly local option because it runs in-process and does not require extra infrastructure.

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

documents = [Document(page_content=chunk) for chunk in chunks]
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(documents, embeddings)
print("Vector store ready")
```
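Conceptually, the vector store maps each chunk to a numeric vector and then finds the chunks nearest to the query vector. The toy below illustrates only that ranking step, using a made-up bag-of-words "embedding" and cosine similarity; the real pipeline uses OpenAI's embedding model and FAISS's optimized index instead:

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, docs, k=2):
    """Rank documents by cosine similarity to the query, highest first."""
    q = toy_embed(query)
    return sorted(docs, key=lambda d: cosine(q, toy_embed(d)), reverse=True)[:k]

docs = [
    "refund policy for damaged goods",
    "shipping times and carriers",
    "refund deadlines and required forms",
]
best = top_k("how do I get a refund", docs, k=2)
print(best)  # the two refund-related documents rank first
```

This is why retrieval works: the query never has to match a chunk word-for-word, it only has to land near it in the embedding space.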
- Create a retriever and wire it into a question-answering chain. The retriever pulls only the most relevant chunks, which keeps the prompt small enough for the model to handle reliably.

```python
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
)

question = "What are the main topics covered in this document?"
answer = qa_chain.invoke({"query": question})
print(answer["result"])
```
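The `chain_type="stuff"` setting simply concatenates ("stuffs") every retrieved chunk into one prompt alongside the question. A simplified sketch of that assembly step (the prompt wording here is hypothetical, not LangChain's exact template):

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Mimic the 'stuff' strategy: put all retrieved chunks into a single prompt."""
    context = "\n\n".join(
        f"[chunk {i + 1}]\n{c}" for i, c in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_stuff_prompt(
    "What are the deadlines?",
    ["Refunds must be filed within 30 days.", "Appeals close after 60 days."],
)
print(prompt)
```

Because everything goes into one prompt, "stuff" only works while `k` chunks fit in the context window; that limit is what the `map_reduce` chain type mentioned later works around.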
- Ask targeted questions against the same document. This is where the setup becomes useful: you can query policy details, clause wording, or process steps without manually reading the whole file.

```python
questions = [
    "Summarize the key risks mentioned.",
    "What actions are required by the reader?",
    "Are there any deadlines or time-sensitive items?",
]

for q in questions:
    result = qa_chain.invoke({"query": q})
    print("\nQUESTION:", q)
    print("ANSWER:", result["result"])
```
Testing It
Run the script against a document that is clearly longer than a single model prompt, ideally several pages of dense text. If everything is wired correctly, you should see chunk counts printed first, then answers that reference content from different parts of the file.
If the answers look vague or irrelevant, reduce `chunk_size` so each chunk is more focused, or increase `k` in the retriever so more chunks are passed to the model. If you get authentication errors, check that `OPENAI_API_KEY` is set in the same shell session where you run Python.
A good test question is one that requires information from more than one section of the document. That forces retrieval to do real work instead of just echoing nearby text.
Next Steps
- Learn how to load PDFs with `PyPDFLoader` instead of plain text files.
- Try different chain types like `map_reduce` when documents are very large.
- Add metadata to chunks so you can trace answers back to page numbers or sections.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.