LangChain Tutorial (Python): Handling Long Documents for Advanced Developers
This tutorial shows you how to take long documents, split them into usable chunks, index them, and answer questions over them with LangChain in Python. You need this when your source material is too large for a single prompt, or when you want retrieval that stays accurate across contracts, policies, reports, or case files.
What You'll Need
- Python 3.10+
- An OpenAI API key exported as the OPENAI_API_KEY environment variable
- These packages: langchain, langchain-openai, langchain-community, langchain-text-splitters, faiss-cpu
- A long text document to test with, saved locally as a .txt file
- Basic familiarity with LangChain chains and chat models
Step-by-Step
- Start by loading your long document from disk. For production work, keep the raw source separate from any processed chunks so you can rebuild the index when your chunking strategy changes (a minimal layout sketch follows the loading code).
from pathlib import Path
doc_path = Path("long_document.txt")
text = doc_path.read_text(encoding="utf-8")
print(f"Loaded {len(text)} characters")
print(text[:500])
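One way to keep that separation honest is a two-directory layout. This is a minimal sketch; the directory names are illustrative, not part of the tutorial:

from pathlib import Path

RAW_DIR = Path("data/raw")              # original sources, treated as read-only
PROCESSED_DIR = Path("data/processed")  # derived chunks and indexes, safe to delete
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

Deleting data/processed and re-running the pipeline then becomes a safe way to try a new chunking strategy.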
- Split the document into overlapping chunks. The overlap matters because important facts often cross chunk boundaries, especially in legal and insurance documents.
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(text)
print(f"Created {len(chunks)} chunks")
print(chunks[0][:800])
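To make rebuilds deterministic, one option (a sketch, not part of the tutorial's pipeline; the file path is illustrative) is to persist the chunks together with the parameters that produced them:

import json
from pathlib import Path

manifest = {"chunk_size": 1200, "chunk_overlap": 200, "num_chunks": len(chunks)}
out_path = Path("data/processed/chunks.json")
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text(json.dumps({"manifest": manifest, "chunks": chunks}), encoding="utf-8")

When the manifest on disk stops matching your splitter settings, that is your signal to re-chunk and rebuild the index.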
- Turn those chunks into retrievable vectors and store them in FAISS. This is the part that lets you search long documents without stuffing everything into the context window.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
docs = [
    Document(page_content=chunk, metadata={"chunk_id": i})
    for i, chunk in enumerate(chunks)
]
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
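FAISS lives in memory here, so the index is gone when the process exits. Its save_local and load_local methods (part of the langchain-community FAISS wrapper) let you skip re-embedding on later runs; the directory name below is arbitrary:

# Persist the index and its docstore to disk.
vectorstore.save_local("faiss_index")

# In a later session, reload instead of re-embedding:
# vectorstore = FAISS.load_local(
#     "faiss_index",
#     embeddings,
#     allow_dangerous_deserialization=True,  # opt-in flag: the docstore is pickled
# )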
- Build a retrieval chain that answers questions using only the most relevant chunks. For long-document workflows, this is the default pattern you should reach for before considering summarization pipelines.
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain
from langchain_core.prompts import ChatPromptTemplate
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context. If the answer is not in the context, say you don't know."),
    ("human", "Question: {input}\n\nContext:\n{context}"),
])
document_chain = create_stuff_documents_chain(llm, prompt)
qa_chain = create_retrieval_chain(retriever, document_chain)
- Query the document and inspect which chunks were used. In real systems, logging retrieved chunk IDs is useful for debugging bad answers and proving traceability.
question = "What are the main obligations described in the document?"
result = qa_chain.invoke({"input": question})
print("Answer:")
print(result["answer"])
print("\nRetrieved chunk IDs:")
for doc in result["context"]:
    print(doc.metadata["chunk_id"])
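For that kind of traceability in a real system, you would likely route the chunk IDs through your logging stack instead of printing them. A minimal sketch with the standard library (the logger name is arbitrary):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("qa.retrieval")

chunk_ids = [doc.metadata["chunk_id"] for doc in result["context"]]
logger.info("question=%r retrieved_chunks=%s", question, chunk_ids)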
- If your document is extremely large, add a preprocessing pass for summaries before retrieval. Use summaries for broad navigation and vector search for exact detail; that combination works better than either alone.
from langchain_core.prompts import PromptTemplate

summary_prompt = PromptTemplate.from_template(
    "Summarize this section in 5 bullet points:\n\n{section}"
)
# Piping the prompt into the model (LCEL) replaces the deprecated LLMChain.
summary_chain = summary_prompt | llm

section_summaries = []
for i in range(0, len(chunks), 5):
    section = "\n\n".join(chunks[i:i + 5])
    summary = summary_chain.invoke({"section": section}).content
    section_summaries.append(summary)
print(section_summaries[0])
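To wire the two tiers together, one approach is to index the summaries separately and let a hit there narrow the detailed search. The sketch below reuses Document, FAISS, embeddings, and vectorstore from above; the query string is a placeholder, and the callable filter assumes a recent langchain-community version (dict filters also work):

summary_docs = [
    Document(page_content=s, metadata={"first_chunk": i * 5})
    for i, s in enumerate(section_summaries)
]
summary_store = FAISS.from_documents(summary_docs, embeddings)

# Broad navigation: find the most relevant section first.
hit = summary_store.similarity_search("termination obligations", k=1)[0]
start = hit.metadata["first_chunk"]

# Exact detail: search only the chunks that section covers.
detail_hits = vectorstore.similarity_search(
    "termination obligations",
    k=4,
    filter=lambda md: start <= md["chunk_id"] < start + 5,
)
for doc in detail_hits:
    print(doc.metadata["chunk_id"], doc.page_content[:80])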
Testing It
Run the script against a document where you already know a few facts, then ask targeted questions about those facts. Check that the answer matches the source text and that retrieved chunks include the relevant section rather than random neighbors.
If answers are vague or wrong, reduce chunk_size, increase chunk_overlap, or raise k from 4 to 6. If retrieval still misses key content, your separators may be too coarse for the structure of the document.
For production validation, test three cases: an answer clearly present in one chunk, an answer split across two chunks, and a question whose answer does not exist in the document. The last case should return “I don’t know,” not a hallucination.
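A minimal harness for those three cases might look like this (a sketch; the questions and expected phrases are placeholders to replace with facts from your own test document):

cases = [
    # (question, phrase that must appear in the answer); placeholders only
    ("What is the notice period for termination?", "30 days"),
    ("Which parties share liability for late delivery?", "supplier and carrier"),
]
for q, expected in cases:
    result = qa_chain.invoke({"input": q})
    assert expected.lower() in result["answer"].lower(), result["answer"]

# The no-answer case should refuse rather than hallucinate.
# Loose check: refusal wording varies between runs and models.
result = qa_chain.invoke({"input": "What is the document author's birthday?"})
assert "don't know" in result["answer"].lower(), result["answer"]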
Next Steps
- Add metadata filters for document type, policy number, client ID, or effective date (a sketch follows this list)
- Replace FAISS with a persistent vector store like pgvector or Pinecone for multi-session workloads
- Learn map-reduce and refine summarization chains for documents that are too large even for retrieval-first workflows
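For the metadata-filter bullet, a sketch of how it could look with the FAISS retriever (the metadata keys and values here are hypothetical; you would attach them to each Document when building the index):

filtered_retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 4,
        # Matches only chunks whose metadata contains these keys; the keys
        # shown here are hypothetical and must exist on your Documents.
        "filter": {"doc_type": "policy", "client_id": "ACME-001"},
    }
)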
Keep Learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.