LangChain Tutorial (Python): Handling Long Documents for Advanced Developers
This tutorial shows you how to take long documents, split them into usable chunks, index them, and answer questions over them with LangChain in Python. You need this when your source material is too large for a single prompt, or when you want retrieval that stays accurate across contracts, policies, reports, or case files.
What You'll Need
- Python 3.10+
- An OpenAI API key exported as the OPENAI_API_KEY environment variable
- These packages: langchain, langchain-openai, langchain-community, langchain-text-splitters, faiss-cpu
- A long text document to test with, saved locally as a .txt file
- Basic familiarity with LangChain chains and chat models
Step-by-Step
- Start by loading your long document from disk. For production work, keep the raw source separate from any processed chunks so you can rebuild the index when your chunking strategy changes (a minimal layout sketch follows the loading code).
from pathlib import Path
doc_path = Path("long_document.txt")
text = doc_path.read_text(encoding="utf-8")
print(f"Loaded {len(text)} characters")
print(text[:500])
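One way to keep that separation honest is a two-directory layout. This is a minimal sketch; the directory names are illustrative, not part of the tutorial:

from pathlib import Path

RAW_DIR = Path("data/raw")              # original sources, treated as read-only
PROCESSED_DIR = Path("data/processed")  # derived chunks and indexes, safe to delete
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

Deleting data/processed and re-running the pipeline then becomes a safe way to try a new chunking strategy.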
- Split the document into overlapping chunks. The overlap matters because important facts often cross chunk boundaries, especially in legal and insurance documents.
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(text)
print(f"Created {len(chunks)} chunks")
print(chunks[0][:800])
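To make rebuilds deterministic, one option (a sketch, not part of the tutorial's pipeline; the file path is illustrative) is to persist the chunks together with the parameters that produced them:

import json
from pathlib import Path

manifest = {"chunk_size": 1200, "chunk_overlap": 200, "num_chunks": len(chunks)}
out_path = Path("data/processed/chunks.json")
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text(json.dumps({"manifest": manifest, "chunks": chunks}), encoding="utf-8")

When the manifest on disk stops matching your splitter settings, that is your signal to re-chunk and rebuild the index.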
- Turn those chunks into retrievable vectors and store them in FAISS. This is the part that lets you search long documents without stuffing everything into the context window.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
docs = [
    Document(page_content=chunk, metadata={"chunk_id": i})
    for i, chunk in enumerate(chunks)
]
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
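FAISS lives in memory here, so the index is gone when the process exits. Its save_local and load_local methods (part of the langchain-community FAISS wrapper) let you skip re-embedding on later runs; the directory name below is arbitrary:

# Persist the index and its docstore to disk.
vectorstore.save_local("faiss_index")

# In a later session, reload instead of re-embedding:
# vectorstore = FAISS.load_local(
#     "faiss_index",
#     embeddings,
#     allow_dangerous_deserialization=True,  # opt-in flag: the docstore is pickled
# )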
- Build a retrieval chain that answers questions using only the most relevant chunks. For long-document workflows, this is the default pattern you should reach for before considering summarization pipelines.
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain
from langchain_core.prompts import ChatPromptTemplate
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context. If the answer is not in the context, say you don't know."),
    ("human", "Question: {input}\n\nContext:\n{context}"),
])
document_chain = create_stuff_documents_chain(llm, prompt)
qa_chain = create_retrieval_chain(retriever, document_chain)
- Query the document and inspect which chunks were used. In real systems, logging retrieved chunk IDs is useful for debugging bad answers and proving traceability.
question = "What are the main obligations described in the document?"
result = qa_chain.invoke({"input": question})
print("Answer:")
print(result["answer"])
print("\nRetrieved chunk IDs:")
for doc in result["context"]:
    print(doc.metadata["chunk_id"])
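For that kind of traceability in a real system, you would likely route the chunk IDs through your logging stack instead of printing them. A minimal sketch with the standard library (the logger name is arbitrary):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("qa.retrieval")

chunk_ids = [doc.metadata["chunk_id"] for doc in result["context"]]
logger.info("question=%r retrieved_chunks=%s", question, chunk_ids)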
- If your document is extremely large, add a preprocessing pass for summaries before retrieval. Use summaries for broad navigation and vector search for exact detail; that combination works better than either alone.
from langchain_core.prompts import PromptTemplate

summary_prompt = PromptTemplate.from_template(
    "Summarize this section in 5 bullet points:\n\n{section}"
)
# Piping the prompt into the model (LCEL) replaces the deprecated LLMChain.
summary_chain = summary_prompt | llm

section_summaries = []
for i in range(0, len(chunks), 5):
    section = "\n\n".join(chunks[i:i + 5])
    summary = summary_chain.invoke({"section": section}).content
    section_summaries.append(summary)
print(section_summaries[0])
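To wire the two tiers together, one approach is to index the summaries separately and let a hit there narrow the detailed search. The sketch below reuses Document, FAISS, embeddings, and vectorstore from above; the query string is a placeholder, and the callable filter assumes a recent langchain-community version (dict filters also work):

summary_docs = [
    Document(page_content=s, metadata={"first_chunk": i * 5})
    for i, s in enumerate(section_summaries)
]
summary_store = FAISS.from_documents(summary_docs, embeddings)

# Broad navigation: find the most relevant section first.
hit = summary_store.similarity_search("termination obligations", k=1)[0]
start = hit.metadata["first_chunk"]

# Exact detail: search only the chunks that section covers.
detail_hits = vectorstore.similarity_search(
    "termination obligations",
    k=4,
    filter=lambda md: start <= md["chunk_id"] < start + 5,
)
for doc in detail_hits:
    print(doc.metadata["chunk_id"], doc.page_content[:80])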
Testing It
Run the script against a document where you already know a few facts, then ask targeted questions about those facts. Check that the answer matches the source text and that retrieved chunks include the relevant section rather than random neighbors.
If answers are vague or wrong, reduce chunk_size, increase chunk_overlap, or raise k from 4 to 6. If retrieval still misses key content, your separators may be too coarse for the structure of the document.
For production validation, test three cases: an answer clearly present in one chunk, an answer split across two chunks, and a question whose answer does not exist in the document. The last case should return “I don’t know,” not a hallucination.
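A minimal harness for those three cases might look like this (a sketch; the questions and expected phrases are placeholders to replace with facts from your own test document):

cases = [
    # (question, phrase that must appear in the answer); placeholders only
    ("What is the notice period for termination?", "30 days"),
    ("Which parties share liability for late delivery?", "supplier and carrier"),
]
for q, expected in cases:
    result = qa_chain.invoke({"input": q})
    assert expected.lower() in result["answer"].lower(), result["answer"]

# The no-answer case should refuse rather than hallucinate.
# Loose check: refusal wording varies between runs and models.
result = qa_chain.invoke({"input": "What is the document author's birthday?"})
assert "don't know" in result["answer"].lower(), result["answer"]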
Next Steps
- Add metadata filters for document type, policy number, client ID, or effective date (a sketch follows this list)
- Replace FAISS with a persistent vector store like pgvector or Pinecone for multi-session workloads
- Learn map-reduce and refine summarization chains for documents that are too large even for retrieval-first workflows
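For the metadata-filter bullet, a sketch of how it could look with the FAISS retriever (the metadata keys and values here are hypothetical; you would attach them to each Document when building the index):

filtered_retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 4,
        # Matches only chunks whose metadata contains these keys; the keys
        # shown here are hypothetical and must exist on your Documents.
        "filter": {"doc_type": "policy", "client_id": "ACME-001"},
    }
)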
Keep Learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.