LangChain Tutorial (Python): chunking large documents for advanced developers
This tutorial shows you how to split large documents into chunks that are actually usable for retrieval, summarization, and RAG pipelines in LangChain. You need this when raw documents are too long for model context windows, or when naive splitting destroys structure and hurts answer quality.
What You'll Need
- Python 3.10+
- langchain
- langchain-text-splitters
- tiktoken for token-aware splitting
- A sample long text file or string
- Optional: langchain-openai if you want to test downstream embeddings or chat models
Install the core packages:
pip install langchain langchain-text-splitters tiktoken
Step-by-Step
- Start by loading a large document into memory. In production systems, this usually comes from PDFs, HTML, legal contracts, policy manuals, or internal knowledge bases.
from pathlib import Path
file_path = Path("document.txt")
text = file_path.read_text(encoding="utf-8")
print(f"Loaded {len(text)} characters")
print(text[:500])
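If your source is a PDF rather than plain text, a document loader can flatten it first. A minimal sketch, assuming langchain-community and pypdf are installed; the filename and the page-join strategy are illustrative, not prescribed:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("contract.pdf")  # hypothetical sample file
pages = loader.load()  # one Document per page, with page-number metadata
text = "\n\n".join(page.page_content for page in pages)
print(f"Loaded {len(pages)} pages, {len(text)} characters")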
- Use a recursive splitter first. This is the default workhorse for large documents because it tries paragraph boundaries before falling back to smaller separators like sentences and words.
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(text)
print(f"Created {len(chunks)} chunks")
for i, chunk in enumerate(chunks[:3]):
    print(f"\n--- Chunk {i+1} ({len(chunk)} chars) ---")
    print(chunk[:400])
- If your downstream model is token-limited, switch to token-aware splitting. Character counts are approximate; token counts are what matter when you are building retrieval pipelines or passing context into chat models.
from langchain_text_splitters import TokenTextSplitter
token_splitter = TokenTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
)
token_chunks = token_splitter.split_text(text)
print(f"Created {len(token_chunks)} token-based chunks")
for i, chunk in enumerate(token_chunks[:2]):
    print(f"\n--- Token Chunk {i+1} ---")
    print(chunk[:400])
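You can also combine the two approaches: RecursiveCharacterTextSplitter.from_tiktoken_encoder keeps the paragraph-first boundary logic while measuring chunk size in tokens. A minimal sketch; cl100k_base is one common encoding, so swap in whatever matches your target model:
from langchain_text_splitters import RecursiveCharacterTextSplitter

hybrid_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # assumed target encoding
    chunk_size=300,
    chunk_overlap=50,
)
hybrid_chunks = hybrid_splitter.split_text(text)
print(f"Created {len(hybrid_chunks)} hybrid chunks")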
- Preserve metadata when you need traceability. In real systems, you want each chunk tied back to the source document so you can cite it later in retrieval results or audit logs.
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
docs = [
    Document(
        page_content=text,
        metadata={"source": "document.txt", "doc_type": "policy"},
    )
]
splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=150)
chunked_docs = splitter.split_documents(docs)
print(f"Created {len(chunked_docs)} Document chunks")
print(chunked_docs[0].metadata)
print(chunked_docs[0].page_content[:300])
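If you also want character-level provenance, RecursiveCharacterTextSplitter accepts add_start_index=True, which stamps each chunk's offset in the source document into its metadata. A small variation on the splitter above:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=150,
    add_start_index=True,  # records each chunk's character offset in the source
)
chunked_docs = splitter.split_documents(docs)
print(chunked_docs[0].metadata)  # now includes source, doc_type, and start_index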
- Tune chunk size based on the task. For semantic search, smaller chunks often retrieve better; for summarization and extraction, slightly larger chunks can preserve enough surrounding context to avoid fragmenting meaning.
from langchain_text_splitters import RecursiveCharacterTextSplitter
configs = [
    {"chunk_size": 800, "chunk_overlap": 100},
    {"chunk_size": 1200, "chunk_overlap": 150},
    {"chunk_size": 2000, "chunk_overlap": 200},
]
for cfg in configs:
    splitter = RecursiveCharacterTextSplitter(**cfg)
    parts = splitter.split_text(text)
    avg_len = sum(len(p) for p in parts) / len(parts)
    print(cfg, "chunks:", len(parts), "avg chars:", round(avg_len))
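Because character counts only approximate token budgets, it helps to report average tokens per chunk for each config as well. A small extension of the loop above, assuming cl100k_base is the right encoding for your model:
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed target encoding
for cfg in configs:
    splitter = RecursiveCharacterTextSplitter(**cfg)
    parts = splitter.split_text(text)
    avg_tokens = sum(len(enc.encode(p)) for p in parts) / len(parts)
    print(cfg, "avg tokens:", round(avg_tokens))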
Testing It
Verify that your chunks are neither too small nor too large. If most chunks are under a few hundred characters, retrieval quality usually suffers because there is not enough context; if they are much larger than your task needs, you waste tokens and dilute relevance, since each retrieved chunk carries more off-topic text.
Also inspect overlap manually. The end of one chunk should usually flow into the beginning of the next without duplicating entire sections.
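A quick way to eyeball the seams, reusing the chunks list from the recursive splitter above:
# Print the boundary between consecutive chunks to verify the overlap.
for i in range(min(3, len(chunks) - 1)):
    print(f"\n--- Seam after chunk {i+1} ---")
    print("END OF CHUNK :", chunks[i][-120:])
    print("START OF NEXT:", chunks[i + 1][:120])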
If you are building RAG, run a quick retrieval test against a few known questions and check whether the returned chunks contain the exact answer span. That tells you more than raw chunk counts ever will.
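A minimal smoke test, assuming the optional langchain-openai install from the prerequisites, an OPENAI_API_KEY in the environment, and the chunked_docs from the metadata step; the question is a placeholder, so use ones with known answer spans in your own document:
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

store = InMemoryVectorStore.from_documents(chunked_docs, OpenAIEmbeddings())
question = "What is the refund policy?"  # hypothetical known question
for doc in store.similarity_search(question, k=3):
    print(doc.metadata, "->", doc.page_content[:150])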
Next Steps
- Add embeddings and a vector store so you can test retrieval quality against different chunking strategies.
- Learn how to split by structure first: headings, sections, pages, then fall back to recursive splitting (see the sketch after this list).
- Build an evaluation script that compares recall across multiple chunk_size and chunk_overlap settings.
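As a preview of structure-first splitting: for Markdown sources, MarkdownHeaderTextSplitter carves along headings and records them as metadata, and you can run a recursive splitter on the resulting sections afterwards. A minimal sketch with inline sample text:
from langchain_text_splitters import MarkdownHeaderTextSplitter

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "section"), ("##", "subsection")]
)
sections = md_splitter.split_text("# Intro\nHello.\n## Scope\nDetails here.")
for doc in sections:
    print(doc.metadata, "->", doc.page_content)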
Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist + starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.