LangChain Tutorial (Python): chunking large documents for intermediate developers
This tutorial shows you how to split large documents into clean, usable chunks with LangChain in Python. You need this when your source text is too long for a single model prompt, or when you want better retrieval quality for RAG, search, or summarization pipelines.
What You'll Need
- Python 3.10+
- `langchain`
- `langchain-text-splitters`
- `openai` if you want to embed or summarize the chunks later
- An OpenAI API key set as `OPENAI_API_KEY` if you plan to connect chunking to downstream LLM steps
- A large text file, PDF text export, or any long string you want to process
Install the packages:

```bash
pip install langchain langchain-text-splitters openai
```
Step-by-Step
- Start by loading a large document into memory. For this tutorial, we'll use a plain text file because it keeps the example focused on chunking rather than file parsing.

```python
from pathlib import Path

file_path = Path("document.txt")
text = file_path.read_text(encoding="utf-8")

print(f"Loaded {len(text)} characters")
print(text[:500])
```
- Use `RecursiveCharacterTextSplitter` for general-purpose chunking. It tries to preserve structure by splitting on paragraphs, then sentences, then words if needed.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(text)

print(f"Created {len(chunks)} chunks")
print("First chunk preview:")
print(chunks[0][:500])
```
- Inspect chunk sizes before you wire them into embeddings or retrieval. If your chunks are too large, you'll waste context window space; if they're too small, you lose meaning and retrieval quality.

```python
sizes = [len(chunk) for chunk in chunks]
print(f"Min chunk size: {min(sizes)}")
print(f"Max chunk size: {max(sizes)}")
print(f"Average chunk size: {sum(sizes) / len(sizes):.1f}")

for i, chunk in enumerate(chunks[:3], start=1):
    print(f"\n--- Chunk {i} ({len(chunk)} chars) ---")
    print(chunk[:300])
```
- If you need metadata for a downstream vector store, split into `Document` objects instead of raw strings. This is the pattern you want for production pipelines because it preserves source information.

```python
from langchain_core.documents import Document

docs = [
    Document(page_content=chunk, metadata={"source": "document.txt", "chunk_id": i})
    for i, chunk in enumerate(chunks)
]
print(docs[0])
print(docs[0].metadata)
```
- Tune the splitter based on your use case. Legal and insurance docs usually benefit from larger chunks with more overlap, while FAQ-style content can use smaller chunks with less overlap.

```python
configs = [
    {"name": "balanced", "chunk_size": 1000, "chunk_overlap": 150},
    {"name": "retrieval_heavy", "chunk_size": 700, "chunk_overlap": 200},
    {"name": "summary_focused", "chunk_size": 1500, "chunk_overlap": 100},
]

for cfg in configs:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=cfg["chunk_size"],
        chunk_overlap=cfg["chunk_overlap"],
    )
    test_chunks = splitter.split_text(text)
    print(cfg["name"], len(test_chunks), "chunks")
```
Testing It
Verify that every chunk is readable on its own and that no important section got cut off mid-thought. Check the first and last few chunks manually, then compare the total number of characters across chunks against the original text: because of overlap, the chunk total should come out slightly larger than the original, and a noticeably smaller total is a sign the splitter dropped content.
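These checks can be scripted in plain Python. The sketch below uses a toy document and hand-made overlapping chunks so it runs standalone; `check_chunks` is a hypothetical helper for illustration, not a LangChain API.

```python
def check_chunks(chunks, original, max_size):
    """Sanity-check a list of chunks against the source text."""
    # No chunk should exceed the configured ceiling.
    assert all(len(c) <= max_size for c in chunks), "chunk over size limit"
    # Every chunk should appear verbatim in the original document.
    assert all(c in original for c in chunks), "chunk not found in source"
    # Chunks should appear in source order.
    positions = [original.find(c) for c in chunks]
    assert positions == sorted(positions), "chunks out of order"
    return len(chunks), sum(len(c) for c in chunks)

doc = "alpha beta gamma delta epsilon"
overlapping = ["alpha beta gamma", "gamma delta epsilon"]
print(check_chunks(overlapping, doc, max_size=20))  # → (2, 35)
```

Note that the chunk total (35 characters) exceeds the original length (30) because the word "gamma" is counted twice; that surplus is exactly the overlap, which is the expected shape of a healthy split.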
If you’re using these chunks for embeddings later, run a quick retrieval test with a known query and make sure the top results contain the right source section. In practice, good chunking should improve recall without making answers noisy.
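A real retrieval test would score chunks with embeddings; the sketch below swaps in simple word-overlap scoring so it runs without an API key. The chunks and query are made-up examples, and `tokens`/`score` are illustrative helpers, not library functions.

```python
import re

def tokens(text):
    """Lowercased word tokens with punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(chunk, query):
    """Number of distinct query words that appear in the chunk."""
    return len(tokens(chunk) & tokens(query))

chunks = [
    "The policy covers water damage caused by burst pipes.",
    "Premiums are billed monthly and can be paid online.",
    "Claims must be filed within 30 days of the incident.",
]
query = "water damage from burst pipes"
best = max(chunks, key=lambda c: score(c, query))
print(best)  # → "The policy covers water damage caused by burst pipes."
```

If the top-scoring chunk for a query you know the answer to is the wrong section, revisit your chunk size and overlap before blaming the retriever.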
Next Steps
- Add a vector store like FAISS or Chroma and index the `Document` chunks
- Learn how to parse PDFs and DOCX files before splitting them
- Compare `RecursiveCharacterTextSplitter` with token-based splitters for model-specific limits
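Character counts and token counts diverge, which is why model limits call for token-based splitting. As a rough back-of-envelope check, the snippet below uses the common ~4 characters per token heuristic for English prose; a real pipeline would count with the model's actual tokenizer (e.g. tiktoken), and `estimate_tokens` here is only an illustrative approximation.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: English prose averages about 4 characters per token.
    return max(1, len(text) // 4)

# A 1,000-character chunk is roughly 250 tokens, so a handful of such
# chunks already consumes a meaningful slice of a small context window.
chunk = "word " * 200  # 1,000 characters
print(estimate_tokens(chunk))  # → 250
```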
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.