# LangChain Tutorial (Python): chunking large documents for beginners
This tutorial shows you how to split large documents into smaller chunks with LangChain in Python, then prepare those chunks for downstream tasks like retrieval, summarization, and Q&A. You need this because LLMs have context limits, and feeding them entire PDFs or long text files usually gives worse results than working with well-sized chunks.
## What You'll Need

- Python 3.10+
- `langchain`
- `langchain-text-splitters`
- `langchain-community`
- `tiktoken` if you want token-aware splitting
- A sample document in `.txt`, `.md`, or `.pdf` format
- Optional: an OpenAI API key if you later want to embed or query the chunks

Install the packages (`pypdf` is only needed if you load PDFs):

```bash
pip install langchain langchain-text-splitters langchain-community tiktoken pypdf
```
## Step-by-Step

1. Start with a plain text document loader. For beginners, a `.txt` file is the easiest place to start because it removes PDF parsing noise and lets you focus on chunking behavior.

   ```python
   from langchain_community.document_loaders import TextLoader

   loader = TextLoader("example.txt", encoding="utf-8")
   documents = loader.load()

   print(f"Loaded {len(documents)} document(s)")
   print(documents[0].page_content[:500])
   ```
2. Inspect the raw content before splitting it. This matters because the right chunking strategy depends on structure: headings, paragraphs, and repeated boilerplate all affect how you should split.

   ```python
   text = documents[0].page_content
   print("Characters:", len(text))
   print("First 20 lines:")
   for line in text.splitlines()[:20]:
       print(line)
   ```
3. Split the document into overlapping chunks using `RecursiveCharacterTextSplitter`. This is the default workhorse for many LangChain pipelines because it tries the separators in order and preserves some context between chunks.

   ```python
   from langchain_text_splitters import RecursiveCharacterTextSplitter

   splitter = RecursiveCharacterTextSplitter(
       chunk_size=1000,
       chunk_overlap=200,
       separators=["\n\n", "\n", " ", ""],
   )

   chunks = splitter.split_documents(documents)
   print(f"Created {len(chunks)} chunks")
   for i, chunk in enumerate(chunks[:3]):
       print(f"\n--- Chunk {i+1} ---")
       print(chunk.page_content[:400])
   ```
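To build intuition for what `RecursiveCharacterTextSplitter` is doing, here is a simplified pure-Python sketch of the separator-priority idea. This is not LangChain's actual implementation (which also handles overlap and custom length functions); `simple_recursive_split` and the sample text are made up for illustration:

```python
def simple_recursive_split(text, chunk_size, separators):
    """Greedy sketch of separator-priority splitting (no overlap handling)."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    parts = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for part in parts:
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) > chunk_size:
            # Piece is still too big: retry with the next separator.
            chunks.extend(simple_recursive_split(part, chunk_size, rest))
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks

sample_text = "First paragraph.\n\nSecond paragraph is a little longer.\n\nThird."
for piece in simple_recursive_split(sample_text, 30, ["\n\n", "\n", " ", ""]):
    print(repr(piece))
```

The real splitter additionally re-joins pieces to honor `chunk_overlap`, which is why its output keeps shared context between neighboring chunks.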
4. Keep metadata attached to each chunk. This is what makes chunks usable later, when you want to trace answers back to source documents, pages, or filenames. Note that `TextLoader` already records the file path under `source`; here we add a chunk index alongside it.

   ```python
   for i, chunk in enumerate(chunks):
       chunk.metadata["chunk_id"] = i
       chunk.metadata["source"] = "example.txt"

   print(chunks[0].metadata)
   ```
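To see why metadata pays off, here is a small standalone sketch of tracing chunks back to their sources. The dict records below only mimic the shape of a LangChain `Document`'s metadata, and the filenames are made up:

```python
from collections import defaultdict

# Toy records mimicking chunk.metadata after the loop above.
chunk_records = [
    {"source": "example.txt", "chunk_id": 0},
    {"source": "example.txt", "chunk_id": 1},
    {"source": "notes.md", "chunk_id": 0},
]

by_source = defaultdict(list)
for meta in chunk_records:
    by_source[meta["source"]].append(meta["chunk_id"])

# Now any answer built from a chunk can cite its originating file.
print(dict(by_source))
```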
5. If your downstream model enforces token limits strictly, switch to token-aware splitting. Character counts are fine for a first pass, but token-based chunking gives more predictable behavior for embeddings and chat models.

   ```python
   from langchain_text_splitters import TokenTextSplitter

   token_splitter = TokenTextSplitter(
       chunk_size=250,
       chunk_overlap=50,
   )

   token_chunks = token_splitter.split_documents(documents)
   print(f"Token-based chunks: {len(token_chunks)}")
   print(token_chunks[0].page_content[:400])
   ```
## Testing It
Run the script against a real document and check that the output chunks are readable on their own. A good chunk should keep paragraphs intact where possible and avoid cutting sentences in awkward places too often.
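One cheap way to automate the "readable on its own" check is a crude heuristic pass over the chunk texts. This is an illustrative sketch, not a LangChain feature; `chunk_sanity_report` is a made-up helper and its heuristics are deliberately rough:

```python
def chunk_sanity_report(chunk_texts):
    """Count chunks that look like they start or end mid-sentence."""
    flagged = 0
    for text in chunk_texts:
        starts_lower = text[:1].islower()                 # likely a cut-off opening
        ends_abruptly = text.rstrip()[-1:] not in ".!?"   # no terminal punctuation
        if starts_lower or ends_abruptly:
            flagged += 1
    return flagged

sample = ["This chunk is fine.", "but this one begins mid-sentence", "And this one stops abrupt"]
print(chunk_sanity_report(sample))
```

On real output you would pass `[c.page_content for c in chunks]` and manually review the flagged chunks.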
Also compare the number of chunks produced by character-based splitting versus token-based splitting. If your chunks are too large, reduce `chunk_size`; if they are too fragmented, increase it or reduce `chunk_overlap`.
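If you want a quick feel for token counts without loading tiktoken, a common rule of thumb for English prose is roughly four characters per token. The helper below is a made-up illustration of that heuristic; use tiktoken whenever accuracy matters:

```python
def approx_token_count(text):
    # Rough heuristic: ~4 characters per token for typical English text.
    return max(1, len(text) // 4)

for size in (400, 1000, 4000):
    sample = "x" * size
    print(size, "chars ~", approx_token_count(sample), "tokens")
```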
A practical test is to take one chunk and ask an LLM a question about only that chunk. If the answer is grounded and concise, your chunking setup is probably good enough for a first production pass.
## Next Steps

- Add embeddings and store the chunks in a vector database like FAISS or Chroma.
- Learn how to load PDFs with page metadata using `PyPDFLoader`.
- Tune chunk sizes for specific tasks like legal QA, support ticket search, or document summarization.
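Before wiring up a real vector store, it can help to see retrieval in miniature. The sketch below stands in for that first next step: it uses bag-of-words vectors and cosine similarity in place of real embeddings, and the sample chunks are made up. In practice you would swap in an embedding model plus FAISS or Chroma:

```python
import math
from collections import Counter

def vectorize(text):
    # Bag-of-words stand-in for an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

toy_chunks = [
    "LangChain splits documents into chunks.",
    "FAISS stores vectors for fast similarity search.",
    "Chunk overlap preserves context between chunks.",
]
query = "How do I store vectors for similarity search?"
qv = vectorize(query)
best = max(toy_chunks, key=lambda c: cosine(vectorize(c), qv))
print(best)
```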
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.