LlamaIndex Tutorial (Python): chunking large documents for intermediate developers
This tutorial shows you how to split large documents into chunks with LlamaIndex in Python, then inspect and tune those chunks before feeding them into retrieval or RAG pipelines. You need this when your source files are too large for a single context window, or when naive splitting causes bad retrieval because the chunks are too big, too small, or cut across semantic boundaries.
What You'll Need
- Python 3.10+
- `llama-index`
- `llama-index-readers-file`
- `llama-index-embeddings-openai`
- An OpenAI API key set as `OPENAI_API_KEY`
- A large local text file, PDF, or Markdown document
- Basic familiarity with `Document`, `Settings`, and `VectorStoreIndex`
Step-by-Step
- Install the packages and set up your environment.

For this tutorial, we’ll use a local text file so you can run it without depending on external loaders beyond the file reader package.

```shell
pip install llama-index llama-index-readers-file llama-index-embeddings-openai
export OPENAI_API_KEY="your-api-key"
```
- Load a large document into LlamaIndex.

`SimpleDirectoryReader` gives you a clean way to ingest files from a folder, which is enough for most internal docs and policy files.

```python
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_dir="./data",
    required_exts=[".txt"],
).load_data()

print(f"Loaded {len(documents)} document(s)")
print(documents[0].text[:500])
```
- Chunk the documents with a custom splitter.

This is the main step: use `SentenceSplitter` to control chunk size and overlap so retrieval has enough surrounding context without bloating every chunk.

```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=64,
)

nodes = splitter.get_nodes_from_documents(documents)
print(f"Created {len(nodes)} chunks")

for i, node in enumerate(nodes[:3]):
    print(f"\nChunk {i+1}")
    print(node.text[:400])
```
- Inspect chunk boundaries before indexing.

Intermediate developers should always verify chunking output on real data. If you see headers isolated from body text or tables split awkwardly, adjust `chunk_size` and `chunk_overlap`.

```python
for i, node in enumerate(nodes[:5]):
    print("=" * 80)
    print(f"Chunk {i+1}")
    print(f"Characters: {len(node.text)}")
    print(node.text)
```
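Beyond eyeballing the text, it helps to summarize chunk lengths so outliers stand out. Here is a minimal, library-free sketch (the helper name `chunk_stats` is my own); with real LlamaIndex nodes you would pass `[n.text for n in nodes]`:

```python
def chunk_stats(texts):
    """Summarize character lengths of a list of chunk texts."""
    lengths = sorted(len(t) for t in texts)
    n = len(lengths)
    return {
        "count": n,
        "min": lengths[0],
        "median": lengths[n // 2],
        "max": lengths[-1],
    }

# Example with dummy chunks; a real run uses [n.text for n in nodes].
print(chunk_stats(["a" * 120, "b" * 480, "c" * 510]))
```

A very small `min` usually means a heading or stray fragment became its own chunk. Note that `chunk_size` is measured in tokens, so the character counts you see here will be several times larger than the setting itself.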
- Build an index from the chunks and test retrieval.

Once the chunks look right, build a vector index and query it. This confirms that your chunking strategy supports useful retrieval instead of just producing clean-looking text blocks.

```python
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine()

response = query_engine.query("What does the document say about retention policy?")
print(response)
```
- Tune for different document types.

Policy docs, contracts, meeting notes, and technical manuals need different settings. Use smaller chunks for dense factual text and larger chunks for narrative or procedural content.

```python
from llama_index.core.node_parser import SentenceSplitter

configs = [
    {"name": "dense_policy", "chunk_size": 384, "chunk_overlap": 48},
    {"name": "long_procedure", "chunk_size": 768, "chunk_overlap": 96},
]

for cfg in configs:
    splitter = SentenceSplitter(
        chunk_size=cfg["chunk_size"],
        chunk_overlap=cfg["chunk_overlap"],
    )
    test_nodes = splitter.get_nodes_from_documents(documents)
    print(cfg["name"], len(test_nodes), "chunks")
```
Testing It
Run the script against one real document that is long enough to matter, not a toy paragraph. Check that the number of chunks is reasonable and that adjacent chunks overlap enough to preserve context without repeating entire sections.
Then ask queries that should land in different parts of the document and compare results across chunk sizes. If retrieval keeps missing obvious answers, your chunks are probably too large, too small, or splitting important phrases across boundaries.
A good sanity check is to print the first few chunks and confirm headings stay attached to their content. For production work, keep a sample set of documents and queries so you can compare chunking settings before rolling them into an agent pipeline.
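The overlap check can be made concrete without any LlamaIndex calls. This library-free sketch (the function name `longest_overlap` is my own) measures how many characters consecutive chunks share; since `chunk_overlap` is specified in tokens, expect the character count to be a few times larger than the setting:

```python
def longest_overlap(prev: str, curr: str, max_check: int = 500) -> int:
    """Length of the longest suffix of `prev` that is also a prefix of `curr`."""
    limit = min(len(prev), len(curr), max_check)
    for size in range(limit, 0, -1):
        if prev[-size:] == curr[:size]:
            return size
    return 0

# Toy adjacent chunks sharing one sentence of overlap.
chunks = [
    "Retention policy. Records are kept for seven years.",
    "Records are kept for seven years. Exceptions require approval.",
]
print(longest_overlap(chunks[0], chunks[1]))  # length of the shared sentence
```

Run it over consecutive `node.text` pairs: zeros everywhere mean overlap is not being applied, while very large values mean you are embedding the same text twice.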
Next Steps
- Learn how to use `SemanticSplitterNodeParser` when sentence-based splitting is not enough.
- Add metadata like source file name, section heading, or page number before indexing.
- Compare retrieval quality across `VectorStoreIndex`, rerankers, and hybrid search for your own corpus.
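On the metadata point: each LlamaIndex node carries a `metadata` dict, so stamping provenance onto chunks before indexing is just a loop (on real nodes, `node.metadata.update({...})`). Here is a library-free sketch of the same idea; the helper `attach_metadata` and the field names are my own:

```python
def attach_metadata(chunks, file_name):
    """Pair each chunk with provenance metadata, mirroring a node's metadata dict."""
    records = []
    for i, text in enumerate(chunks):
        first_line = text.splitlines()[0] if text.strip() else ""
        records.append({
            "text": text,
            "metadata": {
                "file_name": file_name,
                "chunk_index": i,
                "first_line": first_line,
            },
        })
    return records

records = attach_metadata(["Retention Policy\nRecords are kept for seven years.", "Appendix A"], "policy.txt")
print(records[0]["metadata"])
```

Metadata attached before indexing flows through to retrieval, so filters and source citations can use it later.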
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit