LlamaIndex Tutorial (Python): chunking large documents for intermediate developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to split large documents into chunks with LlamaIndex in Python, then inspect and tune those chunks before feeding them into retrieval or RAG pipelines. You need this when your source files are too large for a single context window, or when naive splitting causes bad retrieval because the chunks are too big, too small, or cut across semantic boundaries.

What You'll Need

  • Python 3.10+
  • llama-index
  • llama-index-readers-file
  • llama-index-embeddings-openai
  • An OpenAI API key set as OPENAI_API_KEY
  • A large local text file, PDF, or Markdown document
  • Basic familiarity with Document, Settings, and VectorStoreIndex

Step-by-Step

  1. Install the packages and set up your environment.
    For this tutorial, we’ll use a local text file so you can run it without depending on external loaders beyond the file reader package.
pip install llama-index llama-index-readers-file llama-index-embeddings-openai
export OPENAI_API_KEY="your-api-key"
  2. Load a large document into LlamaIndex.
    The SimpleDirectoryReader gives you a clean way to ingest files from a folder, which is enough for most internal docs and policy files.
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_dir="./data",
    required_exts=[".txt"]
).load_data()

print(f"Loaded {len(documents)} document(s)")
print(documents[0].text[:500])
  3. Chunk the documents with a custom splitter.
    This is the main step: use SentenceSplitter to control chunk size and overlap so retrieval has enough surrounding context without bloating every chunk.
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=512,   # measured in tokens, not characters
    chunk_overlap=64  # tokens shared between adjacent chunks
)

nodes = splitter.get_nodes_from_documents(documents)

print(f"Created {len(nodes)} chunks")
for i, node in enumerate(nodes[:3]):
    print(f"\nChunk {i+1}")
    print(node.text[:400])
  4. Inspect chunk boundaries before indexing.
    Intermediate developers should always verify chunking output on real data. If you see headers isolated from body text or tables split awkwardly, adjust chunk_size and chunk_overlap.
for i, node in enumerate(nodes[:5]):
    print("=" * 80)
    print(f"Chunk {i+1}")
    print(f"Characters: {len(node.text)}")
    print(node.text)
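Beyond eyeballing the output, you can automate part of this check. The sketch below is a hypothetical helper (not part of LlamaIndex) that flags chunks which are suspiciously short, such as a heading stranded from its body, or that end without sentence-final punctuation, a hint that the splitter cut mid-sentence. With real data you would pass `[node.text for node in nodes]`; here sample strings stand in:

```python
# Heuristic sanity checks on chunk text (hypothetical helper, not a LlamaIndex API).
def flag_suspicious_chunks(chunks, min_chars=40):
    """Return (index, reason) pairs for chunks that look mis-split."""
    flags = []
    for i, text in enumerate(chunks):
        stripped = text.strip()
        if len(stripped) < min_chars:
            flags.append((i, "very short chunk"))       # e.g. an isolated heading
        elif stripped[-1] not in ".!?\"')":
            flags.append((i, "ends mid-sentence"))      # splitter cut before a boundary
    return flags

sample = [
    "RETENTION POLICY",  # heading separated from its body
    "Records are kept for seven years. After that period they are destroyed.",
    "This chunk was cut off in the middle of a sentence because the splitter reached its size limit and",
]
for idx, reason in flag_suspicious_chunks(sample):
    print(f"chunk {idx}: {reason}")
```

Thresholds like `min_chars` are corpus-dependent; tune them on a few real documents rather than trusting the defaults here.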
  5. Build an index from the chunks and test retrieval.
    Once the chunks look right, build a vector index and query it. This confirms that your chunking strategy supports useful retrieval instead of just producing clean-looking text blocks.
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine()

response = query_engine.query("What does the document say about retention policy?")
print(response)
  6. Tune for different document types.
    Policy docs, contracts, meeting notes, and technical manuals need different settings. Use smaller chunks for dense factual text and larger chunks for narrative or procedural content.
from llama_index.core.node_parser import SentenceSplitter

configs = [
    {"name": "dense_policy", "chunk_size": 384, "chunk_overlap": 48},
    {"name": "long_procedure", "chunk_size": 768, "chunk_overlap": 96},
]

for cfg in configs:
    splitter = SentenceSplitter(
        chunk_size=cfg["chunk_size"],
        chunk_overlap=cfg["chunk_overlap"]
    )
    test_nodes = splitter.get_nodes_from_documents(documents)
    print(cfg["name"], len(test_nodes), "chunks")

Testing It

Run the script against one real document that is long enough to matter, not a toy paragraph. Check that the number of chunks is reasonable and that adjacent chunks overlap enough to preserve context without repeating entire sections.

Then ask queries that should land in different parts of the document and compare results across chunk sizes. If retrieval keeps missing obvious answers, your chunks are probably too large, too small, or splitting important phrases across boundaries.

A good sanity check is to print the first few chunks and confirm headings stay attached to their content. For production work, keep a sample set of documents and queries so you can compare chunking settings before rolling them into an agent pipeline.
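When comparing settings side by side, summary statistics are quicker to scan than raw chunk dumps. This is a minimal sketch of a hypothetical `chunk_stats` helper; in practice you would feed it `[node.text for node in nodes]` for each splitter configuration and compare the results:

```python
# Quick character-length statistics for a list of chunk texts (hypothetical helper).
def chunk_stats(chunks):
    lengths = sorted(len(c) for c in chunks)
    n = len(lengths)
    return {
        "count": n,
        "min": lengths[0],
        "median": lengths[n // 2],
        "max": lengths[-1],
    }

# Stand-in chunks; real usage would pass the text of each node.
chunks = ["a" * 300, "b" * 450, "c" * 510]
print(chunk_stats(chunks))
```

A wide gap between median and max, or a cluster of tiny chunks, usually means the splitter settings need another pass before you index.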

Next Steps

  • Learn how to use SemanticSplitterNodeParser when sentence-based splitting is not enough.
  • Add metadata like source file name, section heading, or page number before indexing.
  • Compare retrieval quality across a plain VectorStoreIndex, rerankers, and hybrid search on your own corpus.
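The metadata idea above can be sketched without any framework code. In LlamaIndex you would copy these values into each node's `metadata` dict before indexing; here plain dicts stand in for nodes, and `enrich_chunks` is a hypothetical helper that tags each chunk with its source file and the nearest preceding Markdown heading:

```python
import re

# Sketch: attach source file and nearest preceding heading to each chunk.
# Plain dicts stand in for LlamaIndex nodes (hypothetical helper).
def enrich_chunks(chunks, source_file):
    enriched, current_heading = [], None
    for text in chunks:
        match = re.match(r"#+\s+(.*)", text.strip())
        if match:
            current_heading = match.group(1)  # remember the latest heading seen
        enriched.append({
            "text": text,
            "source": source_file,
            "heading": current_heading,
        })
    return enriched

chunks = ["# Retention Policy", "Records are kept for seven years."]
for c in enrich_chunks(chunks, "policy.md"):
    print(c["source"], c["heading"])
```

Metadata like this lets you filter retrieval by file or section, and makes citations in RAG answers much easier to produce.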

By Cyprian Aarons, AI Consultant at Topiax.