LlamaIndex Tutorial (Python): chunking large documents for beginners
This tutorial shows you how to split large documents into smaller, usable chunks with LlamaIndex in Python. You need this when your source files are too big to embed or retrieve as a single unit, or when you want better search quality by controlling chunk size and overlap.
What You'll Need
- Python 3.10+
- llama-index
- A text file, PDF, or markdown document to chunk
- Optional: an OpenAI API key if you want to build retrieval on top later
- Basic familiarity with Python lists, loops, and file I/O
Install the package first:
pip install llama-index
Step-by-Step
- Start by loading a large document into LlamaIndex as a Document. For beginners, a plain .txt file is the easiest starting point because it avoids extra parsing dependencies (if you need PDFs or whole folders, see the sketch after this code block).
from pathlib import Path
from llama_index.core import Document
file_path = Path("large_document.txt")
text = file_path.read_text(encoding="utf-8")
# Wrap the raw text in a Document so splitters and indexes can consume it
document = Document(text=text, metadata={"source": str(file_path)})
print(f"Loaded {len(document.text)} characters from {document.metadata['source']}")
- Create a splitter that controls chunk size and overlap. Smaller chunks improve retrieval precision, while overlap helps preserve context across boundaries. Note that chunk_size is measured in tokens, not characters.
from llama_index.core.node_parser import SentenceSplitter
splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=64,
)
chunks = splitter.split_text(document.text)
print(f"Created {len(chunks)} chunks")
print("\nFirst chunk preview:")
print(chunks[0][:500])
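Because chunk_size counts tokens, the character lengths of the resulting chunks will vary. Here is a quick sanity check of the spread, using character counts as a rough proxy:
lengths = [len(c) for c in chunks]
# chunk_size is measured in tokens, so character lengths are only a proxy
print("min/avg/max chars:", min(lengths), sum(lengths) // len(lengths), max(lengths))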
- If you want LlamaIndex-native objects instead of raw strings, convert the text into nodes. This is the format you will use later for indexing, retrieval, and citation-aware workflows.
from llama_index.core.node_parser import SentenceSplitter
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
# Unlike raw string chunks, nodes carry IDs, metadata, and relationships
nodes = splitter.get_nodes_from_documents([document])
print(f"Created {len(nodes)} nodes")
first_node = nodes[0]
print("Node ID:", first_node.node_id)
print("Text preview:", first_node.text[:300])
print("Metadata:", first_node.metadata)
- Save the chunks so you can inspect them or feed them into another pipeline. In production, this is useful for debugging bad splits before they reach your vector index.
from pathlib import Path
from llama_index.core.node_parser import SentenceSplitter
output_dir = Path("chunks")
output_dir.mkdir(exist_ok=True)
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_text(document.text)
# Write each chunk to its own numbered file for easy inspection
for i, chunk in enumerate(chunks):
    (output_dir / f"chunk_{i:03d}.txt").write_text(chunk, encoding="utf-8")
print(f"Saved {len(chunks)} chunks to {output_dir.resolve()}")
- Tune the chunk settings based on document type. Legal contracts usually need larger chunks and more overlap; support tickets or FAQs usually work better with smaller chunks.
from llama_index.core.node_parser import SentenceSplitter
configs = [
    {"name": "small", "chunk_size": 256, "chunk_overlap": 32},
    {"name": "medium", "chunk_size": 512, "chunk_overlap": 64},
    {"name": "large", "chunk_size": 1024, "chunk_overlap": 128},
]
# Compare how many chunks each configuration produces on the same text
for cfg in configs:
    splitter = SentenceSplitter(
        chunk_size=cfg["chunk_size"],
        chunk_overlap=cfg["chunk_overlap"],
    )
    chunks = splitter.split_text(document.text)
    print(cfg["name"], "=>", len(chunks), "chunks")
Testing It
Run the script against a real document and check that the number of chunks is reasonable for the file size. Open a few saved chunk files and confirm that sentences are not being cut off mid-thought too often.
If your chunks are too small, increase chunk_size. If answers later lose context across boundaries, increase chunk_overlap.
A good quick test is to search for a specific paragraph in the original document and verify it appears intact in one or two adjacent chunks.
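That spot check is easy to script. A minimal sketch, where probe is a placeholder you replace with a sentence copied verbatim from your document:
probe = "PASTE A REAL SENTENCE FROM YOUR DOCUMENT HERE"
hits = [i for i, chunk in enumerate(chunks) if probe in chunk]
# Expect one hit, or two consecutive hits if the sentence falls in an overlap
print("probe found in chunks:", hits)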
Next Steps
- Connect these nodes to a vector index like VectorStoreIndex
- Learn how to use SimpleDirectoryReader for PDFs and folders of documents
- Compare SentenceSplitter with token-based splitters for long-form legal or policy text
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.