LlamaIndex Tutorial (Python): handling long documents for intermediate developers
This tutorial shows you how to ingest long documents with LlamaIndex in Python without blowing past model context limits. You’ll build a chunking and retrieval pipeline that can handle PDFs, policy docs, contracts, or research papers by splitting them into usable pieces and querying only the relevant parts.
What You'll Need
- Python 3.10+
- llama-index
- An LLM API key, such as OPENAI_API_KEY
- Optional but useful:
  - pypdf for PDF loading
  - python-dotenv for local env vars
- A long document to test with (.txt, .md, or .pdf)
Install the core packages:
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai pypdf python-dotenv
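If you use python-dotenv, a minimal .env file in your project root is enough; the key value below is a placeholder:
OPENAI_API_KEY=sk-your-key-here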
Step-by-Step
- Start by loading your API key and setting up the LLM and embedding model. For long documents, embeddings matter as much as generation because retrieval quality depends on them.
import os
from dotenv import load_dotenv
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("Set OPENAI_API_KEY in your environment")
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
- Load the document from disk. For PDFs, SimpleDirectoryReader dispatches to a PDF reader (backed by pypdf) under the hood; for plain text files, loading is straightforward and reliable.
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(
    input_files=["./data/long_document.pdf"]
).load_data()
print(f"Loaded {len(documents)} document(s)")
print(documents[0].text[:500])
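If you would rather point at a folder of mixed files than a single path, SimpleDirectoryReader also accepts a directory. A minimal sketch, assuming your files live under ./data; input_dir, required_exts, and recursive are standard SimpleDirectoryReader arguments:
from llama_index.core import SimpleDirectoryReader

# Load every .pdf, .md, and .txt file under ./data, including subfolders
documents = SimpleDirectoryReader(
    input_dir="./data",
    required_exts=[".pdf", ".md", ".txt"],
    recursive=True,
).load_data()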
- Split the document into chunks that fit retrieval and model constraints. The default chunk sizes are often fine for general use, but for long legal or financial docs you should control them explicitly.
from llama_index.core.node_parser import SentenceSplitter
splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=128
)
nodes = splitter.get_nodes_from_documents(documents)
print(f"Created {len(nodes)} chunks")
print(nodes[0].text[:300])
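Before indexing, it is worth a quick sanity check of the split. Keep in mind that chunk_size counts tokens while len() counts characters, so treat the numbers below as a rough proxy:
# Rough distribution of chunk sizes in characters (token counts will be smaller)
lengths = [len(node.text) for node in nodes]
print(f"chunks={len(lengths)} min={min(lengths)} "
      f"max={max(lengths)} avg={sum(lengths) // len(lengths)}")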
- Build an index over those chunks. A vector index is the right default when you want semantic search over long documents instead of stuffing everything into a prompt.
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query(
    "What are the main obligations mentioned in the document?"
)
print(response)
- Add source visibility so you can inspect which chunks were used. This matters in production because users will ask where an answer came from, especially in regulated workflows.
response = query_engine.query(
    "Summarize the termination conditions."
)
print("Answer:")
print(response)
print("\nSources:")
for i, source in enumerate(response.source_nodes, start=1):
print(f"\nSource {i}: score={source.score}")
print(source.node.text[:400])
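If the document came in as a PDF, the default reader usually stamps each node with page metadata. The page_label key below is what the pypdf-backed loader typically sets; check node.metadata yourself if your loader differs:
for i, source in enumerate(response.source_nodes, start=1):
    page = source.node.metadata.get("page_label", "n/a")
    print(f"Source {i}: page {page}, score={source.score:.3f}")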
- If your document is very large, persist the index so you don’t rebuild embeddings on every run. This is the difference between a prototype and something you can actually operate.
from llama_index.core import StorageContext, load_index_from_storage
persist_dir = "./storage"
index.storage_context.persist(persist_dir=persist_dir)
storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
loaded_index = load_index_from_storage(storage_context)
loaded_query_engine = loaded_index.as_query_engine(similarity_top_k=3)
result = loaded_query_engine.query("What is this document about?")
print(result)
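In practice you usually wrap this in a load-or-build check so reruns skip embedding entirely. A minimal sketch, assuming the same ./storage directory:
import os

from llama_index.core import (
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

persist_dir = "./storage"
if os.path.exists(persist_dir):
    # Reuse previously computed embeddings
    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    index = load_index_from_storage(storage_context)
else:
    # First run: embed the chunks and persist for next time
    index = VectorStoreIndex(nodes)
    index.storage_context.persist(persist_dir=persist_dir)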
Testing It
Run the script against a real long document with at least a few thousand words. Ask questions that require different parts of the file, not just the first page, such as “What are the exceptions?” or “What deadlines are mentioned?”
If answers look vague, reduce chunk_size or increase similarity_top_k. If answers miss important context, try a larger overlap like chunk_overlap=200 so sentences crossing chunk boundaries stay intact.
Also check the source snippets printed from response.source_nodes. If they do not match the question well, your chunking strategy is probably too coarse or your embedding model choice is weak for your domain.
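One way to inspect retrieval directly, without paying for generation, is to call the retriever on its own; as_retriever and retrieve are standard LlamaIndex calls, and the query string is just an example:
retriever = index.as_retriever(similarity_top_k=5)
hits = retriever.retrieve("What deadlines are mentioned?")
for hit in hits:
    print(f"{hit.score:.3f}  {hit.node.text[:120]!r}")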
Next Steps
- Add metadata filters like document type, client name, or effective date before indexing (see the sketch after this list).
- Try RecursiveRetriever or hybrid retrieval when one vector search is not enough.
- Replace simple chunking with domain-aware parsing for contracts, claims files, or medical records.
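As a starting point for the metadata idea in the first bullet, here is a minimal sketch that reuses the splitter and documents from earlier steps. The doc_type key and "contract" value are made up for illustration; MetadataFilters and ExactMatchFilter come from llama_index.core.vector_stores:
from llama_index.core import VectorStoreIndex
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Tag documents before chunking and indexing (hypothetical doc_type key)
for doc in documents:
    doc.metadata["doc_type"] = "contract"

index = VectorStoreIndex(splitter.get_nodes_from_documents(documents))

# Restrict retrieval to chunks whose metadata matches the filter
query_engine = index.as_query_engine(
    filters=MetadataFilters(filters=[ExactMatchFilter(key="doc_type", value="contract")]),
    similarity_top_k=3,
)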
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.