Haystack Tutorial (Python): handling long documents for beginners
This tutorial shows you how to load long documents into Haystack, split them into manageable chunks, index them, and retrieve the right passages with Python. You need this when a single PDF, policy manual, or contract is too large to stuff into one prompt and you want retrieval that stays accurate.
What You'll Need
- Python 3.10+
- Haystack 2.x
- An OpenAI API key if you want to use the embedding model below
- A working internet connection for model downloads and API calls
- Basic familiarity with Haystack Document, Pipeline, and retrievers
- Optional: a .env file for storing OPENAI_API_KEY
Install the packages:
pip install haystack-ai openai
Step-by-Step
- Start with a long document and split it into smaller chunks.
Long documents hurt retrieval when they are treated as one block, because embeddings become too broad and relevant passages get buried.
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
long_text = """
Haystack is a framework for building LLM applications.
It supports document retrieval, question answering, and pipelines.
For long documents, splitting is important because embeddings work better on focused chunks.
This is especially useful for policies, contracts, manuals, and reports.
""" * 20
document = Document(content=long_text)
splitter = DocumentSplitter(split_by="word", split_length=40, split_overlap=10)
chunks = splitter.run([document])["documents"]
print(f"Original length: {len(document.content.split())} words")
print(f"Chunk count: {len(chunks)}")
print(chunks[0].content[:200])
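The splitter is not limited to words: split_by also accepts other units such as "sentence", "passage", or "page" in Haystack 2.x. A minimal sketch of sentence-based splitting, reusing long_text from above, where split_length and split_overlap now count sentences instead of words:

# Split on sentence boundaries instead of word counts; split_length and
# split_overlap are now measured in sentences, not words.
sentence_splitter = DocumentSplitter(split_by="sentence", split_length=5, split_overlap=1)
sentence_chunks = sentence_splitter.run([Document(content=long_text)])["documents"]
print("Sentence-based chunk count:", len(sentence_chunks))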
- Inspect the chunk metadata so you know what Haystack produced.
In production, this is where you confirm chunk size and overlap before indexing anything.
for i, chunk in enumerate(chunks[:3], start=1):
    print(f"Chunk {i}")
    print("ID:", chunk.id)
    print("Meta:", chunk.meta)
    print("Text:", chunk.content[:120])
    print("-" * 40)
- Embed the chunks and store them in an in-memory document store.
This example uses OpenAI embeddings because they are easy to wire up for beginners, but the pattern is the same for other embedding backends.
import os
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY", "")
document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
embedded_result = embedder.run(chunks)
embedded_chunks = embedded_result["documents"]
document_store.write_documents(embedded_chunks)
print("Stored documents:", document_store.count_documents())
- Build a retrieval pipeline that embeds the question and searches the indexed chunks.
The key idea is that you never query the raw long document directly; you query the indexed chunks.
from haystack import Pipeline
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
query_pipeline = Pipeline()
query_pipeline.add_component("query_embedder", OpenAITextEmbedder(model="text-embedding-3-small"))
query_pipeline.add_component(
"retriever",
InMemoryEmbeddingRetriever(document_store=document_store),
)
query_pipeline.connect("query_embedder.embedding", "retriever.query_embedding")
question = "Why do we split long documents?"
result = query_pipeline.run({"query_embedder": {"text": question}})
for doc in result["retriever"]["documents"][:3]:
    print(doc.score)
    print(doc.content[:200])
    print("=" * 60)
- Keep chunk size practical for your use case.
For policy docs and manuals, start with 200–400 words per chunk and overlap by 10–20 percent; too small means context loss, too large means noisy retrieval.
def build_chunks(text: str):
    doc = Document(content=text)
    splitter = DocumentSplitter(split_by="word", split_length=300, split_overlap=50)
    return splitter.run([doc])["documents"]
sample_chunks = build_chunks(long_text)
print("Chunks:", len(sample_chunks))
print("First chunk word count:", len(sample_chunks[0].content.split()))
print("Second chunk starts with:", sample_chunks[1].content[:100])
Testing It
Run the script end to end and confirm that the number of stored documents matches the number of generated chunks. Then ask a question that should clearly map to one section of the text, like “Why do we split long documents?”, and check that the top retrieved chunk contains that answer.
If retrieval looks random, your chunks are probably too large or too small. If you see empty embeddings or API errors, verify OPENAI_API_KEY is set in your environment before running the embedder step.
A good sanity check is to change the query wording slightly and see whether the same relevant chunk still appears near the top. That tells you your chunking strategy is stable enough for real user questions.
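That sanity check is easy to script. A minimal sketch, reusing the query_pipeline built earlier; the reworded question below is just one example paraphrase:

def top_chunk_id(question: str) -> str:
    # Return the id of the highest-scoring retrieved chunk for a question.
    result = query_pipeline.run({"query_embedder": {"text": question}})
    return result["retriever"]["documents"][0].id

original = top_chunk_id("Why do we split long documents?")
reworded = top_chunk_id("What is the reason for chunking large files?")
print("Same top chunk after rewording:", original == reworded)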
Next Steps
- Add metadata like source file name, page number, or section title before indexing.
- Replace InMemoryDocumentStore with a persistent backend such as Elasticsearch or PostgreSQL.
- Add a generator component so retrieved chunks feed directly into an answer-producing pipeline (see the sketch after this list).
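For the last item, here is a hedged sketch of a full retrieval-plus-generation pipeline, assuming Haystack's PromptBuilder and OpenAIGenerator components and an example model name (gpt-4o-mini) that you can swap for any model your key can access:

from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

template = """Answer the question using only the context below.
Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:"""

rag = Pipeline()
rag.add_component("query_embedder", OpenAITextEmbedder(model="text-embedding-3-small"))
rag.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
rag.add_component("prompt_builder", PromptBuilder(template=template))
rag.add_component("llm", OpenAIGenerator(model="gpt-4o-mini"))
rag.connect("query_embedder.embedding", "retriever.query_embedding")
rag.connect("retriever.documents", "prompt_builder.documents")
rag.connect("prompt_builder.prompt", "llm.prompt")

question = "Why do we split long documents?"
answer = rag.run({
    "query_embedder": {"text": question},
    "prompt_builder": {"question": question},
})
print(answer["llm"]["replies"][0])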
By Cyprian Aarons, AI Consultant at Topiax.