Haystack Tutorial (Python): handling long documents for beginners
This tutorial shows you how to load long documents into Haystack, split them into manageable chunks, index them, and retrieve the right passages with Python. You need this when a single PDF, policy manual, or contract is too large to stuff into one prompt and you want retrieval that stays accurate.
What You'll Need
- Python 3.10+
- Haystack 2.x
- An OpenAI API key if you want to use the embedding model below
- A working internet connection for model downloads and API calls
- Basic familiarity with Haystack Document, Pipeline, and retrievers
- Optional: a .env file for storing OPENAI_API_KEY
Install the packages:
pip install haystack-ai openai
Step-by-Step
- Start with a long document and split it into smaller chunks.
Long documents hurt retrieval when they are treated as one block, because embeddings become too broad and relevant passages get buried.
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
long_text = """
Haystack is a framework for building LLM applications.
It supports document retrieval, question answering, and pipelines.
For long documents, splitting is important because embeddings work better on focused chunks.
This is especially useful for policies, contracts, manuals, and reports.
""" * 20
document = Document(content=long_text)
splitter = DocumentSplitter(split_by="word", split_length=40, split_overlap=10)
chunks = splitter.run([document])["documents"]
print(f"Original length: {len(document.content.split())} words")
print(f"Chunk count: {len(chunks)}")
print(chunks[0].content[:200])
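The splitter is not limited to words: split_by also accepts other units such as "sentence", "passage", or "page" in Haystack 2.x. A minimal sketch of sentence-based splitting, reusing long_text from above, where split_length and split_overlap now count sentences instead of words:

# Split on sentence boundaries instead of word counts; split_length and
# split_overlap are now measured in sentences, not words.
sentence_splitter = DocumentSplitter(split_by="sentence", split_length=5, split_overlap=1)
sentence_chunks = sentence_splitter.run([Document(content=long_text)])["documents"]
print("Sentence-based chunk count:", len(sentence_chunks))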
- Inspect the chunk metadata so you know what Haystack produced.
In production, this is where you confirm chunk size and overlap before indexing anything.
for i, chunk in enumerate(chunks[:3], start=1):
    print(f"Chunk {i}")
    print("ID:", chunk.id)
    print("Meta:", chunk.meta)
    print("Text:", chunk.content[:120])
    print("-" * 40)
- Embed the chunks and store them in an in-memory document store.
This example uses OpenAI embeddings because they are easy to wire up for beginners, but the pattern is the same for other embedding backends.
import os
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY", "")
document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
embedded_result = embedder.run(chunks)
embedded_chunks = embedded_result["documents"]
document_store.write_documents(embedded_chunks)
print("Stored documents:", document_store.count_documents())
- Build a retrieval pipeline that embeds the question and searches the indexed chunks.
The key idea is that you never query the raw long document directly; you query the indexed chunks.
from haystack import Pipeline
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
query_pipeline = Pipeline()
query_pipeline.add_component("query_embedder", OpenAITextEmbedder(model="text-embedding-3-small"))
query_pipeline.add_component(
"retriever",
InMemoryEmbeddingRetriever(document_store=document_store),
)
query_pipeline.connect("query_embedder.embedding", "retriever.query_embedding")
question = "Why do we split long documents?"
result = query_pipeline.run({"query_embedder": {"text": question}})
for doc in result["retriever"]["documents"][:3]:
    print(doc.score)
    print(doc.content[:200])
    print("=" * 60)
- Keep chunk size practical for your use case.
For policy docs and manuals, start with 200–400 words per chunk and overlap by 10–20 percent; too small means context loss, too large means noisy retrieval.
def build_chunks(text: str):
    doc = Document(content=text)
    splitter = DocumentSplitter(split_by="word", split_length=300, split_overlap=50)
    return splitter.run([doc])["documents"]
sample_chunks = build_chunks(long_text)
print("Chunks:", len(sample_chunks))
print("First chunk word count:", len(sample_chunks[0].content.split()))
print("Second chunk starts with:", sample_chunks[1].content[:100])
Testing It
Run the script end to end and confirm that the number of stored documents matches the number of generated chunks. Then ask a question that should clearly map to one section of the text, like “Why do we split long documents?”, and check that the top retrieved chunk contains that answer.
If retrieval looks random, your chunks are probably too large or too small. If you see empty embeddings or API errors, verify OPENAI_API_KEY is set in your environment before running the embedder step.
A good sanity check is to change the query wording slightly and see whether the same relevant chunk still appears near the top. That tells you your chunking strategy is stable enough for real user questions.
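That sanity check is easy to script. A minimal sketch, reusing the query_pipeline built earlier; the reworded question below is just one example paraphrase:

def top_chunk_id(question: str) -> str:
    # Return the id of the highest-scoring retrieved chunk for a question.
    result = query_pipeline.run({"query_embedder": {"text": question}})
    return result["retriever"]["documents"][0].id

original = top_chunk_id("Why do we split long documents?")
reworded = top_chunk_id("What is the reason for chunking large files?")
print("Same top chunk after rewording:", original == reworded)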
Next Steps
- Add metadata like source file name, page number, or section title before indexing.
- Replace InMemoryDocumentStore with a persistent backend such as Elasticsearch or PostgreSQL.
- Add a generator component so retrieved chunks feed directly into an answer-producing pipeline (see the sketch after this list).
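For the last item, here is a hedged sketch of a full retrieval-plus-generation pipeline, assuming Haystack's PromptBuilder and OpenAIGenerator components and an example model name (gpt-4o-mini) that you can swap for any model your key can access:

from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

template = """Answer the question using only the context below.
Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:"""

rag = Pipeline()
rag.add_component("query_embedder", OpenAITextEmbedder(model="text-embedding-3-small"))
rag.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
rag.add_component("prompt_builder", PromptBuilder(template=template))
rag.add_component("llm", OpenAIGenerator(model="gpt-4o-mini"))
rag.connect("query_embedder.embedding", "retriever.query_embedding")
rag.connect("retriever.documents", "prompt_builder.documents")
rag.connect("prompt_builder.prompt", "llm.prompt")

question = "Why do we split long documents?"
answer = rag.run({
    "query_embedder": {"text": question},
    "prompt_builder": {"question": question},
})
print(answer["llm"]["replies"][0])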
By Cyprian Aarons, AI Consultant at Topiax.