Haystack Tutorial (Python): handling long documents for intermediate developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to ingest, split, retrieve, and answer over long documents in Haystack using Python. You need this when your source material is too large for a single prompt, and you want a retrieval pipeline that stays accurate instead of stuffing the whole document into the LLM context.

What You'll Need

  • Python 3.10+
  • haystack-ai
  • An OpenAI API key set as OPENAI_API_KEY
  • A plain-text or PDF document to test with
  • Basic familiarity with Haystack components like Document, DocumentStore, and Pipeline

Step-by-Step

  1. Start by installing Haystack and creating a clean environment. For long-document workflows, you want the latest Haystack 2.x APIs so the component names match what you see here.
pip install haystack-ai
export OPENAI_API_KEY="your-key-here"
  2. Load your long document into memory and split it into smaller chunks. The important part is chunk size: too small and you lose context, too large and retrieval gets noisy.
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

long_text = """
Haystack is an open-source framework for building LLM applications...
""" * 50

document = Document(content=long_text, meta={"source": "internal_guide.txt"})

splitter = DocumentSplitter(split_by="word", split_length=200, split_overlap=40)
split_docs = splitter.run([document])["documents"]

print(f"Original docs: 1")
print(f"Split docs: {len(split_docs)}")
print(split_docs[0].content[:300])
  3. Index the chunks in an in-memory document store and embed them. This gives you semantic search over the chunks instead of brute-forcing the entire document into a prompt.
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
doc_embedder.warm_up()

embedded_docs = doc_embedder.run(split_docs)["documents"]

writer = DocumentWriter(document_store=document_store)
writer.run(embedded_docs)

print(document_store.count_documents())
  4. Build a retriever-plus-generator pipeline that answers questions from the retrieved chunks. This is the core pattern for long documents: retrieve only the relevant sections, then let the LLM synthesize an answer.
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators.openai import OpenAIGenerator
from haystack import Pipeline

retriever = InMemoryEmbeddingRetriever(document_store=document_store)

template = """
Answer the question using only the following documents.

Documents:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""

prompt_builder = PromptBuilder(template=template)
generator = OpenAIGenerator(model="gpt-4o-mini")

pipe = Pipeline()
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("generator", generator)

pipe.connect("retriever.documents", "prompt_builder.documents")
pipe.connect("prompt_builder.prompt", "generator.prompt")
  5. Run a query against the pipeline and inspect the result. Keep questions specific; long-document systems work best when retrieval has a narrow target.
from haystack.components.embedders import SentenceTransformersTextEmbedder

# Embed the query with the text embedder that matches the document embedder,
# so query and chunk vectors live in the same embedding space.
text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
text_embedder.warm_up()

question = "What does this guide say about splitting documents?"
query_embedding = text_embedder.run(question)["embedding"]

result = pipe.run({
    "retriever": {"query_embedding": query_embedding},
    "prompt_builder": {"question": question},
})

print(result["generator"]["replies"][0])
  6. Tighten quality by tuning chunk size and overlap, then test with multiple questions. If answers are missing context, increase overlap; if retrieval feels broad, reduce chunk size. A re-chunking sketch follows the loop below.
test_questions = [
    "Why do we split long documents?",
    "What embedding model is used?",
    "How does retrieval help with long content?"
]

for q in test_questions:
    q_embedding = text_embedder.run(q)["embedding"]
    out = pipe.run({
        "retriever": {"query_embedding": q_embedding},
        "prompt_builder": {"question": q},
    })
    print("\nQ:", q)
    print("A:", out["generator"]["replies"][0])

Testing It

Run three or four targeted questions that should each map to different parts of your source document. If the answers are vague or hallucinated, check whether your chunks are too large or whether your retriever is returning enough documents.
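
One quick knob is top_k. The sketch below reuses the question and query embedding from step 5 and asks the retriever for more chunks at query time; top_k is a run-time input to InMemoryEmbeddingRetriever, and 8 is just an example value.

# Ask the retriever for more chunks when answers seem to be missing context.
result = pipe.run({
    "retriever": {"query_embedding": query_embedding, "top_k": 8},
    "prompt_builder": {"question": question},
})
print(result["generator"]["replies"][0])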

When debugging, also inspect the retrieved content directly before it reaches the generator. In production, I usually log top-k chunk IDs, scores, and source metadata so I can see whether bad answers come from retrieval or generation.
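
Here's a minimal sketch of that kind of inspection, using the retriever built in step 4; the print format is just an example.

# Run the retriever on its own and look at what would reach the prompt.
retrieved = retriever.run(query_embedding=query_embedding, top_k=5)["documents"]
for doc in retrieved:
    print(doc.id, round(doc.score, 3), doc.meta.get("source"))
    print(doc.content[:120], "...\n")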

Next Steps

  • Add a PDF or DOCX loader instead of hardcoding text (see the converter sketch after this list).
  • Swap InMemoryDocumentStore for PostgreSQL or Elasticsearch when you need persistence.
  • Add reranking after retrieval if your long documents contain dense technical language.
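
For the first bullet, here's a minimal sketch using Haystack's PyPDFToDocument converter; it assumes pypdf is installed and uses internal_guide.pdf as a placeholder file name.

# Convert a PDF into Documents, then feed them through the existing splitter.
# Assumes: pip install pypdf; "internal_guide.pdf" is a placeholder path.
from haystack.components.converters import PyPDFToDocument

converter = PyPDFToDocument()
pdf_docs = converter.run(sources=["internal_guide.pdf"])["documents"]
split_docs = splitter.run(documents=pdf_docs)["documents"]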

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.
