Haystack Tutorial (Python): handling long documents for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows how to ingest, split, retrieve, and answer questions over long documents in Haystack using a production-friendly Python pipeline. You need this when your source files are too large for a single prompt, or when you want better retrieval quality than stuffing raw text into an LLM.

What You'll Need

  • Python 3.10+
  • haystack-ai
  • openai package if you want to use OpenAI-backed generators and embedders
  • An OPENAI_API_KEY environment variable
  • A local text file or PDF-like content you can convert to plain text
  • Basic familiarity with Haystack Document, Pipeline, and retrievers

Install the packages:

pip install haystack-ai openai

Set your API key:

export OPENAI_API_KEY="your-key-here"

Step-by-Step

  1. Start by loading a long document as plain text and wrapping it in a Haystack Document. For real systems, this is usually the output of OCR, HTML extraction, or PDF parsing.
from haystack import Document

with open("policy_manual.txt", "r", encoding="utf-8") as f:
    text = f.read()

document = Document(content=text, meta={"source": "policy_manual.txt"})
print(len(document.content))
print(document.meta)
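
If your source is a PDF rather than plain text, a converter component can produce Document objects directly. A minimal sketch using PyPDFToDocument, assuming pypdf is installed (pip install pypdf) and a hypothetical policy_manual.pdf exists:

from haystack.components.converters import PyPDFToDocument

converter = PyPDFToDocument()
result = converter.run(sources=["policy_manual.pdf"])  # hypothetical file name
pdf_document = result["documents"][0]
print(pdf_document.meta)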
  2. Split the document into overlapping chunks before indexing. Long documents need chunking because embedding models and LLM context windows are limited, and overlap helps preserve meaning across chunk boundaries.
from haystack.components.preprocessors import DocumentSplitter

splitter = DocumentSplitter(
    split_by="word",
    split_length=250,
    split_overlap=50,
)

docs = splitter.run(documents=[document])["documents"]
print(f"Chunks: {len(docs)}")
print(docs[0].content[:300])
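
Word-based splitting is a solid default, but if your text has clean sentence boundaries, DocumentSplitter can also split by sentence. A quick variant, assuming the same document from step 1:

sentence_splitter = DocumentSplitter(
    split_by="sentence",
    split_length=10,   # sentences per chunk
    split_overlap=2,   # sentences shared between neighboring chunks
)
sentence_docs = sentence_splitter.run(documents=[document])["documents"]
print(f"Sentence-based chunks: {len(sentence_docs)}")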
  3. Embed the chunks and write them into an in-memory document store. This gives you semantic retrieval over the long document instead of brute-force prompt stuffing.
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
writer = DocumentWriter(document_store=document_store)

embedded_docs = embedder.run(documents=docs)["documents"]
writer.run(documents=embedded_docs)

print(document_store.count_documents())
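
In production you would usually wire splitting, embedding, and writing into a single indexing pipeline rather than calling each component by hand. A minimal sketch of that pattern, equivalent to steps 2 and 3 above (fresh component instances, since Haystack components can't be shared between pipelines):

from haystack import Pipeline
from haystack.document_stores.types import DuplicatePolicy

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("splitter", DocumentSplitter(split_by="word", split_length=250, split_overlap=50))
indexing_pipeline.add_component("embedder", OpenAIDocumentEmbedder(model="text-embedding-3-small"))
# OVERWRITE makes re-runs idempotent instead of failing on duplicate chunk ids
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))

indexing_pipeline.connect("splitter.documents", "embedder.documents")
indexing_pipeline.connect("embedder.documents", "writer.documents")

indexing_pipeline.run({"splitter": {"documents": [document]}})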
  4. Build a retrieval pipeline that turns a question into relevant chunks. For long documents, this is the core pattern: retrieve only the most relevant passages, then send those to the generator.
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack import Pipeline

query_embedder = OpenAITextEmbedder(model="text-embedding-3-small")
retriever = InMemoryEmbeddingRetriever(document_store=document_store)

retrieval_pipeline = Pipeline()
retrieval_pipeline.add_component("query_embedder", query_embedder)
retrieval_pipeline.add_component("retriever", retriever)

retrieval_pipeline.connect("query_embedder.embedding", "retriever.query_embedding")

result = retrieval_pipeline.run({
    "query_embedder": {"text": "What does the policy say about claims escalation?"}
})

for doc in result["retriever"]["documents"][:3]:
    print(doc.content[:200])
    print("---")
  5. Add a generator so the model answers using only retrieved context. OpenAIChatGenerator expects a list of chat messages rather than a plain string, so pair it with ChatPromptBuilder, which renders the template into a user message. This keeps responses grounded in the source document and avoids forcing a long file into the prompt directly.
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

template = [ChatMessage.from_user("""
Answer the question using only the provided documents.

Question: {{question}}

Documents:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Answer:
""")]

prompt_builder = ChatPromptBuilder(template=template)
generator = OpenAIChatGenerator(model="gpt-4o-mini")

qa_pipeline = Pipeline()
# Components can't be shared between pipelines, so the QA pipeline
# gets its own embedder and retriever instances.
qa_pipeline.add_component("query_embedder", OpenAITextEmbedder(model="text-embedding-3-small"))
qa_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
qa_pipeline.add_component("prompt_builder", prompt_builder)
qa_pipeline.add_component("llm", generator)

qa_pipeline.connect("query_embedder.embedding", "retriever.query_embedding")
qa_pipeline.connect("retriever.documents", "prompt_builder.documents")
qa_pipeline.connect("prompt_builder.prompt", "llm.messages")

question = "What does the policy say about claims escalation?"

response = qa_pipeline.run({
    "query_embedder": {"text": question},
    "prompt_builder": {"question": question},
})

print(response["llm"]["replies"][0].text)

Testing It

Run a few questions that require different sections of the document, not just one obvious paragraph. If chunking and retrieval are working, you should see different chunks surface for different queries.
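
A quick way to do that is to loop over a handful of questions and compare the answers side by side; a sketch, assuming the qa_pipeline from step 5 (the extra questions are hypothetical):

questions = [
    "What does the policy say about claims escalation?",
    "How long is the appeals window?",      # hypothetical
    "Who can approve coverage exceptions?", # hypothetical
]

for q in questions:
    response = qa_pipeline.run({
        "query_embedder": {"text": q},
        "prompt_builder": {"question": q},
    })
    print(q)
    print(response["llm"]["replies"][0].text)
    print("===")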

Check that answers quote or closely track the source material instead of hallucinating details. If they drift, reduce chunk size slightly, increase overlap, or retrieve more documents before generation.
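
Retrieval depth is adjustable at run time: InMemoryEmbeddingRetriever accepts top_k as a run-level input, so you can change how many chunks reach the generator without rebuilding the pipeline. For example:

response = qa_pipeline.run({
    "query_embedder": {"text": question},
    "retriever": {"top_k": 8},  # number of chunks handed to the generator
    "prompt_builder": {"question": question},
})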

Also verify that irrelevant questions produce weak or empty answers rather than confident nonsense. That usually means your retrieval layer is doing its job.

Next Steps

  • Add metadata filters so you can query by section, page number, or document type (a filter sketch follows this list).
  • Swap InMemoryDocumentStore for a persistent store like Elasticsearch or Qdrant.
  • Add reranking before generation if you need better precision on dense enterprise documents.
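
For the first item, Haystack 2.x retrievers accept a filters argument at run time. A minimal sketch against the retrieval pipeline from step 4, filtering on the source metadata set in step 1:

result = retrieval_pipeline.run({
    "query_embedder": {"text": "What does the policy say about claims escalation?"},
    "retriever": {
        "filters": {"field": "meta.source", "operator": "==", "value": "policy_manual.txt"},
    },
})
print(len(result["retriever"]["documents"]))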

By Cyprian Aarons, AI Consultant at Topiax.
