Haystack Tutorial (Python): handling long documents for advanced developers
This tutorial shows how to ingest, split, retrieve, and answer questions over long documents in Haystack using a production-friendly Python pipeline. You need this when your source files are too large for a single prompt, or when you want better retrieval quality than stuffing raw text into an LLM.
What You'll Need
- Python 3.10+
- The haystack-ai package
- The openai package if you want to use OpenAI-backed generators and embedders
- An OPENAI_API_KEY environment variable
- A local text file or PDF-like content you can convert to plain text
- Basic familiarity with Haystack Document, Pipeline, and retrievers
Install the packages:
pip install haystack-ai openai
Set your API key:
export OPENAI_API_KEY="your-key-here"
Step-by-Step
- Start by loading a long document as plain text and wrapping it in a Haystack Document. For real systems, this is usually the output of OCR, HTML extraction, or PDF parsing.
from haystack import Document
with open("policy_manual.txt", "r", encoding="utf-8") as f:
    text = f.read()
document = Document(content=text, meta={"source": "policy_manual.txt"})
print(len(document.content))
print(document.meta)
- Split the document into overlapping chunks before indexing. Long documents need chunking because embedding models and LLM context windows are limited, and overlap helps preserve cross-boundary meaning.
from haystack.components.preprocessors import DocumentSplitter
splitter = DocumentSplitter(
    split_by="word",
    split_length=250,
    split_overlap=50,
)
docs = splitter.run(documents=[document])["documents"]
print(f"Chunks: {len(docs)}")
print(docs[0].content[:300])
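To make the overlap behavior concrete, here is a minimal pure-Python sketch of word-level chunking with overlap. This is a simplified stand-in for what DocumentSplitter does, not its actual implementation; the function name and parameters are illustrative:

```python
def chunk_words(text: str, length: int = 250, overlap: int = 50) -> list[str]:
    """Split text into chunks of `length` words, each sharing `overlap`
    words with the previous chunk (simplified illustration)."""
    words = text.split()
    step = length - overlap  # each chunk advances by length minus overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + length]))
        if start + length >= len(words):
            break
    return chunks

# 600 words with length=250 and overlap=50 yields chunks starting at 0, 200, 400
sample = " ".join(f"w{i}" for i in range(600))
chunks = chunk_words(sample, length=250, overlap=50)
print(len(chunks))  # 3
```

The shared words at each boundary are why a sentence that straddles two chunks still appears intact in at least one of them.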
- Embed the chunks and write them into an in-memory document store. This gives you semantic retrieval over the long document instead of brute-force prompt stuffing.
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.components.writers import DocumentWriter
document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
writer = DocumentWriter(document_store=document_store)
embedded_docs = embedder.run(documents=docs)["documents"]
writer.run(embedded_docs)
print(document_store.count_documents())
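The embedding_similarity_function="cosine" setting means chunks are ranked by the cosine of the angle between the query vector and each chunk vector. A quick illustrative sketch with toy vectors (real embeddings have hundreds of dimensions, but the math is the same):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0: same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0: orthogonal, unrelated
```

Because cosine ignores vector magnitude, a long chunk and a short chunk about the same topic can still score similarly, which is usually what you want for retrieval.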
- Build a retrieval pipeline that turns a question into relevant chunks. For long documents, this is the core pattern: retrieve only the most relevant passages, then send those to the generator.
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack import Pipeline
query_embedder = OpenAITextEmbedder(model="text-embedding-3-small")
retriever = InMemoryEmbeddingRetriever(document_store=document_store)
retrieval_pipeline = Pipeline()
retrieval_pipeline.add_component("query_embedder", query_embedder)
retrieval_pipeline.add_component("retriever", retriever)
retrieval_pipeline.connect("query_embedder.embedding", "retriever.query_embedding")
result = retrieval_pipeline.run({
    "query_embedder": {"text": "What does the policy say about claims escalation?"}
})
for doc in result["retriever"]["documents"][:3]:
    print(doc.content[:200])
    print("---")
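The retriever returns a ranked list, and with long documents it is easy for the combined chunks to blow past your prompt budget. A hedged sketch of trimming retrieved chunks to a word budget before they reach the prompt (the function name and budget are illustrative, not a Haystack API):

```python
def fit_to_budget(chunks: list[str], max_words: int = 800) -> list[str]:
    """Keep retrieved chunks in ranked order until a word budget is hit,
    so the assembled prompt stays well under the model's context window."""
    kept, used = [], 0
    for chunk in chunks:
        n = len(chunk.split())
        if used + n > max_words:
            break  # stop at the first chunk that would exceed the budget
        kept.append(chunk)
        used += n
    return kept

# Three 300-word chunks against an 800-word budget: only the top two fit.
ranked = [("alpha " * 300).strip(), ("beta " * 300).strip(), ("gamma " * 300).strip()]
print(len(fit_to_budget(ranked, max_words=800)))  # 2
```

A word count is a rough proxy for tokens; for production use a real tokenizer for the model you are calling.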
- Add a generator so the model answers using only retrieved context. This keeps responses grounded in the source document and avoids forcing a long file into the prompt directly.
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
template = """
Answer the question using only the provided documents.
Question: {{question}}
Documents:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
Answer:
"""
# OpenAIChatGenerator expects a list of ChatMessage objects, so use
# ChatPromptBuilder (not the plain string-producing PromptBuilder).
prompt_builder = ChatPromptBuilder(template=[ChatMessage.from_user(template)])
generator = OpenAIChatGenerator(model="gpt-4o-mini")
# A component instance can only belong to one pipeline, so create fresh
# embedder and retriever instances instead of reusing the ones above.
qa_pipeline = Pipeline()
qa_pipeline.add_component("query_embedder", OpenAITextEmbedder(model="text-embedding-3-small"))
qa_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
qa_pipeline.add_component("prompt_builder", prompt_builder)
qa_pipeline.add_component("llm", generator)
qa_pipeline.connect("query_embedder.embedding", "retriever.query_embedding")
qa_pipeline.connect("retriever.documents", "prompt_builder.documents")
qa_pipeline.connect("prompt_builder.prompt", "llm.messages")
question = "What does the policy say about claims escalation?"
response = qa_pipeline.run({
    "query_embedder": {"text": question},
    "prompt_builder": {"question": question}
})
print(response["llm"]["replies"][0].text)
Testing It
Run a few questions that require different sections of the document, not just one obvious paragraph. If chunking and retrieval are working, you should see different chunks surface for different queries.
Check that answers quote or closely track the source material instead of hallucinating details. If they drift, reduce chunk size slightly, increase overlap, or retrieve more documents before generation.
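One rough way to check grounding automatically is word overlap between the answer and the retrieved context. This heuristic sketch is no substitute for real evaluation, but it flags obvious drift cheaply; the function name and threshold are illustrative:

```python
def support_ratio(answer: str, context: str) -> float:
    """Fraction of answer words that also appear in the retrieved context.
    Low values suggest the model may be drifting from the source."""
    strip_chars = ".,:;!?\"'"
    answer_words = {w.lower().strip(strip_chars) for w in answer.split()}
    context_words = {w.lower().strip(strip_chars) for w in context.split()}
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)

ctx = "Claims must be escalated to a senior adjuster within five business days."
print(support_ratio("Escalate claims to a senior adjuster.", ctx))
```

In practice you would also want stemming or an LLM-based faithfulness judge, since "escalate" and "escalated" count as different words here.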
Also verify that irrelevant questions produce weak or empty answers rather than confident nonsense. That usually means your retrieval layer is doing its job.
Next Steps
- Add metadata filters so you can query by section, page number, or document type.
- Swap InMemoryDocumentStore for a persistent store like Elasticsearch or Qdrant.
- Add reranking before generation if you need better precision on dense enterprise documents.
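As a starting point for metadata filters, Haystack 2.x retrievers accept a nested filter dict at query time; check the filtering documentation for your document store, since support varies. The meta fields below are illustrative (a "section" field would have to be attached to each Document at ingestion time):

```python
# Illustrative filter: restrict retrieval to one source file and a
# hypothetical "section" metadata field added during ingestion.
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.source", "operator": "==", "value": "policy_manual.txt"},
        {"field": "meta.section", "operator": "==", "value": "claims"},
    ],
}

# Passed to the retriever at query time, e.g.:
# retrieval_pipeline.run({
#     "query_embedder": {"text": question},
#     "retriever": {"filters": filters},
# })
print(len(filters["conditions"]))  # 2
```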
Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist + starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.