How to Integrate Haystack for investment banking with Elasticsearch for startups
Combining Haystack for investment banking with Elasticsearch gives you a practical retrieval layer for financial documents, market research, filings, and internal deal notes. In a startup AI agent system, this setup lets you answer analyst questions with grounded evidence instead of free-form guesses, while keeping search fast enough for interactive workflows.
Prerequisites
- •Python 3.10+
- •An Elasticsearch cluster running locally or in the cloud
- •API credentials for Elasticsearch
- •Haystack installed in your project
- •Access to the Haystack components you plan to use:
  - •Document
  - •InMemoryDocumentStore or an Elasticsearch-backed document store
  - •Pipeline
  - •retriever and generator components
- •A .env file or secret manager for credentials
- •Sample investment banking documents:
  - •pitch decks
  - •earnings summaries
  - •company profiles
  - •deal notes
Integration Steps
- •Install the dependencies.
pip install haystack-ai elasticsearch python-dotenv
If you are using an Elasticsearch-backed store in Haystack, install the matching integration package as well. For Haystack 2.x (the haystack-ai package) that is elasticsearch-haystack; other release lines use different package names. Pin versions in requirements.txt before moving to production.
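One way to produce those pins is to read the versions already installed in your environment. The sketch below uses only the standard library; the helper name pinned_requirements is hypothetical, and the package list is simply the one from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(packages):
    """Return 'name==version' lines for installed packages, skipping missing ones."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Not installed in this environment, so there is nothing to pin.
            continue
    return lines


if __name__ == "__main__":
    for line in pinned_requirements(["haystack-ai", "elasticsearch", "python-dotenv"]):
        print(line)
```

Paste the printed lines into requirements.txt and review them before committing, since they reflect whatever happens to be installed locally.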
- •Connect to Elasticsearch and create a document store.
import os

from dotenv import load_dotenv
from elasticsearch import Elasticsearch
from haystack import Document

load_dotenv()

es_client = Elasticsearch(
    os.environ["ELASTICSEARCH_URL"],
    basic_auth=(
        os.environ["ELASTICSEARCH_USERNAME"],
        os.environ["ELASTICSEARCH_PASSWORD"],
    ),
)

# Example: create a raw index if you want to inspect data directly in Elasticsearch.
index_name = "investment-banking-docs"
if not es_client.indices.exists(index=index_name):
    es_client.indices.create(index=index_name)

print(es_client.info())
For Haystack, use the document store that matches your deployment. If your version includes an Elasticsearch-backed store, wire it to the same cluster so retrieval and indexing share one source of truth.
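As a rough sketch of that wiring, the helper below builds the store's connection settings from the same environment variables used for the raw client. Several things here are assumptions to verify against your installed version: the import path haystack_integrations.document_stores.elasticsearch, the hosts and index parameters, and that extra keyword arguments such as basic_auth are forwarded to the Elasticsearch client. ELASTICSEARCH_INDEX is a hypothetical variable introduced for illustration.

```python
def es_store_kwargs(env):
    """Build keyword arguments for an Elasticsearch-backed document store."""
    return {
        "hosts": env["ELASTICSEARCH_URL"],
        "basic_auth": (env["ELASTICSEARCH_USERNAME"], env["ELASTICSEARCH_PASSWORD"]),
        # Hypothetical override; falls back to the index name used in this guide.
        "index": env.get("ELASTICSEARCH_INDEX", "investment-banking-docs"),
    }


def build_document_store(env):
    # Assumption: the elasticsearch-haystack integration is installed and exposes
    # ElasticsearchDocumentStore at this path. Imported lazily so the helper above
    # can be used and tested without the package.
    from haystack_integrations.document_stores.elasticsearch import (
        ElasticsearchDocumentStore,
    )

    return ElasticsearchDocumentStore(**es_store_kwargs(env))
```

Calling build_document_store(os.environ) after load_dotenv() would then give you a store pointing at the same cluster as es_client.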
- •Index investment banking documents into Haystack and push them into Elasticsearch.
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

docs = [
    Document(
        content="Company A reported $120M revenue in FY2024 with 18% EBITDA margin.",
        meta={"ticker": "CMPA", "type": "earnings_note", "source": "internal"},
    ),
    Document(
        content="Comparable companies show EV/Revenue multiples between 4.2x and 6.1x.",
        meta={"ticker": "CMPA", "type": "valuation_note", "source": "research"},
    ),
]

# InMemoryDocumentStore keeps this example runnable without extra packages.
# To persist documents in Elasticsearch, replace it with your Elasticsearch-backed
# document store class, for example:
# document_store = ElasticsearchDocumentStore(...)
document_store = InMemoryDocumentStore()

# OVERWRITE replaces any stored document that has the same ID.
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)
writer.run(documents=docs)
If you are indexing directly through Elasticsearch instead of a Haystack store, keep the same document schema: content plus metadata fields like ticker, sector, deal stage, and source.
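If you take the direct route, a small helper can enforce that schema before anything reaches the cluster. This is a minimal sketch: the to_bulk_action name, the REQUIRED_META tuple, and the choice to store metadata fields next to content in a flat document are all assumptions made for illustration. The resulting dictionaries follow the action format accepted by elasticsearch.helpers.bulk.

```python
# Hypothetical minimum set of metadata fields every record must carry.
REQUIRED_META = ("ticker", "source")


def to_bulk_action(index_name, content, meta):
    """Build one bulk-index action with content plus flattened metadata."""
    missing = [key for key in REQUIRED_META if key not in meta]
    if missing:
        raise ValueError(f"missing metadata fields: {missing}")
    return {"_index": index_name, "_source": {"content": content, **meta}}


actions = [
    to_bulk_action(
        "investment-banking-docs",
        "Comparable companies show EV/Revenue multiples between 4.2x and 6.1x.",
        {"ticker": "CMPA", "type": "valuation_note", "source": "research"},
    )
]
# With a live cluster: elasticsearch.helpers.bulk(es_client, actions)
```

Rejecting records with missing fields at this stage is cheaper than discovering later that a filter on ticker silently skips them.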
- •Build a retrieval pipeline for analyst-style queries.
from haystack import Pipeline
from haystack.components.retrievers import InMemoryBM25Retriever

# If you're using an Elasticsearch-backed Haystack document store,
# swap InMemoryBM25Retriever for the retriever supported by that store.
retriever = InMemoryBM25Retriever(document_store=document_store)

pipeline = Pipeline()
pipeline.add_component("retriever", retriever)

query = "What is Company A's valuation range based on comparable companies?"
result = pipeline.run(
    {
        "retriever": {
            "query": query,
            "top_k": 3,
        }
    }
)

for doc in result["retriever"]["documents"]:
    print(doc.content)
    print(doc.meta)
For startup AI agents, this is the key pattern: retrieve first, then generate from retrieved evidence. That keeps outputs aligned with source documents instead of hallucinated finance jargon.
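Analyst queries are usually scoped to one company or document type, so it can help to narrow retrieval with metadata filters. The sketch below builds a filter dictionary in the Haystack 2.x filter syntax (field, operator, value, combined under AND). The helper name meta_filter is hypothetical, and whether your retriever accepts a filters input should be checked against its documentation.

```python
def meta_filter(**equals):
    """Build an AND filter that matches documents whose metadata equals the given values."""
    if not equals:
        # No constraints means no filter at all.
        return None
    conditions = [
        {"field": f"meta.{key}", "operator": "==", "value": value}
        for key, value in sorted(equals.items())
    ]
    return {"operator": "AND", "conditions": conditions}


filters = meta_filter(ticker="CMPA", type="valuation_note")
# Possible usage with the pipeline above:
# pipeline.run({"retriever": {"query": query, "top_k": 3, "filters": filters}})
```

Start with the broadest filter that still answers the question; an overly narrow filter is a common reason for empty retrieval results.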
- •Add a generator step so the agent can answer using retrieved evidence.
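from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

prompt_template = """
You are an investment banking analyst assistant.
Answer only from the provided documents.

Question: {{ question }}

Documents:
{% for doc in documents %}
- {{ doc.content }}
{% endfor %}

Answer:
"""

# OpenAIChatGenerator takes a list of ChatMessage objects, so build a chat prompt.
prompt_builder = ChatPromptBuilder(template=[ChatMessage.from_user(prompt_template)])
# The API key is read from the OPENAI_API_KEY environment variable by default.
generator = OpenAIChatGenerator(model="gpt-4o-mini")

# A component instance can belong to only one pipeline, so create a new retriever here.
rag_retriever = InMemoryBM25Retriever(document_store=document_store)

rag_pipeline = Pipeline()
rag_pipeline.add_component("retriever", rag_retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("generator", generator)

rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "generator.messages")

response = rag_pipeline.run(
    {
        "retriever": {"query": query, "top_k": 3},
        "prompt_builder": {"question": query},
    }
)

# Replies are ChatMessage objects; older Haystack releases expose the text as .content.
print(response["generator"]["replies"][0].text)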
This gives you a clean RAG loop for banking workflows: search Elasticsearch for relevant docs, pass them through Haystack, then generate a controlled answer.
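Because the same question has to reach both the retriever and the prompt builder, it is easy for the two to drift apart when the run payload is written by hand. A small helper, sketched here under the assumption that your components are registered as "retriever" and "prompt_builder" as in this guide, keeps them in sync. The function name build_rag_inputs is hypothetical.

```python
def build_rag_inputs(question, top_k=3):
    """Return the run payload that sends one question to both pipeline entry points."""
    return {
        "retriever": {"query": question, "top_k": top_k},
        "prompt_builder": {"question": question},
    }


inputs = build_rag_inputs("What EBITDA margin did Company A report?", top_k=2)
# Possible usage: response = rag_pipeline.run(inputs)
```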
Testing the Integration
Use one known query and one known document set. The test should confirm both retrieval and answer generation are grounded in indexed content.
test_query = "What EBITDA margin did Company A report?"
test_result = rag_pipeline.run(
    {
        "retriever": {"query": test_query, "top_k": 2},
        "prompt_builder": {"question": test_query},
    }
)

# Replies are ChatMessage objects; older Haystack releases expose the text as .content.
answer = test_result["generator"]["replies"][0].text
print(answer)
Expected output (the exact wording will vary with the model):
Company A reported an EBITDA margin of 18% in FY2024.
If retrieval is working but the answer is vague, check these first:
- •Your documents contain the exact financial facts you expect to retrieve
- •Metadata filters are not excluding relevant records
- •The retriever is connected to the correct document store or index
- •Your prompt instructs the model to answer only from retrieved context
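The checklist above can be turned into a quick triage step that tells you whether a bad answer comes from retrieval or from generation. This is a minimal sketch using plain substring matching; the triage function and its return labels are hypothetical, and a real check would likely need looser matching for paraphrased figures.

```python
def triage(retrieved_contents, answer, fact):
    """Report which stage lost the expected fact: 'retrieval', 'generation', or 'ok'."""
    if not any(fact in content for content in retrieved_contents):
        # The fact never reached the prompt: check indexing, filters, and the store.
        return "retrieval"
    if fact not in answer:
        # The fact was retrieved but the model did not use it: check the prompt.
        return "generation"
    return "ok"


retrieved = ["Company A reported $120M revenue in FY2024 with 18% EBITDA margin."]
status = triage(retrieved, "Company A reported an EBITDA margin of 18% in FY2024.", "18%")
```

In practice you would pass the content of the retriever's documents and the generator's reply text from one pipeline run.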
Real-World Use Cases
- •Deal team Q&A bot that answers questions from CIMs, diligence notes, and market comps stored in Elasticsearch.
- •Earnings call assistant that pulls transcript snippets and summarizes guidance changes for bankers.
- •Internal research copilot that searches sector reports, valuation notes, and client memos with low-latency retrieval.
By Cyprian Aarons, AI Consultant at Topiax.