How to Integrate Haystack for investment banking with Elasticsearch for RAG
Haystack for investment banking gives you the retrieval and orchestration layer for agentic workflows. Elasticsearch gives you durable, low-latency search over filings, transcripts, research notes, and internal deal docs. Put them together and you get a RAG system that can answer banker-grade questions with traceable context instead of guessing.
Prerequisites
- Python 3.10+
- An Elasticsearch cluster running locally or in Elastic Cloud
- API credentials for Elasticsearch
- Haystack installed with the Elasticsearch integration package
- Access to your investment banking document corpus:
  - SEC filings
  - earnings call transcripts
  - pitch books
  - internal research notes
- Basic familiarity with embeddings and RAG pipelines
Install the packages:
```shell
pip install haystack-ai elasticsearch-haystack elasticsearch sentence-transformers
```
Integration Steps
1. Connect to Elasticsearch and create a document store

Start by initializing Haystack's ElasticsearchDocumentStore, which manages its own Elasticsearch connection. This is where your chunks, metadata, and embeddings will live. A raw Elasticsearch client is only needed if you want to verify connectivity first.

```python
from elasticsearch import Elasticsearch
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

# Optional: verify the cluster is reachable before wiring up Haystack.
es_client = Elasticsearch(
    "https://localhost:9200",
    basic_auth=("elastic", "your_password"),
    verify_certs=False,  # local dev only; use proper certs in production
)
print(es_client.info()["version"]["number"])

document_store = ElasticsearchDocumentStore(
    hosts="https://localhost:9200",
    basic_auth=("elastic", "your_password"),
    verify_certs=False,  # local dev only
    index="investment_banking_docs",
)
```
2. Load and write banking documents into the store

In investment banking, raw documents are too large to retrieve directly. Chunk them first, embed each chunk, then write them into Elasticsearch through Haystack's DocumentWriter. Without embeddings on the stored documents, the embedding retriever in the next steps has nothing to match against.

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter

docs = [
    Document(
        content="Company A reported revenue growth of 18% YoY driven by cloud subscriptions.",
        meta={"source": "earnings_call", "ticker": "COMPANYA", "year": 2024},
    ),
    Document(
        content="The merger agreement includes a termination fee of $250 million.",
        meta={"source": "deal_doc", "deal_type": "M&A", "counterparty": "TargetCo"},
    ),
]

# Embed documents with the same model you will use for queries.
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
doc_embedder.warm_up()
docs_with_embeddings = doc_embedder.run(documents=docs)["documents"]

writer = DocumentWriter(document_store=document_store)
writer.run(documents=docs_with_embeddings)
```
If you already have PDFs or text files, add a converter and splitter before writing. The pattern stays the same: normalize, chunk, enrich metadata, then index.
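If you are building that chunking step yourself before adopting Haystack's splitter components, the core pattern is small enough to sketch in plain Python. The window and overlap sizes below are illustrative assumptions, not tuned recommendations:

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into overlapping word windows.

    Overlap keeps sentences that straddle a boundary retrievable
    from at least one chunk.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + chunk_size])
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(words):
            break
    return chunks

# Each chunk carries the parent document's metadata forward so that
# ticker/date/source filters still work after splitting.
doc_meta = {"source": "earnings_call", "ticker": "COMPANYA", "year": 2024}
chunks = chunk_text("revenue " * 500, chunk_size=200, overlap=40)
records = [{"content": c, "meta": doc_meta} for c in chunks]
```

The important design point is the last two lines: metadata enrichment happens per chunk, not per document, which is what makes scoped retrieval possible later.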
3. Add an embedding model for semantic retrieval

For RAG, keyword search alone is not enough. Use a sentence embedding model so Haystack can retrieve semantically relevant passages from earnings calls or deal docs.

```python
from haystack.components.embedders import SentenceTransformersTextEmbedder

text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
text_embedder.warm_up()  # required when running the component outside a pipeline

query_embedding = text_embedder.run(text="What drove revenue growth in Company A?")
print(query_embedding["embedding"][:5])
```

In production, use the same embedding model for both indexing and querying. If those vectors drift, retrieval quality drops fast.
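To see why this matters, remember that retrieval scores are just vector similarities. Vectors from two different models either live in incompatible geometries or fail outright on dimension. A toy cosine similarity (not Haystack's internals, which delegate scoring to Elasticsearch) makes the failure mode concrete:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embeddings; fails loudly on a dim mismatch."""
    if len(a) != len(b):
        raise ValueError(f"Embedding dims differ: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same model: vectors are comparable.
print(cosine([0.1, 0.9, 0.2], [0.1, 0.8, 0.3]))

# Different models (e.g. a 384-dim index queried with a 768-dim model):
# not comparable at all.
try:
    cosine([0.1] * 384, [0.1] * 768)
except ValueError as e:
    print(e)
```

The subtler case is two models with the *same* dimension: the call succeeds but the scores are meaningless, which is why pinning one model for both indexing and querying is the safer contract.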
4. Build a retrieval pipeline with Haystack + Elasticsearch

Now connect query embedding to Elasticsearch retrieval. Haystack's InMemoryEmbeddingRetriever is not what you want here; use the ElasticsearchEmbeddingRetriever backed by the ElasticsearchDocumentStore.

```python
from haystack import Pipeline
from haystack_integrations.components.retrievers.elasticsearch import ElasticsearchEmbeddingRetriever

retriever = ElasticsearchEmbeddingRetriever(document_store=document_store)

rag_pipeline = Pipeline()
rag_pipeline.add_component("query_embedder", text_embedder)
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.connect("query_embedder.embedding", "retriever.query_embedding")

result = rag_pipeline.run(
    {
        "query_embedder": {"text": "What drove revenue growth in Company A?"},
        "retriever": {"top_k": 3},
    }
)
for doc in result["retriever"]["documents"]:
    print(doc.content)
```
This gives you ranked context from Elasticsearch using vector similarity. For banking use cases, keep metadata filters ready so analysts can scope by ticker, date range, sector, or document type.
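Haystack 2.x expresses those scoping filters as a nested dict passed to the retriever at query time. A sketch using the metadata fields from the example documents above (the field names are assumptions carried over from that example):

```python
# Scope retrieval to one ticker and recent documents.
# Fields are addressed under "meta." because document metadata lives there.
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.ticker", "operator": "==", "value": "COMPANYA"},
        {"field": "meta.year", "operator": ">=", "value": 2023},
    ],
}

# Passed alongside top_k at query time, e.g.:
# rag_pipeline.run({
#     "query_embedder": {"text": query},
#     "retriever": {"top_k": 3, "filters": filters},
# })
print(filters["operator"], len(filters["conditions"]))
```

Building these dicts from UI inputs (ticker picker, date range) is usually all the "scoping" layer an analyst-facing tool needs.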
5. Add generation on top of retrieved context

Retrieval alone is not RAG. Pass the retrieved documents into an LLM prompt so the agent answers using evidence from your indexed corpus. Note that a Haystack component instance can only belong to one pipeline, so the QA pipeline gets its own embedder and retriever.

```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack_integrations.components.retrievers.elasticsearch import ElasticsearchEmbeddingRetriever

template = """
Answer the question using only the provided context.

Question: {{ question }}

Context:
{% for doc in documents %}
- {{ doc.content }}
{% endfor %}

Answer:
"""

prompt_builder = PromptBuilder(template=template)
generator = OpenAIGenerator(model="gpt-4o-mini")  # requires OPENAI_API_KEY

qa_pipeline = Pipeline()
qa_pipeline.add_component(
    "query_embedder",
    SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"),
)
qa_pipeline.add_component(
    "retriever", ElasticsearchEmbeddingRetriever(document_store=document_store)
)
qa_pipeline.add_component("prompt_builder", prompt_builder)
qa_pipeline.add_component("generator", generator)

qa_pipeline.connect("query_embedder.embedding", "retriever.query_embedding")
qa_pipeline.connect("retriever.documents", "prompt_builder.documents")
qa_pipeline.connect("prompt_builder.prompt", "generator.prompt")

response = qa_pipeline.run(
    {
        "query_embedder": {"text": "What drove revenue growth in Company A?"},
        "retriever": {"top_k": 3},
        "prompt_builder": {"question": "What drove revenue growth in Company A?"},
    }
)
print(response["generator"]["replies"][0])
```
Testing the Integration
Run a simple smoke test against one known document and one known question. You want to confirm three things:
- Elasticsearch stored the document
- Haystack retrieved it semantically
- The final answer cites the right context
```python
test_query = "What was the termination fee in the merger agreement?"

result = qa_pipeline.run(
    {
        "query_embedder": {"text": test_query},
        "retriever": {"top_k": 1},
        "prompt_builder": {"question": test_query},
    }
)

answer = result["generator"]["replies"][0]
print(answer)
```

Expected output:

```
The merger agreement includes a termination fee of $250 million.
```
If you get irrelevant results, check these first:
- Your chunk size is too large or too small
- Metadata filters are excluding valid docs
- Your embedding model does not match your corpus language/style
- The index was written without embeddings or vector fields enabled
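For the chunk-size item, a quick audit of what you actually indexed often surfaces the problem faster than re-reading config. A rough word-count heuristic (the thresholds here are illustrative assumptions, not tuned values):

```python
def audit_chunk_sizes(chunks, min_words=40, max_words=300):
    """Count chunks that are likely too small (too little context to embed)
    or too large (one vector diluted across many topics)."""
    report = {"too_small": 0, "ok": 0, "too_large": 0}
    for chunk in chunks:
        n = len(chunk.split())
        if n < min_words:
            report["too_small"] += 1
        elif n > max_words:
            report["too_large"] += 1
        else:
            report["ok"] += 1
    return report

sample = ["tiny chunk", "word " * 100, "word " * 500]
print(audit_chunk_sizes(sample))
```

If a large share of chunks falls outside the middle band, fix the splitter settings and re-index before touching anything else in the pipeline.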
Real-World Use Cases
- Deal desk copilot: retrieve comparable transactions, precedent terms, and internal notes during live deal review.
- Earnings intelligence assistant: answer questions about guidance changes, margin pressure, capex plans, and segment performance from transcripts and filings.
- Research knowledge base: let analysts query internal reports with citations back to source documents stored in Elasticsearch.
This setup is stable because each layer does one job well. Haystack handles orchestration and retrieval logic; Elasticsearch handles indexing and search at scale.
Keep learning
- The complete AI Agents Roadmap (my full 8-step breakdown)
- Free: The AI Agent Starter Kit (PDF checklist + starter code)
- Work with me (I build AI for banks and insurance companies)
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit