How to Integrate Haystack for investment banking with Elasticsearch for startups

By Cyprian Aarons | Updated 2026-04-21
Tags: haystack-for-investment-banking, elasticsearch, startups

Combining Haystack for investment banking with Elasticsearch gives you a practical retrieval layer for financial documents, market research, filings, and internal deal notes. In a startup AI agent system, this setup lets you answer analyst questions with grounded evidence instead of free-form guesses, while keeping search fast enough for interactive workflows.

Prerequisites

  • Python 3.10+
  • An Elasticsearch cluster running locally or in the cloud
  • API credentials for Elasticsearch
  • Haystack installed in your project
  • Access to the Haystack components you plan to use:
    • Document
    • InMemoryDocumentStore or an Elasticsearch-backed document store
    • Pipeline
    • retriever and generator components
  • A .env file or secret manager for credentials
  • Sample investment banking documents:
    • pitch decks
    • earnings summaries
    • company profiles
    • deal notes
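
Before connecting to anything, it can help to fail fast when credentials are missing. The following is a minimal sketch, assuming the three environment variable names used in the connection step of this guide; the helper name missing_settings is hypothetical.

```python
import os

# Variable names match the connection snippet later in this guide.
REQUIRED_VARS = (
    "ELASTICSEARCH_URL",
    "ELASTICSEARCH_USERNAME",
    "ELASTICSEARCH_PASSWORD",
)


def missing_settings(env=None, required=REQUIRED_VARS):
    """Return the names of required settings that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not str(env.get(name, "")).strip()]
```

Calling missing_settings() right after load_dotenv() and stopping when the returned list is non-empty gives a clearer error than a KeyError raised halfway through client setup.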

Integration Steps

  1. Install the dependencies.
pip install haystack-ai elasticsearch python-dotenv

If you are using an Elasticsearch-backed store in Haystack, install the matching integration package as well. For Haystack 2.x (the haystack-ai package) this is elasticsearch-haystack; confirm the name against the integration's documentation for your release line, and pin versions in requirements.txt before moving to production.
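
One way to produce those pins is to record the versions actually installed in the environment you tested. This is a small sketch using only the standard library; the package list mirrors the install command above, and the helper name pinned_requirements is hypothetical.

```python
from importlib import metadata

# Mirrors the pip install command above; extend with any integration package you add.
PACKAGES = ("haystack-ai", "elasticsearch", "python-dotenv")


def pinned_requirements(packages=PACKAGES):
    """Return 'name==version' lines for installed packages, skipping missing ones."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Not installed in this environment, so there is nothing to pin.
            continue
    return lines
```

Printing "\n".join(pinned_requirements()) yields lines you can paste into requirements.txt.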

  2. Connect to Elasticsearch and create a document store.
import os
from dotenv import load_dotenv

from elasticsearch import Elasticsearch
from haystack import Document

load_dotenv()

es_client = Elasticsearch(
    os.environ["ELASTICSEARCH_URL"],
    basic_auth=(
        os.environ["ELASTICSEARCH_USERNAME"],
        os.environ["ELASTICSEARCH_PASSWORD"],
    ),
)

# Example: create a raw index if you want to inspect data directly in Elasticsearch.
index_name = "investment-banking-docs"

if not es_client.indices.exists(index=index_name):
    es_client.indices.create(index=index_name)

print(es_client.info())

For Haystack, use the document store that matches your deployment. If your version includes an Elasticsearch-backed store, wire it to the same cluster so retrieval and indexing share one source of truth.
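
The raw index above is created without a mapping, so Elasticsearch will infer field types on first write. If you inspect or query that index directly, an explicit mapping can make filters on fields such as ticker behave predictably. The sketch below only builds the request body; the field layout (metadata stored as top-level keyword fields) and the helper name build_index_body are assumptions, and a Haystack-managed store may define its own mapping instead.

```python
def build_index_body(keyword_fields=("ticker", "type", "source")):
    """Build an index body with full-text content and exact-match metadata fields."""
    # 'text' fields are analyzed for full-text (BM25) search.
    properties = {"content": {"type": "text"}}
    for field in keyword_fields:
        # 'keyword' fields are stored verbatim, which suits exact filters.
        properties[field] = {"type": "keyword"}
    return {"mappings": {"properties": properties}}
```

With the 8.x Python client, the result could be passed as es_client.indices.create(index=index_name, mappings=build_index_body()["mappings"]).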

  3. Index investment banking documents into Haystack and push them into Elasticsearch.
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy

docs = [
    Document(
        content="Company A reported $120M revenue in FY2024 with 18% EBITDA margin.",
        meta={"ticker": "CMPA", "type": "earnings_note", "source": "internal"},
    ),
    Document(
        content="Comparable companies show EV/Revenue multiples between 4.2x and 6.1x.",
        meta={"ticker": "CMPA", "type": "valuation_note", "source": "research"},
    ),
]

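# Runnable default: an in-memory store, useful for local testing.
# For Elasticsearch, install the integration package (elasticsearch-haystack for
# Haystack 2.x), pass it your cluster credentials, and use its store instead:
# from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore
# document_store = ElasticsearchDocumentStore(hosts=os.environ["ELASTICSEARCH_URL"], index=index_name)
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

# OVERWRITE replaces any stored document that has the same ID.
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)
writer.run(documents=docs)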

If you are indexing directly through Elasticsearch instead of a Haystack store, keep the same document schema: content plus metadata fields like ticker, sector, deal stage, and source.
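
For the direct-indexing route, the records can be turned into bulk actions that keep this schema. The sketch below is one possible shape, not the Haystack store's own format: it assumes plain dict records with content and meta keys, flattens metadata into top-level fields, and derives a deterministic ID from the content so re-indexing the same text overwrites rather than duplicates. The helper name to_bulk_actions is hypothetical.

```python
import hashlib


def to_bulk_actions(records, index_name):
    """Yield one bulk action per record, with metadata flattened next to content."""
    for record in records:
        content = record["content"]
        # A content hash gives a stable ID across repeated indexing runs.
        doc_id = hashlib.sha256(content.encode("utf-8")).hexdigest()
        yield {
            "_index": index_name,
            "_id": doc_id,
            # 'content' is written last so a metadata key cannot overwrite it.
            "_source": {**record.get("meta", {}), "content": content},
        }
```

The generator can be handed to the client's bulk helper, for example bulk(es_client, to_bulk_actions(records, index_name)) after from elasticsearch.helpers import bulk.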

  4. Build a retrieval pipeline for analyst-style queries.
from haystack import Pipeline
from haystack.components.retrievers import InMemoryBM25Retriever

# If you're using an Elasticsearch-backed Haystack document store, swap
# InMemoryBM25Retriever for that store's own retriever (ElasticsearchBM25Retriever
# in the elasticsearch-haystack integration).
retriever = InMemoryBM25Retriever(document_store=document_store)

pipeline = Pipeline()
pipeline.add_component("retriever", retriever)

query = "What is Company A's valuation range based on comparable companies?"
result = pipeline.run(
    {
        "retriever": {
            "query": query,
            "top_k": 3,
        }
    }
)

for doc in result["retriever"]["documents"]:
    print(doc.content)
    print(doc.meta)

For startup AI agents, this is the key pattern: retrieve first, then generate from retrieved evidence. That keeps outputs aligned with source documents instead of hallucinated finance jargon.
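
Analyst questions are often scoped to a single ticker or document type, and metadata filters are the usual way to express that scope at retrieval time. The following is a sketch that builds a filter dictionary, assuming the Haystack 2.x filter syntax (field, operator, value conditions grouped under a logical operator) and metadata keys like those in the sample documents; the helper name meta_equals_filter is hypothetical.

```python
def meta_equals_filter(**criteria):
    """Build an AND filter requiring each metadata field to equal the given value."""
    if not criteria:
        # No criteria means no filtering at all.
        return None
    conditions = [
        {"field": f"meta.{name}", "operator": "==", "value": value}
        for name, value in criteria.items()
    ]
    return {"operator": "AND", "conditions": conditions}
```

The result would be passed as the retriever's filters input alongside query and top_k, for example "filters": meta_equals_filter(ticker="CMPA", type="valuation_note").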

  5. Add a generator step so the agent can answer using retrieved evidence.
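from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

prompt_template = """
You are an investment banking analyst assistant.
Answer only from the provided documents.

Question: {{ question }}

Documents:
{% for doc in documents %}
- {{ doc.content }}
{% endfor %}

Answer:
"""

# OpenAIChatGenerator takes a list of ChatMessage objects, so the prompt is built
# with ChatPromptBuilder rather than the string-based PromptBuilder.
prompt_builder = ChatPromptBuilder(template=[ChatMessage.from_user(prompt_template)])
# The generator reads the API key from the OPENAI_API_KEY environment variable by default.
generator = OpenAIChatGenerator(model="gpt-4o-mini")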

rag_pipeline = Pipeline()
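# A component instance can belong to only one pipeline, so create a fresh retriever here.
rag_pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))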
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("generator", generator)

rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "generator.messages")

response = rag_pipeline.run(
    {
        "retriever": {"query": query, "top_k": 3},
        "prompt_builder": {"question": query},
    }
)

# Recent Haystack 2.x releases expose the reply text as .text (older releases used .content).
print(response["generator"]["replies"][0].text)

This gives you a clean RAG loop for banking workflows: search Elasticsearch for relevant docs, pass them through Haystack, then generate a controlled answer.
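
To make a controlled answer traceable, each retrieved passage can be labeled with its origin before it reaches the prompt. The sketch below works on plain dicts with content and meta keys (an assumption made for illustration; with Haystack Document objects the same labels could be rendered in the prompt template from doc.meta). The helper name format_context is hypothetical.

```python
def format_context(passages):
    """Render passages as numbered lines tagged with their type and source."""
    lines = []
    for number, passage in enumerate(passages, start=1):
        meta = passage.get("meta", {})
        # Fall back to 'unknown' so a missing key never hides the passage itself.
        label = f"[{number}] ({meta.get('type', 'unknown')}, {meta.get('source', 'unknown')})"
        lines.append(f"{label} {passage['content']}")
    return "\n".join(lines)
```

Asking the model to cite the bracketed numbers gives reviewers a direct path from each claim back to a document.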

Testing the Integration

Use one known query and one known document set. The test should confirm both retrieval and answer generation are grounded in indexed content.

test_query = "What EBITDA margin did Company A report?"

test_result = rag_pipeline.run(
    {
        "retriever": {"query": test_query, "top_k": 2},
        "prompt_builder": {"question": test_query},
    }
)

# Recent Haystack 2.x releases expose the reply text as .text (older releases used .content).
answer = test_result["generator"]["replies"][0].text
print(answer)

Expected output (exact wording will vary with the model):

Company A reported an EBITDA margin of 18% in FY2024.

If retrieval is working but the answer is vague, check these first:

  • Your documents contain the exact financial facts you expect to retrieve
  • Metadata filters are not excluding relevant records
  • The retriever is connected to the correct document store or index
  • Your prompt instructs the model to answer only from retrieved context
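
Alongside this checklist, a rough automated check can flag answers that cite figures not present in the retrieved text. This is a heuristic sketch, not a full groundedness evaluation: it compares numeric tokens only (such as 18%, $120M, or 4.2x), and the regular expression and helper names are assumptions chosen for the sample documents in this guide.

```python
import re

# Matches figures like 18%, $120M, 4.2x, or 2024 (optional $, decimal part, unit suffix).
_FIGURE = re.compile(r"\$?\d+(?:\.\d+)?(?:%|x|[MBK])?")


def extract_figures(text):
    """Return the set of numeric tokens in text, with any leading '$' removed."""
    return {match.group(0).lstrip("$") for match in _FIGURE.finditer(text)}


def ungrounded_figures(answer, contexts):
    """Return figures in the answer that appear in none of the context strings."""
    supported = set()
    for context in contexts:
        supported |= extract_figures(context)
    return sorted(extract_figures(answer) - supported)
```

An empty list means every figure in the answer was found in the retrieved passages; any returned value is worth a manual look before the answer reaches a banker.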

Real-World Use Cases

  • Deal team Q&A bot that answers questions from CIMs, diligence notes, and market comps stored in Elasticsearch.
  • Earnings call assistant that pulls transcript snippets and summarizes guidance changes for bankers.
  • Internal research copilot that searches sector reports, valuation notes, and client memos with low-latency retrieval.

By Cyprian Aarons, AI Consultant at Topiax.
