How to Integrate Haystack for investment banking with Elasticsearch for AI agents

By Cyprian Aarons · Updated 2026-04-21
Tags: haystack-for-investment-banking, elasticsearch, ai-agents

Haystack for investment banking gives you the retrieval and pipeline layer for financial workflows. Elasticsearch gives you fast, indexed search over filings, research notes, term sheets, and market data. Put them together and your AI agent can answer banker-grade questions with grounded context instead of guessing.

Prerequisites

  • Python 3.10+
  • An Elasticsearch cluster running locally or in Elastic Cloud
  • An API key or username/password for Elasticsearch
  • Haystack installed with the Elasticsearch integration package
  • Access to your investment banking documents:
    • PDFs
    • DOCX files
    • CSVs
    • analyst notes
    • deal memos
  • Optional but useful:
    • OpenAI or another LLM provider for generation
    • A clean document chunking strategy for long filings

Install the packages:

pip install haystack-ai elasticsearch-haystack elasticsearch

Integration Steps

  1. Set up the Elasticsearch connection.

Start by connecting Haystack to your Elasticsearch index. In production, use a dedicated index per domain: banking-research, deal-docs, or market-intel.

from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(
    hosts="http://localhost:9200",
    index="investment-banking-docs",
    embedding_similarity_function="cosine"
)

If your cluster requires authentication, pass credentials through the client config supported by the integration package.

from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(
    hosts=["https://my-es-cluster:9243"],
    basic_auth=("elastic", "your-password"),
    index="investment-banking-docs",
)

  2. Load and write banking documents into Elasticsearch.

For investment banking use cases, you usually want to chunk long documents before indexing. Haystack’s Document model keeps metadata attached, which is critical for source traceability.

from haystack import Document

docs = [
    Document(
        content="Company X reported revenue growth of 18% YoY driven by enterprise SaaS expansion.",
        meta={
            "source": "Q4_earnings_note.pdf",
            "ticker": "COMPX",
            "doc_type": "earnings_note"
        }
    ),
    Document(
        content="The merger agreement includes a reverse termination fee of $120M.",
        meta={
            "source": "deal_memo.docx",
            "doc_type": "deal_memo",
            "deal_id": "MNA-2025-014"
        }
    )
]

document_store.write_documents(docs)

If you already have parsed chunks from PDFs or filings, write those directly. Keep metadata consistent so your agent can filter by ticker, deal ID, or document type later.
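If you still need to produce chunks, Haystack ships preprocessing components for this, but the idea is simple enough to sketch in plain Python: split long text into overlapping word windows and copy the source metadata onto every chunk so traceability survives chunking. The `chunk_text` helper below is illustrative, not part of Haystack:

```python
def chunk_text(text, source, max_words=200, overlap=20, **extra_meta):
    """Split text into overlapping word windows, attaching metadata to each chunk."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append({
            "content": " ".join(window),
            "meta": {"source": source, "chunk_id": len(chunks), **extra_meta},
        })
        if start + max_words >= len(words):
            break
    return chunks

chunks = chunk_text(" ".join(["token"] * 500), "10-K.pdf", ticker="COMPX")
print(len(chunks))  # 3 overlapping chunks for a 500-word document
```

The overlap means a sentence falling on a chunk boundary still appears whole in at least one chunk, which matters for retrieval quality on dense filings.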

  3. Add embeddings so semantic retrieval works.

Elasticsearch can do vector search, but you still need embeddings. Use a Haystack embedder component to convert chunks into vectors before indexing.

from haystack.components.embedders import SentenceTransformersDocumentEmbedder

embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
embedder.warm_up()

embedded_docs = embedder.run(docs)["documents"]
document_store.write_documents(embedded_docs)

For query-time retrieval, use the matching query embedder:

from haystack.components.embedders import SentenceTransformersTextEmbedder

query_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
query_embedder.warm_up()

  4. Build a retrieval pipeline for the agent.

This is the part your AI agent will call. The retriever pulls relevant context from Elasticsearch using semantic similarity plus filters.

from haystack import Pipeline
from haystack_integrations.components.retrievers.elasticsearch import ElasticsearchEmbeddingRetriever

retriever = ElasticsearchEmbeddingRetriever(document_store=document_store)

pipe = Pipeline()
pipe.add_component("query_embedder", query_embedder)
pipe.add_component("retriever", retriever)

pipe.connect("query_embedder.embedding", "retriever.query_embedding")

result = pipe.run({
    "query_embedder": {"text": "What was Company X's revenue growth and main driver?"},
})

If you want stricter banker workflows, filter on metadata:

result = retriever.run(
    query_embedding=query_embedder.run(text="Summarize deal risks")["embedding"],
    filters={"field": "meta.doc_type", "operator": "==", "value": "deal_memo"}
)

  5. Attach retrieval to an AI agent response flow.

Once retrieval works, pass the top chunks into your generator component. This gives you grounded answers with citations from indexed banking docs.

from haystack.components.builders import PromptBuilder

prompt_builder = PromptBuilder(
    template="""
Answer using only the provided context.

Context:
{% for doc in documents %}
- {{ doc.content }} (source: {{ doc.meta.source }})
{% endfor %}

Question: {{ question }}
"""
)

prompt = prompt_builder.run(
    question="What are the key terms in the merger agreement?",
    documents=result["retriever"]["documents"]
)["prompt"]

print(prompt)

In a real agent, this prompt goes to your LLM client after retrieval. The important part is that Elasticsearch handles fast document lookup while Haystack manages orchestration.
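For the citation side, one common pattern is to deduplicate the `source` metadata of the retrieved chunks and append it to the generated answer. A minimal sketch; `format_citations` is a hypothetical helper, written to accept either Haystack `Document` objects (which expose a `.meta` dict) or plain dicts:

```python
def format_citations(documents):
    """Collect unique 'source' values from retrieved docs, preserving rank order."""
    sources = []
    for doc in documents:
        meta = doc.meta if hasattr(doc, "meta") else doc.get("meta", {})
        src = meta.get("source", "unknown")
        if src not in sources:
            sources.append(src)
    return "Sources: " + "; ".join(sources)
```

Appending this line to the LLM's answer gives reviewers a direct trail back to the indexed filings and memos, which is usually a hard requirement in banking workflows.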

Testing the Integration

Run a simple end-to-end check: write one document, retrieve it with a query, and confirm the right text comes back.

from haystack import Document
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack_integrations.components.retrievers.elasticsearch import ElasticsearchEmbeddingRetriever

store = ElasticsearchDocumentStore(
    hosts="http://localhost:9200",
    index="ib-test-index",
)

docs = [Document(content="Net debt increased due to acquisition financing.", meta={"source": "test_note"})]

doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder.warm_up()
store.write_documents(doc_embedder.run(docs)["documents"])

query_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
query_embedder.warm_up()

retriever = ElasticsearchEmbeddingRetriever(document_store=store)

query_embedding = query_embedder.run("Why did net debt increase?")["embedding"]
hits = retriever.run(query_embedding=query_embedding, top_k=3)["documents"]

for doc in hits:
    print(doc.content)
    print(doc.meta)

Expected output:

Net debt increased due to acquisition financing.
{'source': 'test_note'}

If you get no hits:

  • confirm embeddings were written to the same index
  • check that the index mapping's embedding dimension matches your model (384 for all-MiniLM-L6-v2)
  • verify your query and document embedders use the same model family

Real-World Use Cases

  • Deal support assistant

    • Retrieve merger terms, diligence notes, and risk flags from indexed deal rooms.
    • Useful for analysts preparing IC memos or Q&A responses.
  • Earnings and market intelligence bot

    • Search earnings transcripts, sell-side notes, and company filings.
    • Your agent can answer questions like “What changed in guidance?” with source-backed context.
  • Compliance-aware research assistant

    • Filter by document type, date range, or desk.
    • Keep responses grounded in approved internal research instead of open-web noise.

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
