How to Build a Document Extraction Agent Using LlamaIndex in Python for Investment Banking

By Cyprian Aarons · Updated 2026-04-21
Tags: document-extraction, llamaindex, python, investment-banking

A document extraction agent for investment banking reads deal documents such as term sheets, CIMs, credit agreements, and KYC packs, then turns unstructured text into structured fields your downstream systems can use. That matters because bankers spend too much time copying data between PDFs, spreadsheets, and internal systems, and every manual handoff adds risk around compliance, auditability, and missed terms.

Architecture

  • Document ingestion layer

    • Pulls PDFs, DOCX files, and scanned images from approved storage.
    • Normalizes files into LlamaIndex Document objects.
  • Text extraction and chunking

    • Uses LlamaIndex readers and node parsers to split long documents into manageable chunks.
    • Keeps page-level metadata so extracted fields can be traced back to source pages.
  • Extraction schema

    • Defines the exact fields you want: borrower name, facility amount, maturity date, governing law, covenants, fees.
    • Forces consistent output instead of free-form summaries.
  • LLM-backed extractor

    • Uses a structured extraction pipeline with OpenAI or another LLM supported by LlamaIndex.
    • Converts raw text into typed Python objects or JSON-like records.
  • Validation and audit layer

    • Checks required fields, date formats, currency values, and confidence thresholds.
    • Stores source citations for every extracted field.
  • Persistence layer

    • Writes results to PostgreSQL, S3, or a document store with immutable audit logs.
    • Supports replay for model governance and compliance review.
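One way to make the validation and persistence contracts concrete is a single record type that every layer passes along. This is an illustrative sketch, not a LlamaIndex type; the field names are assumptions that follow the traceability requirements above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractionRecord:
    """Illustrative shape for one extracted field as it moves from the
    extractor through validation to persistence. Not a LlamaIndex API."""
    deal_id: str
    field_name: str
    value: Optional[str]
    source_file: str
    page_label: str          # traces the value back to a source page
    confidence: float = 0.0  # set later by the validation layer

sample = ExtractionRecord(
    deal_id="D-001",
    field_name="governing_law",
    value="New York",
    source_file="credit_agreement.pdf",
    page_label="12",
)
print(sample.field_name, sample.value)
```

Keeping one record per field, rather than one blob per document, is what makes per-field citations and per-field accuracy monitoring possible later.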

Implementation

1. Install the core packages

You need LlamaIndex plus a reader for your file types. For PDFs in investment banking workflows, PyMuPDF is a practical default because it handles many real-world deal docs well.

pip install llama-index llama-index-readers-file pydantic pymupdf

Set your model key through environment variables before running extraction:

export OPENAI_API_KEY="your-key"

2. Load the document into LlamaIndex

Use a file reader to turn the PDF into Document objects. Keep metadata like filename and page number so you can trace every extracted value back to source material during audit review.

from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(
    input_dir="./deal_docs",
    required_exts=[".pdf"],
    recursive=False,
)

documents = reader.load_data()
print(f"Loaded {len(documents)} documents")
print(documents[0].metadata)

If your intake comes from a controlled bucket or DMS, replace SimpleDirectoryReader with your own loader that constructs Document instances directly. The important part is preserving metadata such as deal_id, source_system, page_label, and ingested_at.
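If you roll your own loader, the shape is simple: read the content, build the metadata dict up front, and keep the two together. The sketch below is pure standard library so it stays self-contained; in a real pipeline each `(text, metadata)` pair would become `Document(text=..., metadata=...)`, and `deal_id` and `source_system` would come from your DMS rather than function arguments.

```python
from datetime import datetime, timezone
from pathlib import Path

def load_with_metadata(path: str, deal_id: str, source_system: str):
    """Read a file and attach the audit metadata recommended above.
    Each returned pair maps to one LlamaIndex Document."""
    raw = Path(path).read_text()
    metadata = {
        "deal_id": deal_id,
        "source_system": source_system,
        "file_name": Path(path).name,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    return raw, metadata

# Demo with a temporary file so the sketch is runnable as-is.
import os
import tempfile

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Borrower: Example Holdings LLC")
    tmp = f.name

text, meta = load_with_metadata(tmp, deal_id="D-042", source_system="dms")
os.unlink(tmp)
print(meta["deal_id"], meta["file_name"])
```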

3. Define the extraction schema and run structured extraction

For banking use cases, do not ask the model to “summarize” the document. Define the exact fields you want using Pydantic so you get typed output that can be validated downstream.

from pydantic import BaseModel, Field
from typing import Optional
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

class CreditAgreementFields(BaseModel):
    borrower_name: str = Field(description="Legal name of the borrower")
    facility_amount: Optional[str] = Field(description="Committed facility amount")
    maturity_date: Optional[str] = Field(description="Final maturity date in ISO format if possible")
    governing_law: Optional[str] = Field(description="Governing law jurisdiction")
    arranger: Optional[str] = Field(description="Lead arranger or bookrunner")
    covenant_summary: Optional[str] = Field(description="Short summary of key financial covenants")

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)

# Example of extracting from one loaded document
doc_text = documents[0].text

prompt = f"""
Extract the following fields from this investment banking document.
Respond with a single JSON object that matches this schema, and nothing else:
{CreditAgreementFields.model_json_schema()}

Document:
{doc_text}
"""

response = Settings.llm.complete(prompt)
print(response.text)

That pattern works for a first pass, but for production you want explicit parsing rather than raw text output. A stronger approach is to wrap the response in a parser and validate it with Pydantic before saving.

import json
from pydantic import ValidationError

try:
    data = json.loads(response.text.strip())
    extracted = CreditAgreementFields(**data)
    print(extracted.model_dump())
except json.JSONDecodeError as e:
    print("Model did not return valid JSON:", e)
except ValidationError as e:
    print("Validation failed:", e)
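
One failure mode worth handling explicitly: chat models sometimes wrap the JSON in markdown fences or prepend a sentence of prose, which breaks a bare `json.loads`. A defensive parse helper, shown as a sketch (this is an assumption about model behavior, not part of the LlamaIndex API):

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Strip markdown code fences before parsing; fall back to the first
    {...} span if the whole string still is not valid JSON."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", cleaned, re.DOTALL)
        if not match:
            raise
        return json.loads(match.group(0))

parsed = parse_model_json('```json\n{"borrower_name": "Example Holdings LLC"}\n```')
print(parsed["borrower_name"])
```

Run the helper's output through the Pydantic model exactly as before; the helper only makes the parse step tolerant, not the validation step.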

4. Add source traceability for audit

Investment banking teams will ask where each field came from. You should store both the extracted record and the supporting evidence location.

A practical pattern is to keep page-level references in metadata and persist them alongside the final JSON record.

record = {
    "deal_id": documents[0].metadata.get("deal_id", "unknown"),
    "source_file": documents[0].metadata.get("file_name", "unknown"),
    "extracted": extracted.model_dump(),
    "source_metadata": documents[0].metadata,
}

print(record)
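
The compliance controls later in this guide recommend logging a document hash, and you can attach one to this record now so every extraction is tied to an exact document version. `fingerprint` is a hypothetical helper name, and SHA-256 is one reasonable digest choice.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Stable content hash so an audit reviewer can confirm exactly which
    document version produced a given extraction."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Illustrative content; in the pipeline this would be documents[0].text.
content = "Borrower: Example Holdings LLC\nGoverning law: New York"
audit_record = {
    "extracted": {"borrower_name": "Example Holdings LLC"},
    "doc_sha256": fingerprint(content),
}
print(audit_record["doc_sha256"][:12])
```

Hashing the normalized text (rather than the PDF bytes) means re-ingesting the same file yields the same fingerprint even if storage metadata changes; hash the bytes instead if your records policy requires file-level identity.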

If you need more robust retrieval over large deal books, create an index with VectorStoreIndex.from_documents() and query specific sections before extraction. That reduces token usage on long files like credit agreements or offering memoranda.

Production Considerations

  • Compliance controls

    • Restrict model access to approved environments only.
    • Log prompts, outputs, user identity, timestamp, model version, and document hash for auditability.
    • Keep retention aligned with legal hold requirements and internal records policy.
  • Data residency

    • Keep sensitive deal data in-region if your bank has residency constraints.
    • If you use hosted models, confirm where prompts and outputs are processed and stored.
    • For highly sensitive M&A or financing docs, consider private deployment or an approved enterprise endpoint.
  • Monitoring

    • Track extraction accuracy by field type: dates, amounts, parties, covenants.
    • Alert on low-confidence parses or schema validation failures.
    • Sample outputs regularly against human-reviewed ground truth from analysts or legal ops.
  • Guardrails

    • Reject outputs that fail schema validation.
    • Force deterministic settings like low temperature.
    • Block unsupported actions such as generating investment advice or making legal interpretations beyond extraction scope.
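
Schema validation catches missing fields, but the guardrails above also call for format checks on dates and amounts. A minimal sketch, assuming the `CreditAgreementFields` field names from the implementation section; the specific rules and thresholds would come from your own policy:

```python
import re
from datetime import date

def validate_extracted(fields: dict) -> list[str]:
    """Return a list of guardrail violations; an empty list means the
    record passes and may be persisted."""
    errors = []
    if not fields.get("borrower_name"):
        errors.append("borrower_name is required")
    maturity = fields.get("maturity_date")
    if maturity:
        try:
            date.fromisoformat(maturity)
        except ValueError:
            errors.append(f"maturity_date not ISO formatted: {maturity!r}")
    amount = fields.get("facility_amount")
    if amount and not re.search(r"\d", amount):
        errors.append(f"facility_amount has no numeric component: {amount!r}")
    return errors

violations = validate_extracted({
    "borrower_name": "Example Holdings LLC",
    "maturity_date": "2029-06-30",
    "facility_amount": "USD 500,000,000",
})
print(violations)
```

Records with a non-empty violation list should be routed to human review rather than silently written to the persistence layer.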

Common Pitfalls

  1. Using summarization instead of structured extraction

    • Summaries look good in demos but break downstream systems.
    • Avoid this by defining a strict schema with Pydantic and validating every response.
  2. Dropping source metadata

    • If you lose page numbers or file names, audit review becomes painful.
    • Preserve document metadata at ingestion time and persist it with each extracted field set.
  3. Ignoring scanned PDFs and OCR quality

    • Many banking docs are scans with tables and signatures.
    • Use OCR-capable ingestion for image-based PDFs and add quality checks before extraction.
  4. Treating all documents the same

    • A term sheet is not a credit agreement is not a CIM.
    • Route by document type first, then apply type-specific schemas so you extract the right fields without noise.
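
Routing can start simple and still pay off, because choosing the right schema downstream is what matters. The sketch below is a naive keyword router; a production system would more likely use a trained classifier or document-layout features, but the routing contract is the same.

```python
def route_document(text: str) -> str:
    """Pick a type-specific extraction schema key from document text.
    Keyword matching is a deliberate simplification for illustration."""
    lowered = text.lower()
    if "credit agreement" in lowered:
        return "credit_agreement"
    if "term sheet" in lowered:
        return "term_sheet"
    if "confidential information memorandum" in lowered:
        return "cim"
    return "unknown"  # route to human triage, never to a default schema

print(route_document("THIS CREDIT AGREEMENT is dated as of June 30, 2026"))
```

The `"unknown"` branch is the important one: forcing unrecognized documents into the closest schema is exactly how noisy, unauditable extractions get written to downstream systems.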

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
