How to Build a Document Extraction Agent for Investment Banking Using AutoGen in Python
A document extraction agent for investment banking takes messy deal documents — pitch books, CIMs, credit agreements, KYC packs, financial statements, term sheets — and turns them into structured data your downstream systems can trust. That matters because bankers and analysts spend too much time manually copying fields, reconciling versions, and checking whether the extracted numbers match the source. If you automate this poorly, you create audit risk; if you do it right, you cut turnaround time without losing control.
Architecture
- **Document ingestion layer**
  - Accept PDFs, DOCX, scanned images, and email attachments.
  - Normalize files into text plus page-level metadata before any LLM call.
- **Extraction agent**
  - Uses an AutoGen `AssistantAgent` to identify entities like issuer names, deal size, maturity dates, covenants, fees, and counterparties.
  - Produces strict JSON output aligned to a schema.
- **Validation agent**
  - Uses a second AutoGen agent to verify extracted fields against the source text.
  - Flags missing fields, low-confidence values, and inconsistencies across pages.
- **Human review loop**
  - Routes exceptions to an analyst when confidence is low or compliance-sensitive fields are present.
  - Keeps an auditable record of what was extracted and what was overridden.
- **Persistence layer**
  - Stores raw documents, extracted JSON, validation results, and trace metadata.
  - Supports retention policies and data residency requirements.
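The ingestion layer can be sketched as a small normalization step. `Page` and `normalize_pages` below are hypothetical names, and a real pipeline would plug in an OCR engine or PDF parser to produce the raw page texts; the point is to attach a document hash and page numbers before anything reaches the model.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Page:
    doc_sha256: str   # hash of the source file, for traceability
    page_number: int  # 1-based page index
    text: str         # parsed or OCR'd text for this page


def normalize_pages(doc_bytes: bytes, page_texts: list[str]) -> list[Page]:
    """Attach a document hash and page numbers to pre-extracted page text."""
    digest = hashlib.sha256(doc_bytes).hexdigest()
    return [
        Page(doc_sha256=digest, page_number=i, text=text.strip())
        for i, text in enumerate(page_texts, start=1)
    ]
```

Keeping the hash on every page record is what later lets the validation agent and reviewers trace any extracted field back to an exact source file.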
Implementation
1) Install AutoGen and define the extraction schema
Install the dependencies with `pip install pyautogen pydantic`. For investment banking work, don't let the model answer free-form: define the output contract first so every document maps to the same structure.
```python
from pydantic import BaseModel
from typing import Optional


class DealExtraction(BaseModel):
    deal_name: Optional[str] = None
    issuer: Optional[str] = None
    adviser: Optional[str] = None
    currency: Optional[str] = None
    deal_size: Optional[str] = None
    maturity_date: Optional[str] = None
    coupon: Optional[str] = None
    governing_law: Optional[str] = None
```
2) Create AutoGen agents with explicit roles
Use AssistantAgent for extraction and verification. Use a UserProxyAgent to drive the workflow from your application code.
```python
import os

from autogen import AssistantAgent, UserProxyAgent

llm_config = {
    "config_list": [
        {
            "model": "gpt-4o-mini",
            "api_key": os.environ["OPENAI_API_KEY"],
        }
    ],
    "temperature": 0,
}

extractor = AssistantAgent(
    name="extractor",
    llm_config=llm_config,
    system_message=(
        "You extract structured fields from investment banking documents. "
        "Return only valid JSON matching the requested schema. "
        "If a field is absent, use null."
    ),
)

validator = AssistantAgent(
    name="validator",
    llm_config=llm_config,
    system_message=(
        "You verify extracted banking fields against the source text. "
        "Return a concise JSON object with 'is_valid', 'issues', and 'suggested_fixes'."
    ),
)

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,  # this agent only relays messages; no code execution
)
```
3) Run extraction and validation on document text
In production you should OCR or parse files before this step. The example below assumes you already have clean text from a PDF or DOCX parser.
```python
document_text = """
Issuer: Northbridge Capital plc
Adviser: Topiax Securities Ltd.
Transaction: €500 million senior secured notes due 2031
Coupon: 7.25%
Governing law: English law
"""

extract_prompt = f"""
Extract the following fields from this investment banking document:
deal_name, issuer, adviser, currency, deal_size, maturity_date, coupon, governing_law.

Document:
{document_text}

Return only JSON.
"""

extraction_result = user_proxy.initiate_chat(
    extractor,
    message=extract_prompt,
)
print(extraction_result.summary)
```
That gives you a conversation transcript. In practice you parse the final assistant message into your DealExtraction model and reject anything that fails schema validation.
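That parsing step can look like the sketch below. `parse_extraction` is a hypothetical helper that pulls the first JSON object out of the reply and validates it against the `DealExtraction` schema from step 1 (repeated here so the snippet is self-contained); anything that fails validation returns `None` and should be routed to review.

```python
import re
from typing import Optional

from pydantic import BaseModel, ValidationError


class DealExtraction(BaseModel):
    deal_name: Optional[str] = None
    issuer: Optional[str] = None
    adviser: Optional[str] = None
    currency: Optional[str] = None
    deal_size: Optional[str] = None
    maturity_date: Optional[str] = None
    coupon: Optional[str] = None
    governing_law: Optional[str] = None


def parse_extraction(reply: str) -> Optional[DealExtraction]:
    """Find the first JSON object in the reply and validate it, else None."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        return None
    try:
        return DealExtraction.model_validate_json(match.group(0))
    except ValidationError:
        return None
```

The regex tolerates models that wrap JSON in prose or code fences; the Pydantic validation is what actually enforces the contract.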
4) Add a verification pass before writing to your system of record
This is where most teams skip discipline and pay for it later. A second agent checks whether extracted values are actually supported by the text.
```python
validation_prompt = f"""
Source text:
{document_text}

Extracted JSON:
{extraction_result.summary}

Check whether each field is supported by the source text.
Return JSON with keys:
is_valid (boolean),
issues (array of strings),
suggested_fixes (object).
"""

validation_result = user_proxy.initiate_chat(
    validator,
    message=validation_prompt,
)
print(validation_result.summary)
```
If validation fails, route the item to manual review instead of pushing it downstream into CRM or deal tracking systems. In banking workflows that is usually cheaper than cleaning up bad data after launch.
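A minimal routing sketch, assuming the validator's JSON has already been parsed into a dict, and with `SENSITIVE_FIELDS` as a hypothetical stand-in for whatever your compliance policy designates:

```python
from typing import Any

# Hypothetical policy list: fields that always need a human look when missing.
SENSITIVE_FIELDS = {"governing_law", "deal_size"}


def route(validation: dict[str, Any], extracted: dict[str, Any]) -> str:
    """Return 'downstream' only when validation passed and sensitive fields are present."""
    if not validation.get("is_valid", False):
        return "manual_review"
    if any(extracted.get(field) is None for field in SENSITIVE_FIELDS):
        return "manual_review"
    return "downstream"
```

Defaulting to `manual_review` on any missing or falsy `is_valid` keeps the failure mode conservative, which is the right bias for banking data.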
Production Considerations
- **Compliance logging**
  - Store prompts, responses, document hashes, model version, timestamps, and reviewer actions.
  - You need this for auditability when compliance asks why a field was extracted a certain way.
- **Data residency**
  - Keep documents in-region if your bank has jurisdictional constraints.
  - If policy prohibits sending client data to external APIs outside approved regions, use an approved deployment path or private model endpoint.
- **Guardrails on sensitive fields**
  - Detect PII, MNPI indicators, account numbers, tax IDs, and legal clauses before sending content to the model.
  - Redact or isolate sensitive sections when policy requires it.
- **Operational monitoring**
  - Track extraction accuracy by document type: pitch books behave differently from credit agreements.
  - Monitor null rates, validation failure rates, manual override rates, and latency per page.
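The compliance-logging record can be sketched as below; the field names are illustrative, and a real system would also persist the schema version and reviewer actions.

```python
import hashlib
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditRecord:
    doc_sha256: str   # hash ties the extraction back to the exact source file
    model: str        # model identifier used for the call
    prompt: str       # full prompt sent to the model
    response: str     # raw model response, before parsing
    timestamp: float  # Unix time of the extraction


def build_audit_record(doc_bytes: bytes, model: str, prompt: str, response: str) -> AuditRecord:
    """Assemble an immutable audit record for one extraction call."""
    return AuditRecord(
        doc_sha256=hashlib.sha256(doc_bytes).hexdigest(),
        model=model,
        prompt=prompt,
        response=response,
        timestamp=time.time(),
    )
```

Making the record frozen and keyed on the document hash means any later question from compliance can be answered by replaying exactly what the model saw.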
Common Pitfalls

- **Letting the model output prose instead of structured data**
  - Fix it by enforcing JSON-only responses and validating with Pydantic before persistence.
  - If parsing fails even once in production, finance workflows will surface it immediately.
- **Skipping source-grounding checks**
  - Fix it by running a validation agent against the original text and rejecting unsupported fields.
  - Never trust extracted numbers without traceability back to page or paragraph evidence.
- **Ignoring document-type variation**
  - Fix it by using different prompts or schemas for term sheets, financial statements, KYC packs, and board materials.
  - A single generic prompt usually degrades fast once you move beyond clean examples.
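Per-document-type prompts can start as a simple registry; the document types and templates below are illustrative placeholders, not a recommended taxonomy.

```python
# Illustrative prompt registry keyed by document type.
PROMPTS: dict[str, str] = {
    "term_sheet": "Extract deal_name, issuer, currency, deal_size, coupon, maturity_date.",
    "credit_agreement": "Extract issuer, governing_law, covenants, fees.",
    "kyc_pack": "Extract counterparty names, registration numbers, jurisdictions.",
}


def prompt_for(doc_type: str) -> str:
    """Fail loudly on unknown types instead of falling back to a generic prompt."""
    if doc_type not in PROMPTS:
        raise ValueError(f"no extraction prompt registered for {doc_type!r}")
    return PROMPTS[doc_type]
```

Raising on unknown types is deliberate: silently reusing a generic prompt is exactly the degradation mode described above.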
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.