How to Build a Document Extraction Agent Using AutoGen in Python for Fintech

By Cyprian Aarons · Updated 2026-04-21
document-extraction · autogen · python · fintech

A document extraction agent takes PDFs, scans, statements, invoices, and onboarding forms, then turns them into structured JSON your downstream systems can use. In fintech, that matters because manual extraction is slow, inconsistent, and expensive, while bad extraction creates compliance risk, bad KYC data, and broken reconciliation.

Architecture

  • Input ingestion layer

    • Accepts PDFs, images, or text from S3, a file store, or an internal upload service.
    • Normalizes file metadata like source system, customer ID, jurisdiction, and retention policy.
  • OCR / text extraction tool

    • Converts scanned documents into raw text.
    • For production, this is usually an external service or internal OCR pipeline rather than the LLM itself.
  • Extraction agent

    • Uses AutoGen to read extracted text and return structured fields.
    • Focuses on deterministic output: account numbers, dates, amounts, names, addresses, entity type.
  • Validation agent

    • Checks schema completeness, field formats, and cross-field consistency.
    • Flags missing tax IDs, invalid IBANs, mismatched totals, or suspicious date sequences.
  • Audit and logging layer

    • Stores prompts, model outputs, validation results, and document hashes.
    • Required for traceability in regulated environments.
  • Human review fallback

    • Routes low-confidence or policy-sensitive cases to an analyst.
    • Needed for KYC/AML workflows and exception handling.
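The layers above compose into a linear pipeline. A minimal sketch of that flow, with the stage functions injected so OCR and LLM backends can be swapped in tests (the `DocumentJob` record and stage names are illustrative, not part of AutoGen):

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class DocumentJob:
    raw_bytes: bytes
    source_system: str
    customer_id: str
    text: str = ""
    extraction: dict = field(default_factory=dict)
    issues: list = field(default_factory=list)
    needs_review: bool = False

    @property
    def doc_hash(self) -> str:
        # Stored by the audit layer so any payload can be tied back to the exact input.
        return hashlib.sha256(self.raw_bytes).hexdigest()

def run_pipeline(job, ocr, extract, validate):
    # Each stage is a plain callable: bytes -> text -> fields -> issues.
    job.text = ocr(job.raw_bytes)
    job.extraction = extract(job.text)
    job.issues = validate(job.extraction)
    job.needs_review = bool(job.issues)  # human review fallback
    return job
```

In a real deployment each callable would wrap the OCR service, the AutoGen extractor, and the validator described below; the shape of the record stays the same.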

Implementation

1) Install AutoGen and define the schema

For AutoGen Python projects, install the pyautogen package (imported as autogen) and keep your output contract explicit. In fintech you want structured extraction first, natural language second.

pip install pyautogen pydantic
from pydantic import BaseModel
from typing import Optional

class BankStatementExtraction(BaseModel):
    customer_name: str
    account_number: str
    statement_date: str
    currency: str
    opening_balance: float
    closing_balance: float
    total_debits: float
    total_credits: float
    bank_name: Optional[str] = None

2) Create an assistant agent that extracts into JSON

AutoGen’s AssistantAgent can be instructed to produce strict JSON. Pair it with a UserProxyAgent to run the interaction and capture the result.

import os
import json
from autogen import AssistantAgent, UserProxyAgent

llm_config = {
    "config_list": [
        {
            "model": "gpt-4o-mini",
            "api_key": os.environ["OPENAI_API_KEY"],
        }
    ],
    "temperature": 0,
}

extractor = AssistantAgent(
    name="extractor",
    llm_config=llm_config,
    system_message=(
        "You extract structured fields from financial documents. "
        "Return only valid JSON matching the requested schema. "
        "Do not invent values. Use null for missing optional fields."
    ),
)

user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
)

document_text = """
Bank of Example
Customer Name: Jane Doe
Account Number: 123456789
Statement Date: 2024-12-31
Currency: USD
Opening Balance: 1000.00
Closing Balance: 1450.50
Total Debits: 250.00
Total Credits: 700.50
"""

message = f"""
Extract the following fields as JSON:
customer_name, account_number, statement_date,
currency, opening_balance, closing_balance,
total_debits, total_credits, bank_name

Document:
{document_text}
"""

result = user.initiate_chat(extractor, message=message)
print(result.chat_history[-1]["content"])

This pattern is simple but production-friendly. The important part is that the assistant is constrained to return a machine-readable object instead of a free-form summary.
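One practical wrinkle: models often wrap JSON in markdown fences even when told not to. A small defensive parsing step (a sketch, not AutoGen functionality) keeps the next step robust:

```python
import json
import re

def parse_json_reply(reply: str) -> dict:
    """Strip optional ```json fences from a model reply, then parse it."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", reply.strip())
    return json.loads(cleaned)

# Handles both fenced and bare JSON replies:
print(parse_json_reply('```json\n{"currency": "USD"}\n```'))  # {'currency': 'USD'}
print(parse_json_reply('{"currency": "USD"}'))                # {'currency': 'USD'}
```

Run this before handing the string to Pydantic so fence noise never counts as an extraction failure.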

3) Validate the model output before it hits downstream systems

Never trust raw LLM output in a fintech pipeline. Parse it with Pydantic and reject anything that does not match your contract.

from pydantic import ValidationError

raw_output = result.chat_history[-1]["content"]

try:
    data = json.loads(raw_output)
    parsed = BankStatementExtraction(**data)
    print(parsed.model_dump())
except (json.JSONDecodeError, ValidationError) as e:
    print(f"Invalid extraction payload: {e}")

If you need stronger control over retries or multi-step reasoning between agents, use GroupChat with a validator agent that critiques the extractor output before final acceptance. That gives you a clean separation between extraction and policy checks.
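A lightweight middle ground before reaching for GroupChat is a plain retry loop that feeds the validation error back to the model. This is a sketch: `ask_model` stands in for an AutoGen round trip, and `schema` is any Pydantic model class such as BankStatementExtraction.

```python
import json
from pydantic import BaseModel, ValidationError

def extract_with_retries(prompt, ask_model, schema, max_attempts=3):
    """ask_model(prompt) -> raw reply string; schema is a Pydantic model class."""
    for _ in range(max_attempts):
        raw = ask_model(prompt)
        try:
            return schema(**json.loads(raw))
        except (json.JSONDecodeError, ValidationError) as e:
            # Feed the failure back so the model can self-correct on the next pass.
            prompt += f"\n\nYour previous reply was rejected: {e}. Return only corrected JSON."
    raise RuntimeError("extraction failed after retries")

# Demo with a stubbed model that fails once, then succeeds:
class Mini(BaseModel):
    currency: str

replies = iter(["not json", '{"currency": "USD"}'])
result = extract_with_retries("Extract the currency as JSON.", lambda p: next(replies), Mini)
print(result.currency)  # USD
```

Cap the attempts: an extractor that cannot produce valid JSON in two or three tries should go to the human review queue, not loop forever.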

4) Add a validator agent for compliance-sensitive rules

A second agent can inspect extracted values for domain rules like date validity or suspicious balance math. This is where you catch obvious issues before they enter core banking workflows.

validator = AssistantAgent(
    name="validator",
    llm_config=llm_config,
    system_message=(
        "You validate extracted financial data for consistency. "
        "Check numeric relationships and flag missing required fields. "
        "Return only JSON with keys: valid (bool), issues (list)."
    ),
)

validation_prompt = f"""
Validate this extraction:
{raw_output}

Rules:
- opening_balance + total_credits - total_debits should equal closing_balance within rounding tolerance.
- account_number must be present.
- statement_date must be ISO format YYYY-MM-DD.
"""

validation_result = user.initiate_chat(validator, message=validation_prompt)
print(validation_result.chat_history[-1]["content"])
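LLM validators can miss arithmetic, so back the agent with a deterministic check. A sketch of the balance rule from the prompt above, using the same field names as the schema:

```python
import math

def check_balance_math(extraction: dict, tolerance: float = 0.01) -> list:
    """Return a list of issues; an empty list means the numeric relationship holds."""
    issues = []
    expected = (
        extraction["opening_balance"]
        + extraction["total_credits"]
        - extraction["total_debits"]
    )
    if not math.isclose(expected, extraction["closing_balance"], abs_tol=tolerance):
        issues.append(
            f"closing_balance {extraction['closing_balance']} != expected {expected:.2f}"
        )
    return issues

print(check_balance_math({
    "opening_balance": 1000.00,
    "total_credits": 700.50,
    "total_debits": 250.00,
    "closing_balance": 1450.50,
}))  # []
```

In production you would run this on the parsed Pydantic object, and for money you would likely prefer Decimal over float to avoid rounding surprises.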

Production Considerations

  • Data residency

    • Keep document processing in-region if your regulator requires it.
    • If documents contain PII or financial identifiers, make sure model calls do not violate cross-border transfer rules.
  • Auditability

    • Store document hash, prompt versioning, model version, extracted payloads, validation outcome.
    • You need to reconstruct why a field was accepted or rejected during audits or disputes.
  • Guardrails

    • Enforce schema validation after every model call.
    • Block free-text responses from entering core systems; only accept typed JSON objects.
  • Monitoring

    • Track field-level accuracy by document type.
    • Watch for drift in OCR quality, vendor template changes, and rising human-review rates.
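Field-level accuracy tracking from the monitoring bullet can start as a simple counter keyed by document type and field; this sketch assumes you have ground-truth labels from human review to compare against.

```python
from collections import Counter

class FieldAccuracyTracker:
    """Counts matches vs. total observations per (document_type, field) pair."""

    def __init__(self):
        self.correct = Counter()
        self.total = Counter()

    def record(self, document_type, field, predicted, actual):
        key = (document_type, field)
        self.total[key] += 1
        if predicted == actual:
            self.correct[key] += 1

    def accuracy(self, document_type, field):
        key = (document_type, field)
        return self.correct[key] / self.total[key] if self.total[key] else None

tracker = FieldAccuracyTracker()
tracker.record("bank_statement", "currency", "USD", "USD")
tracker.record("bank_statement", "currency", "EUR", "USD")
print(tracker.accuracy("bank_statement", "currency"))  # 0.5
```

A sudden drop in one (document type, field) cell is usually the first visible symptom of a vendor template change or OCR regression.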

Common Pitfalls

  1. Letting the agent “interpret” instead of extract

    • Mistake: asking for summaries or explanations alongside fields.
    • Fix: separate extraction from analysis; keep prompts strict and schema-driven.
  2. Skipping deterministic validation

    • Mistake: trusting the LLM because it returned plausible JSON.
    • Fix: parse with Pydantic or equivalent validators before any downstream write operation.
  3. Ignoring document-type variance

    • Mistake: using one prompt for bank statements, invoices, and onboarding forms.
    • Fix: build per-document templates and per-schema agents so each workflow has explicit field expectations.
  4. No human fallback on low-confidence cases

    • Mistake: auto-posting every extraction into KYC or ledger systems.
    • Fix: route ambiguous outputs to review queues with confidence thresholds and audit notes attached.
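The review-queue fix in pitfall 4 can be as simple as a threshold gate. The confidence score, threshold, and queue names here are placeholders for whatever your stack provides:

```python
AUTO_ACCEPT_THRESHOLD = 0.9  # tune per document type from observed accuracy

def route_extraction(extraction: dict, confidence: float, issues: list) -> str:
    """Decide whether an extraction is auto-posted or sent to an analyst."""
    if issues or confidence < AUTO_ACCEPT_THRESHOLD:
        return "human_review_queue"
    return "auto_post"

print(route_extraction({"account_number": "123456789"}, 0.95, []))   # auto_post
print(route_extraction({"account_number": "123456789"}, 0.62, []))   # human_review_queue
```

Any validation issue forces review regardless of confidence, which keeps the KYC and ledger paths conservative by default.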

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.
