How to Build a Document Extraction Agent Using AutoGen in Python for Banking

By Cyprian Aarons · Updated 2026-04-21
Tags: document-extraction, autogen, python, banking

A document extraction agent for banking takes messy inputs like loan applications, bank statements, KYC forms, and proof-of-income PDFs, then turns them into structured JSON your downstream systems can trust. The point is not just OCR; it is extracting the right fields, validating them against policy, and producing an auditable trail for compliance, underwriting, and operations.

Architecture

  • Document intake layer

    • Accepts PDF, image, or text uploads from a secure internal endpoint.
    • Normalizes file metadata: customer ID, document type, timestamp, source system.
  • Extraction agent

    • Uses an AssistantAgent to read the document text and return structured fields.
    • Enforces a fixed schema for banking objects like applicant name, account number, income, employer, and dates.
  • Validation agent

    • Uses a second AssistantAgent to check extracted values against business rules.
    • Flags missing fields, inconsistent dates, suspicious amounts, or low-confidence outputs.
  • Orchestrator

    • Uses UserProxyAgent to manage the conversation and execute deterministic Python tools.
    • Controls when to call OCR, parsing utilities, schema validation, and persistence.
  • Audit and persistence layer

    • Stores raw input references, extracted JSON, validation results, model version, and timestamps.
    • Required for traceability under banking compliance and internal model governance.
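The intake layer above can be sketched as a small normalization step. Note that `DocumentMeta` and `normalize_metadata` are illustrative names for this article, not part of AutoGen or any banking SDK:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class DocumentMeta:
    customer_id: str
    document_type: str
    source_system: str
    received_at: str

def normalize_metadata(customer_id: str, document_type: str,
                       source_system: str) -> DocumentMeta:
    """Normalize raw upload metadata into one consistent record."""
    return DocumentMeta(
        customer_id=customer_id.strip().upper(),
        document_type=document_type.strip().lower().replace(" ", "_"),
        source_system=source_system.strip(),
        received_at=datetime.now(timezone.utc).isoformat(),
    )

meta = normalize_metadata(" cus-104 ", "Bank Statement", "branch-portal")
print(meta.document_type)  # bank_statement
```

Normalizing at the boundary means every later stage can rely on one canonical shape for document type and customer ID.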

Implementation

1) Install AutoGen and define the extraction schema

Use a strict schema first. In banking workflows, free-form output is how bad data gets into core systems.

from pydantic import BaseModel, Field
from typing import Optional

class BankDocumentExtraction(BaseModel):
    document_type: str = Field(..., description="e.g. bank_statement, payslip, kyc_form")
    full_name: Optional[str] = None
    account_number: Optional[str] = None
    employer_name: Optional[str] = None
    monthly_income: Optional[float] = None
    statement_period: Optional[str] = None
    issue_date: Optional[str] = None
    confidence_notes: Optional[str] = None
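As a quick sanity check, the schema already rejects malformed payloads before anything reaches an agent. The sample values below are made up, and the schema is repeated so the snippet runs standalone:

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

# Same schema as above, repeated here so the snippet is self-contained.
class BankDocumentExtraction(BaseModel):
    document_type: str = Field(..., description="e.g. bank_statement, payslip, kyc_form")
    full_name: Optional[str] = None
    account_number: Optional[str] = None
    employer_name: Optional[str] = None
    monthly_income: Optional[float] = None
    statement_period: Optional[str] = None
    issue_date: Optional[str] = None
    confidence_notes: Optional[str] = None

# A well-formed payload parses into a typed object.
record = BankDocumentExtraction.model_validate({
    "document_type": "bank_statement",
    "full_name": "Priya Ndlovu",
    "monthly_income": 42500.0,
})
print(record.monthly_income)

# A malformed payload (non-numeric income) is rejected before persistence.
try:
    BankDocumentExtraction.model_validate({
        "document_type": "payslip",
        "monthly_income": "forty-two thousand",
    })
except ValidationError as e:
    print("rejected:", e.error_count(), "error(s)")
```

This is the gate that keeps free-form model output from reaching core systems.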

2) Create agents with actual AutoGen classes

This pattern uses AssistantAgent for extraction and review, plus UserProxyAgent as the execution controller. The user proxy can call local Python functions for OCR or file parsing if you wire them in later.

import os
import autogen

llm_config = {
    "config_list": [
        {
            "model": "gpt-4o-mini",
            "api_key": os.environ["OPENAI_API_KEY"],
        }
    ],
    "temperature": 0,
}

extractor = autogen.AssistantAgent(
    name="extractor",
    llm_config=llm_config,
    system_message=(
        "Extract banking document fields into valid JSON only. "
        "Do not invent values. If a field is missing, use null."
    ),
)

validator = autogen.AssistantAgent(
    name="validator",
    llm_config=llm_config,
    system_message=(
        "Validate extracted banking document JSON against policy. "
        "Return issues only. Focus on missing fields and inconsistencies."
    ),
)

user_proxy = autogen.UserProxyAgent(
    name="bank_ops",
    human_input_mode="NEVER",
    code_execution_config=False,
)

3) Run extraction and validation as a controlled two-agent flow

In production you want one agent to extract and another to critique. That separation reduces hallucinated fields and makes audit easier because you can store both outputs independently.

document_text = """
Bank Statement
Name: Priya Ndlovu
Account Number: 0048392011
Statement Period: 2024-01-01 to 2024-01-31
Monthly Income Credit: ZAR 42,500.00
Employer: Topiax Consulting
Issue Date: 2024-02-02
"""

extract_prompt = f"""
Extract the following document into JSON matching this schema:
{BankDocumentExtraction.model_json_schema()}

Document:
{document_text}
"""

chat_result = user_proxy.initiate_chat(
    extractor,
    message=extract_prompt,
    max_turns=2,  # one extraction reply; prevents an open-ended back-and-forth
)

extracted_text = chat_result.chat_history[-1]["content"]
print(extracted_text)

validation_prompt = f"""
Review this extracted JSON for banking quality issues:
{extracted_text}

Check:
- missing required fields
- invalid dates
- suspicious amounts
- format problems with account numbers

Return a short list of issues only.
"""

validation_result = user_proxy.initiate_chat(
    validator,
    message=validation_prompt,
    max_turns=2,  # one validation reply is enough for this flow
)

print(validation_result.chat_history[-1]["content"])

4) Parse the output before persisting it

Do not write raw model text directly into your database. Parse it into a typed object first so your application can reject malformed output before it hits downstream systems.

import json
import re

def parse_extraction(raw_text: str) -> BankDocumentExtraction:
    # Models often wrap JSON in markdown fences; strip them before parsing.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw_text.strip())
    data = json.loads(cleaned)
    # Pydantic rejects missing required fields and wrong types here.
    return BankDocumentExtraction(**data)

# Example usage after you get clean JSON back from the extractor:
# record = parse_extraction(extracted_text)
# db.save(record.model_dump())
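Alongside the typed record, persist an audit envelope so every extracted value can be traced back to its source document. A minimal sketch; the field names here are illustrative and should be aligned with your governance schema:

```python
import hashlib
from datetime import datetime, timezone

def build_audit_record(raw_document: bytes, extracted: dict,
                       validation_notes: str, model_name: str) -> dict:
    """Bundle what compliance needs to trace a field back to its source."""
    return {
        "document_sha256": hashlib.sha256(raw_document).hexdigest(),
        "extracted_json": extracted,
        "validation_notes": validation_notes,
        "model_version": model_name,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

audit = build_audit_record(b"...pdf bytes...", {"document_type": "bank_statement"},
                           "no issues", "gpt-4o-mini")
print(sorted(audit.keys()))
```

Hashing the raw document (rather than storing a second copy) keeps the audit table small while still proving which file produced which fields.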

Production Considerations

  • Data residency

    • Keep OCR text and extracted payloads in-region if your banking policy requires it.
    • If your LLM endpoint is externalized, confirm where prompts and logs are stored.
  • Auditability

    • Persist the original document hash, extracted JSON, validation output, agent names, model version, and timestamps.
    • You need this when compliance asks why a loan decision used a specific field value.
  • Guardrails

    • Reject outputs that fail schema validation.
    • Add deterministic checks for account number formats, date ranges, currency normalization, and mandatory KYC fields before any workflow continues.
  • Monitoring

    • Track extraction failure rate by document type.
    • Watch for drift in field completeness after template changes from banks or employers.
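The deterministic guardrails above can be sketched as a plain function that runs after schema validation. The specific rules here (10-digit account numbers, ISO date periods, an income ceiling) are placeholder policy for illustration, not real banking rules:

```python
import re
from datetime import date

def guardrail_issues(data: dict) -> list[str]:
    """Deterministic policy checks run after schema validation passes."""
    issues = []
    acct = data.get("account_number")
    if acct and not re.fullmatch(r"\d{10}", acct):
        issues.append("account_number must be exactly 10 digits")
    period = data.get("statement_period")
    if period:
        m = re.fullmatch(r"(\d{4}-\d{2}-\d{2}) to (\d{4}-\d{2}-\d{2})", period)
        if not m:
            issues.append("statement_period must be 'YYYY-MM-DD to YYYY-MM-DD'")
        elif date.fromisoformat(m.group(1)) > date.fromisoformat(m.group(2)):
            issues.append("statement_period start is after end")
    income = data.get("monthly_income")
    if income is not None and not (0 < income < 10_000_000):
        issues.append("monthly_income outside plausible range")
    return issues

print(guardrail_issues({
    "account_number": "0048392011",
    "statement_period": "2024-01-01 to 2024-01-31",
    "monthly_income": 42500.0,
}))  # []
```

Because these checks are plain code, they are cheap to unit test and their behavior never drifts with a model upgrade.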

Common Pitfalls

  1. Letting the model free-write instead of enforcing structure

    • Bad output becomes expensive fast.
    • Avoid this by requiring JSON-only responses and validating with Pydantic before persistence.
  2. Skipping deterministic checks

    • LLMs are good at reading context but weak at policy enforcement.
    • Always verify things like date order, currency format, checksum rules where applicable, and required KYC fields in code.
  3. Ignoring compliance boundaries

    • Banking documents often contain PII and regulated data.
    • Redact unnecessary fields before sending content to the model when possible, log access events, and keep an audit trail that maps every extracted value back to its source document.
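The redaction mentioned in pitfall 3 can be as simple as masking long digit runs before text leaves your boundary. This regex-only approach is a sketch; production redaction should use a vetted PII detection library:

```python
import re

def redact_for_llm(text: str) -> str:
    """Mask runs of 6+ digits (account numbers, IDs) before sending text
    to an external model, keeping the last 4 digits for traceability."""
    return re.sub(
        r"\b(\d{6,})\b",
        lambda m: "*" * (len(m.group(1)) - 4) + m.group(1)[-4:],
        text,
    )

print(redact_for_llm("Account Number: 0048392011"))  # Account Number: ******2011
```

Short numbers like dates are left alone, so the model still sees enough context to extract statement periods correctly.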

By Cyprian Aarons, AI Consultant at Topiax.