How to Build a Document Extraction Agent Using AutoGen in Python for Retail Banking

By Cyprian Aarons · Updated 2026-04-21
document-extraction · autogen · python · retail-banking

A document extraction agent in retail banking reads customer documents like bank statements, payslips, IDs, proof of address, and loan forms, then turns them into structured fields your downstream systems can use. It matters because onboarding, lending, KYC, and servicing all depend on fast, accurate extraction with an audit trail that stands up to compliance review.

Architecture

  • Document intake service

    • Accepts PDFs, images, and scanned files from branch ops, digital onboarding, or case management systems.
    • Stores the raw file in a controlled location before any model call.
  • Preprocessing layer

    • Handles OCR if needed, page splitting, image cleanup, and text normalization.
    • Flags low-quality scans early so you do not feed garbage into the agent.
  • AutoGen agent orchestration

    • Uses AssistantAgent for extraction logic and UserProxyAgent to execute tool calls and control the loop.
    • Keeps the interaction bounded so the model cannot wander into unrelated analysis.
  • Extraction tools

    • Parse PDF text, run OCR, validate dates/amounts/account numbers, and map outputs to a fixed schema.
    • Return structured JSON only.
  • Validation and compliance layer

    • Checks required fields, confidence thresholds, PII handling rules, and jurisdiction-specific retention policies.
    • Writes immutable logs for audit.
  • Downstream integration

    • Pushes validated output into KYC systems, loan origination platforms, or case management queues.
    • Routes exceptions to human review (a minimal end-to-end sketch of this flow follows the list).
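
Before the implementation steps, here is a hypothetical sketch of how these layers hang together. Every name in it (PreprocessResult, preprocess_document, queue_for_manual_review, process_document) is a placeholder for your own intake, preprocessing, and routing services; only extract_banking_fields is actually built later in this guide.

from dataclasses import dataclass
from typing import Optional

@dataclass
class PreprocessResult:
    text: str          # normalized text after OCR and cleanup
    quality_ok: bool   # False for blurry or truncated scans

def preprocess_document(raw_file: bytes) -> PreprocessResult:
    # Placeholder: a real implementation runs OCR, page splitting, and image cleanup
    return PreprocessResult(text=raw_file.decode("utf-8", errors="ignore"), quality_ok=True)

def queue_for_manual_review(doc_id: str) -> None:
    # Placeholder: push to your case management or exception queue
    print(f"{doc_id}: routed to manual review")

def process_document(raw_file: bytes, doc_id: str) -> Optional[dict]:
    pages = preprocess_document(raw_file)
    if not pages.quality_ok:
        # Flag low-quality scans early instead of feeding garbage to the agent
        queue_for_manual_review(doc_id)
        return None
    # extract_banking_fields is built in the implementation steps below
    return extract_banking_fields(pages.text)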

Implementation

  1. Install AutoGen and define your extraction schema

Use pyautogen with a strict output contract. For retail banking, do not let the agent invent fields; make it produce only what your workflow expects.

from autogen import AssistantAgent, UserProxyAgent
from pydantic import BaseModel, Field
from typing import Optional

class BankDocumentExtraction(BaseModel):
    document_type: str = Field(..., description="passport|id_card|bank_statement|payslip|utility_bill")
    full_name: Optional[str] = None
    account_number: Optional[str] = None
    sort_code: Optional[str] = None
    address: Optional[str] = None
    issue_date: Optional[str] = None
    expiry_date: Optional[str] = None
    employer_name: Optional[str] = None
    net_pay: Optional[str] = None
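
If you want format rules enforced in the schema itself, Pydantic v2 field validators can reject malformed values before anything is persisted. The subclass and regexes below are illustrative assumptions (UK-style 8-digit account numbers, NN-NN-NN sort codes), not a compliance-grade rule set.

import re

from pydantic import field_validator

class ValidatedBankDocumentExtraction(BankDocumentExtraction):
    @field_validator("account_number")
    @classmethod
    def account_number_format(cls, v: Optional[str]) -> Optional[str]:
        # Illustrative rule: UK account numbers are 8 digits
        if v is not None and not re.fullmatch(r"\d{8}", v):
            raise ValueError("account_number must be 8 digits")
        return v

    @field_validator("sort_code")
    @classmethod
    def sort_code_format(cls, v: Optional[str]) -> Optional[str]:
        # Illustrative rule: UK sort codes look like 12-34-56
        if v is not None and not re.fullmatch(r"\d{2}-\d{2}-\d{2}", v):
            raise ValueError("sort_code must match NN-NN-NN")
        return v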
  2. Create an assistant agent with a strict system message

The system message should force extraction-only behavior. In banking workflows you want determinism over creativity.

import os

assistant = AssistantAgent(
    name="doc_extractor",
    llm_config={
        "config_list": [
            {
                "model": "gpt-4o-mini",
                # A literal "${OPENAI_API_KEY}" string is not expanded by
                # AutoGen; read the key from the environment instead
                "api_key": os.environ["OPENAI_API_KEY"],
            }
        ],
        "temperature": 0,
    },
    system_message=(
        "You extract structured data from retail banking documents. "
        "Return only valid JSON matching the requested schema. "
        "Do not summarize. Do not infer missing values. "
        "If a field is absent or unreadable, return null."
    ),
)
  3. Use a user proxy to run the conversation and validate output

UserProxyAgent is the control point. In production this is where you enforce tool execution policy, human approval gates, and retries on malformed output.

import json

user_proxy = UserProxyAgent(
    name="bank_ops",
    human_input_mode="NEVER",       # no interactive prompts in a service context
    max_consecutive_auto_reply=0,   # single extraction turn; do not loop
    code_execution_config=False,    # extraction only, no code execution
)

document_text = """
Customer Name: Priya Nair
Account No: 12345678
Sort Code: 12-34-56
Address: 14 King Street, London
Document Type: Bank Statement
"""

prompt = f"""
Extract fields from this retail banking document text.
Return JSON only with keys:
document_type, full_name, account_number, sort_code,
address, issue_date, expiry_date, employer_name, net_pay

Document:
{document_text}
"""

result = user_proxy.initiate_chat(
    assistant,
    message=prompt,
)

print(result.chat_history[-1]["content"])
  4. Wrap extraction in a production function with validation

You need a function that calls AutoGen and then validates against your schema before anything lands in core banking or CRM systems.

import re

from pydantic import ValidationError

def extract_banking_fields(document_text: str) -> dict:
    prompt = f"""
Extract fields from this retail banking document text.
Return JSON only with keys:
document_type, full_name, account_number, sort_code,
address, issue_date, expiry_date, employer_name, net_pay

Document:
{document_text}
"""
    chat_result = user_proxy.initiate_chat(assistant, message=prompt)
    content = chat_result.chat_history[-1]["content"]
    # Models sometimes wrap JSON in markdown fences even at temperature 0;
    # strip them before parsing
    content = re.sub(r"^```(?:json)?\s*|\s*```$", "", content.strip())

    data = json.loads(content)
    parsed = BankDocumentExtraction(**data)
    return parsed.model_dump()

try:
    extracted = extract_banking_fields(document_text)
    print(extracted)
except (json.JSONDecodeError, ValidationError) as e:
    print(f"Extraction failed validation: {e}")

Production Considerations

  • Keep data residency explicit

    • Route UK customer documents to UK-hosted infrastructure if your policy requires it.
    • Do not send regulated documents across regions just because the model endpoint is available there (see the region-pinned config sketch after this list).
  • Log for audit without leaking PII

    • Store request IDs, document hashes, model version, prompt version, and validation outcome.
    • Mask account numbers and national identifiers in application logs (see the masking sketch after this list).
  • Add guardrails around confidence and exceptions

    • Reject outputs with missing mandatory fields like name or account number when the document type requires them.
    • Send low-quality scans or ambiguous OCR results to manual review instead of guessing.
  • Monitor drift by document type

    • Track failure rates separately for bank statements, utility bills, payslips, and IDs.
    • A spike in one category usually means OCR degradation or template changes upstream.
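
For the data-residency point, the endpoint your config_list targets is where the routing decision actually lives. AutoGen's config_list entries accept a base_url; the gateway URL below is invented for illustration and should be replaced with your approved regional endpoint.

import os

uk_config_list = [
    {
        "model": "gpt-4o-mini",
        "api_key": os.environ["OPENAI_API_KEY"],
        # Hypothetical UK-hosted gateway; substitute your approved endpoint
        "base_url": "https://llm-gateway.uk.example.internal/v1",
    }
]

Pass this list as llm_config={"config_list": uk_config_list, "temperature": 0} when constructing the assistant that handles UK customer documents.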
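
For the audit-logging point, a small masking helper goes a long way. The sketch below hashes the raw document for traceability and masks all but the last two digits of the account number; the helper names and log format are assumptions for illustration, not a standard.

import hashlib
import logging

logger = logging.getLogger("doc_extraction_audit")

def mask_account_number(value: Optional[str]) -> Optional[str]:
    # Keep only the last two digits so logs stay correlatable but not sensitive
    if value is None or len(value) < 3:
        return value
    return "*" * (len(value) - 2) + value[-2:]

def log_extraction_event(raw_file: bytes, fields: dict, model_version: str, prompt_version: str) -> None:
    logger.info(
        "extraction doc_sha256=%s model=%s prompt=%s account_number=%s",
        hashlib.sha256(raw_file).hexdigest(),  # hash, never the raw document
        model_version,
        prompt_version,
        mask_account_number(fields.get("account_number")),
    )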

Common Pitfalls

  • Letting the model free-form the response

    • If you accept prose instead of JSON you will spend time cleaning bad outputs.
    • Fix it by enforcing a schema and validating with Pydantic before persistence.
  • Skipping OCR quality checks

    • Blurry scans produce confident nonsense.
    • Run preprocessing first and reject pages below your quality threshold before calling AutoGen (see the quality-gate sketch after this list).
  • Ignoring compliance boundaries

    • Sending raw customer documents to unmanaged logs or non-approved regions creates audit problems fast.
    • Fix it with redaction middleware, retention controls, and a whitelist of approved model endpoints, rolled out only after legal sign-off.
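
One way to implement the quality gate from the second pitfall is to check OCR confidence before any agent call. The sketch below assumes pytesseract and Pillow are installed; the 60-point mean-confidence threshold is an arbitrary starting value to tune against your own scan corpus.

import pytesseract
from PIL import Image

MIN_MEAN_CONFIDENCE = 60  # arbitrary starting threshold; tune on real scans

def page_passes_quality_gate(image_path: str) -> bool:
    data = pytesseract.image_to_data(Image.open(image_path), output_type=pytesseract.Output.DICT)
    # Tesseract reports confidence per recognized word; -1 marks non-text boxes
    confidences = [float(c) for c in data["conf"] if float(c) >= 0]
    if not confidences:
        return False  # nothing readable on the page
    return sum(confidences) / len(confidences) >= MIN_MEAN_CONFIDENCE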

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
