How to Build a Document Extraction Agent Using AutoGen in Python for Lending

By Cyprian Aarons · Updated 2026-04-21
Tags: document-extraction, autogen, python, lending

A document extraction agent for lending takes raw borrower documents — bank statements, payslips, IDs, tax returns, business financials — and turns them into structured fields your underwriting system can use. It matters because lending decisions are only as good as the data behind them: faster extraction reduces manual ops work, but in this domain you also need traceability, compliance, and strict control over what the model is allowed to do.

Architecture

  • Document intake layer

    • Receives PDFs, images, and scans from the loan application flow.
    • Normalizes file metadata like applicant ID, document type, and source channel.
  • OCR / text extraction layer

    • Converts scanned pages into text before the LLM sees them.
    • For production lending systems, keep this deterministic and versioned.
  • AutoGen agent layer

    • Uses AssistantAgent to extract fields into a strict schema.
    • Uses UserProxyAgent to execute tools and validate outputs.
  • Validation and policy layer

    • Checks extracted values against business rules:
      • income must be numeric
      • dates must be valid
      • identity fields must match expected formats
    • Rejects or flags low-confidence outputs for human review.
  • Audit logging layer

    • Stores input document hashes, model responses, schema versions, and decision traces.
    • Required for underwriting audits and dispute handling.
  • Persistence / integration layer

    • Writes structured output to your LOS or underwriting service.
    • Keeps raw documents in compliant storage with residency controls.
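
The layers above can be sketched as one thin pipeline. This is a minimal stdlib illustration, not the real implementation: `run_ocr` and `run_agent` are hypothetical stubs standing in for your OCR service and the AutoGen agent built in the Implementation section below.

```python
import hashlib
import json

def run_ocr(document_bytes: bytes) -> str:
    # Hypothetical stub: a real system calls a versioned OCR service here.
    return document_bytes.decode("utf-8", errors="ignore")

def run_agent(document_text: str) -> dict:
    # Hypothetical stub: the AutoGen extraction agent built later in this guide.
    return {"document_type": "bank_statement", "monthly_income": 5250.0}

def process_document(document_bytes: bytes, applicant_id: str) -> dict:
    # Intake layer: hash the raw bytes so the audit trail can prove
    # exactly which document was processed.
    doc_hash = hashlib.sha256(document_bytes).hexdigest()
    text = run_ocr(document_bytes)    # OCR layer
    fields = run_agent(text)          # AutoGen agent layer
    # Validation layer: a deterministic business rule, not an LLM judgment.
    income = fields.get("monthly_income")
    if not isinstance(income, (int, float)) or income <= 0:
        raise ValueError("income must be numeric and positive")
    # Audit layer: record enough to replay the decision later.
    return {"applicant_id": applicant_id, "doc_hash": doc_hash, "fields": fields}

record = process_document(b"ACME BANK STATEMENT ...", applicant_id="app-001")
print(json.dumps(record["fields"]))
```

The persistence layer would then write `record` to your LOS; the point of the sketch is that every layer except the agent stays deterministic.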

Implementation

1) Install AutoGen and define a strict extraction schema

For lending, do not ask the model for “summary” output. Force it into a JSON contract that your pipeline can validate.

from pydantic import BaseModel, Field
from typing import Optional

class BorrowerDocumentFields(BaseModel):
    document_type: str = Field(..., description="bank_statement, payslip, id_card, tax_return")
    full_name: Optional[str] = None
    document_number: Optional[str] = None
    employer_name: Optional[str] = None
    monthly_income: Optional[float] = None
    statement_period_start: Optional[str] = None
    statement_period_end: Optional[str] = None
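
As a quick sanity check, the schema can be exercised directly (it is repeated here so the snippet runs standalone). Pydantic coerces a numeric string into a float and rejects a malformed payload instead of letting it flow downstream:

```python
from pydantic import BaseModel, Field, ValidationError
from typing import Optional

class BorrowerDocumentFields(BaseModel):
    document_type: str = Field(..., description="bank_statement, payslip, id_card, tax_return")
    full_name: Optional[str] = None
    document_number: Optional[str] = None
    employer_name: Optional[str] = None
    monthly_income: Optional[float] = None
    statement_period_start: Optional[str] = None
    statement_period_end: Optional[str] = None

# A well-formed payload: the income string is coerced to a float.
ok = BorrowerDocumentFields.model_validate(
    {"document_type": "payslip", "monthly_income": "5250.00"}
)
print(ok.monthly_income)  # 5250.0

# A malformed payload fails loudly instead of silently becoming null.
try:
    BorrowerDocumentFields.model_validate(
        {"document_type": "payslip", "monthly_income": "n/a"}
    )
except ValidationError as e:
    print("rejected:", e.error_count(), "error(s)")
```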

2) Create an AutoGen assistant that extracts only structured fields

This pattern uses AssistantAgent for reasoning and UserProxyAgent to trigger execution. The assistant is instructed to return JSON matching the schema.

import json
from autogen import AssistantAgent, UserProxyAgent

llm_config = {
    "config_list": [
        {
            "model": "gpt-4o-mini",
            "api_key": "YOUR_OPENAI_API_KEY",
        }
    ],
    "temperature": 0,
}

extractor = AssistantAgent(
    name="doc_extractor",
    llm_config=llm_config,
    system_message=(
        "You extract lending document fields. "
        "Return ONLY valid JSON. "
        "Do not add commentary. "
        "If a field is missing, set it to null."
    ),
)

user_proxy = UserProxyAgent(
    name="validator",
    human_input_mode="NEVER",
    code_execution_config=False,
)

document_text = """
ACME BANK STATEMENT
Name: Jane Doe
Account Number: 12345678
Statement Period: 2024-01-01 to 2024-01-31
Net Monthly Income: $5,250.00
"""

prompt = f"""
Extract these fields from the document text:
- document_type
- full_name
- document_number
- employer_name
- monthly_income
- statement_period_start
- statement_period_end

Document text:
{document_text}
"""

result = user_proxy.initiate_chat(
    extractor,
    message=prompt,
)

print(result.chat_history[-1]["content"])

That gets you the core agent loop. In production you would replace document_text with OCR output from your ingestion service.
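
Model output is also not guaranteed to be valid JSON on the first attempt, so production pipelines usually wrap the call in a bounded retry. A minimal sketch, with a hypothetical `call_extractor` standing in for the `initiate_chat` round-trip above (here it simulates a fenced first response and a clean second one):

```python
import json

def call_extractor(prompt: str, attempt: int) -> str:
    # Hypothetical stand-in for the AutoGen round-trip.
    if attempt == 0:
        return "```json\n{\"document_type\": \"bank_statement\"}\n```"
    return "{\"document_type\": \"bank_statement\"}"

def extract_with_retry(prompt: str, max_attempts: int = 3) -> dict:
    last_error = None
    for attempt in range(max_attempts):
        raw = call_extractor(prompt, attempt)
        try:
            return json.loads(raw)  # strict parse; fenced output fails here
        except json.JSONDecodeError as e:
            last_error = e  # log and retry rather than trusting bad output
    raise ValueError(f"extraction failed after {max_attempts} attempts: {last_error}")

fields = extract_with_retry("extract fields ...")
print(fields["document_type"])
```

Keeping the retry bounded matters in lending: after `max_attempts` failures the document should go to a human review queue, not loop forever.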

3) Parse and validate the model output before storing it

Do not trust raw LLM output in lending. Parse it with a schema validator and reject anything malformed.

from pydantic import ValidationError

raw_output = result.chat_history[-1]["content"]

# Strip markdown code fences if the model wrapped its JSON in them.
cleaned = raw_output.strip()
if cleaned.startswith("```"):
    cleaned = cleaned.strip("`").removeprefix("json")

try:
    parsed = BorrowerDocumentFields.model_validate(json.loads(cleaned))
except (json.JSONDecodeError, ValidationError) as e:
    raise ValueError(f"Invalid extraction payload: {e}") from e

if parsed.document_type == "bank_statement" and parsed.monthly_income is None:
    raise ValueError("Bank statement missing monthly_income")

print(parsed.model_dump())

4) Add a tool for deterministic rule checks

Use tools when you need deterministic validation. AutoGen agents can call Python functions through chat; keep these checks outside the model.

def verify_income(value: float | None) -> dict:
    if value is None or value <= 0:
        return {"ok": False, "reason": "income must be positive"}
    if value > 1_000_000:
        return {"ok": False, "reason": "income out of expected range"}
    return {"ok": True}

check = verify_income(parsed.monthly_income)
print(check)
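
AutoGen can register functions like `verify_income` as callable tools, but the dispatch pattern is worth seeing in isolation. A minimal stdlib sketch of a deterministic tool registry: the model only ever names a tool and passes JSON arguments, while the checks themselves run in plain Python the model cannot alter.

```python
import json

def verify_income(value):
    if value is None or value <= 0:
        return {"ok": False, "reason": "income must be positive"}
    if value > 1_000_000:
        return {"ok": False, "reason": "income out of expected range"}
    return {"ok": True}

# Allow-list of tools: anything not listed here simply cannot run.
TOOLS = {"verify_income": verify_income}

def dispatch(tool_call_json: str) -> dict:
    # The agent emits a tool name plus arguments; the registry decides what runs.
    call = json.loads(tool_call_json)
    tool = TOOLS.get(call["name"])
    if tool is None:
        return {"ok": False, "reason": f"unknown tool: {call['name']}"}
    return tool(**call["args"])

result = dispatch('{"name": "verify_income", "args": {"value": 5250.0}}')
print(result)  # {'ok': True}
```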

Production Considerations

  • Auditability

    • Store the original OCR text hash, prompt version, model name, response payload, validation result, and reviewer outcome.
    • Underwriters need a clear trail when an applicant disputes a decision.
  • Data residency

    • Keep borrower documents in-region if your lending policy or regulator requires it.
    • Make sure any external LLM endpoint matches your residency constraints or route through approved infrastructure.
  • Guardrails

    • Never let the agent approve loans or alter credit policy.
    • Restrict it to extraction plus confidence tagging; all adverse actions should stay in deterministic systems or human review queues.
  • Monitoring

    • Track field-level accuracy by document type.
    • Watch for drift in OCR quality, missing values on new template versions, and spikes in manual overrides.
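
The auditability points above boil down to writing one immutable record per extraction. A hedged stdlib sketch of what such a record might contain (the field names are illustrative, not a standard):

```python
import hashlib
import json
from datetime import datetime, timezone

def build_audit_record(ocr_text: str, prompt_version: str, model: str,
                       response: dict, validation_ok: bool) -> dict:
    return {
        # Hash, not raw text: the record proves what was seen without
        # duplicating sensitive borrower data outside compliant storage.
        "ocr_text_sha256": hashlib.sha256(ocr_text.encode()).hexdigest(),
        "prompt_version": prompt_version,
        "model": model,
        "response": response,
        "validation_ok": validation_ok,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

record = build_audit_record(
    ocr_text="ACME BANK STATEMENT ...",
    prompt_version="extract-v3",
    model="gpt-4o-mini",
    response={"monthly_income": 5250.0},
    validation_ok=True,
)
print(json.dumps(record)[:60])
```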

Common Pitfalls

  1. Letting the model free-form the output

    • This breaks downstream parsing fast.
    • Fix it by enforcing JSON-only responses and validating with Pydantic before persistence.
  2. Mixing extraction with decisioning

    • If the same agent extracts data and recommends approval/decline, you create compliance risk.
    • Keep extraction separate from underwriting rules and credit policy engines.
  3. Ignoring document provenance

    • If you cannot prove where a field came from on the source page, audits get messy.
    • Fix it by storing page references, OCR confidence scores, and document hashes alongside extracted fields.
  4. Skipping region and retention controls

    • Lending data often includes highly sensitive personal information.
    • Define retention windows, encryption at rest/in transit, access controls, and region-specific storage before going live.
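
Pitfall 3 is easiest to avoid by attaching provenance at the field level rather than the document level. One possible shape, purely illustrative:

```python
from dataclasses import dataclass, asdict

@dataclass
class FieldProvenance:
    field_name: str
    value: str
    page: int              # page the value was read from
    ocr_confidence: float  # OCR engine's confidence for that region
    doc_sha256: str        # hash of the source document

prov = FieldProvenance(
    field_name="monthly_income",
    value="5250.00",
    page=1,
    ocr_confidence=0.97,
    doc_sha256="ab12...",  # truncated placeholder for the example
)
print(asdict(prov)["page"])  # 1
```

Storing one such record per extracted field makes "where did this number come from?" a lookup instead of an investigation.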

By Cyprian Aarons, AI Consultant at Topiax.
