How to Build a Document Extraction Agent Using AutoGen in TypeScript for Lending

By Cyprian Aarons · Updated 2026-04-21
document-extraction · autogen · typescript · lending

A document extraction agent for lending reads borrower documents like bank statements, payslips, tax returns, and IDs, then turns them into structured fields your underwriting system can use. It matters because lending decisions depend on fast, accurate, auditable extraction, and manual review is slow, expensive, and inconsistent.

Architecture

  • Document intake layer

    • Accepts PDFs, images, and scans from the loan application flow.
    • Normalizes file metadata like borrower ID, application ID, and document type.
  • OCR and text normalization

    • Converts scanned pages into text before the LLM sees them.
    • Preserves page boundaries so you can trace extracted fields back to source pages.
  • AutoGen extraction agent

    • Uses AssistantAgent to extract structured lending fields from normalized text.
    • Returns JSON only, with a fixed schema for downstream validation.
  • Validation and policy layer

    • Checks extracted values against business rules.
    • Flags missing pages, suspicious totals, expired IDs, or mismatched names.
  • Human review handoff

    • Routes low-confidence or policy-failed cases to an underwriter.
    • Stores the model output plus source evidence for audit.
  • Audit and storage layer

    • Persists raw input hashes, extracted JSON, confidence signals, and reviewer actions.
    • Keeps data residency constraints in mind by storing documents in-region.
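The layers above can be sketched as a typed pipeline contract. Everything below is illustrative: the type and function names are assumptions for this sketch, not AutoGen APIs.

```typescript
// Illustrative contract for the intake-to-review pipeline described above.
// All names here are assumptions for this sketch, not AutoGen APIs.
type IntakeDocument = {
  borrowerId: string;
  applicationId: string;
  documentType: "bank_statement" | "payslip" | "tax_return" | "id";
  // One entry per page, preserved so extracted fields trace back to a source page.
  pages: { pageNumber: number; ocrText: string }[];
};

// Routing decision for the human review handoff layer: any validation error
// or low model confidence sends the case to an underwriter.
function routeForReview(validationErrors: string[], confidence: number): boolean {
  return validationErrors.length > 0 || confidence < 0.9;
}
```

The key design choice is that routing is a pure function of validation output plus confidence, so the same decision can be replayed later from the audit record.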

Implementation

1) Set up AutoGen and define the extraction contract

For lending, do not let the model invent fields. Define a strict schema first, then force the agent to emit only that shape.

import { AssistantAgent } from "@autogen/agentchat";
import { OpenAIChatCompletionClient } from "@autogen/oai";

type ExtractedBorrowerDoc = {
  documentType: "bank_statement" | "payslip" | "tax_return" | "id";
  fullName: string | null;
  documentDate: string | null;
  employerName: string | null;
  monthlyIncome: number | null;
  accountBalance: number | null;
  idNumber: string | null;
  confidence: number;
  evidence: {
    field: string;
    quote: string;
    page?: number;
  }[];
};

const client = new OpenAIChatCompletionClient({
  model: "gpt-4o-mini",
});

const extractor = new AssistantAgent({
  name: "document_extractor",
  modelClient: client,
  systemMessage: `
You extract lending document data.
Return valid JSON only.
Do not guess missing values.
Use the provided schema exactly.
If a value is not present, return null.
Always include evidence quotes from the source text.
`,
});

2) Feed normalized text into the agent and parse the result

In production you usually OCR first with Azure Document Intelligence, Textract, or Tesseract. The pattern below assumes you already have OCR text plus page markers.

import { UserProxyAgent } from "@autogen/agentchat";

const user = new UserProxyAgent({ name: "document_ingest" });

async function extractFromOcrText(ocrText: string): Promise<ExtractedBorrowerDoc> {
  const prompt = `
Extract fields from this lending document.

Document type hint: bank_statement
Source text:
${ocrText}

Return JSON with:
{
  "documentType": "...",
  "fullName": null,
  "documentDate": null,
  "employerName": null,
  "monthlyIncome": null,
  "accountBalance": null,
  "idNumber": null,
  "confidence": number,
  "evidence": [
    { "field": "...", "quote": "...", "page": number }
  ]
}
`;

  const result = await user.initiateChat(extractor, prompt);

  const content =
    typeof result === "string"
      ? result
      : (result as any).content ?? (result as any).messages?.at(-1)?.content;

  if (!content) throw new Error("No extraction output returned");

  return JSON.parse(content) as ExtractedBorrowerDoc;
}
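Two small helpers harden this step. The page-marker format below is a convention we chose for this sketch, not an AutoGen requirement, and the fence-stripping parser guards against models that wrap JSON in Markdown despite a JSON-only instruction.

```typescript
// Join per-page OCR text with explicit markers so the agent can cite page
// numbers in its evidence quotes. The marker format is our own convention.
function withPageMarkers(pages: string[]): string {
  return pages
    .map((text, i) => `--- page ${i + 1} ---\n${text.trim()}`)
    .join("\n\n");
}

// Models sometimes wrap their JSON in Markdown code fences despite a
// JSON-only instruction; strip them defensively before JSON.parse.
function parseModelJson(raw: string): unknown {
  const cleaned = raw
    .trim()
    .replace(/^```(?:json)?\s*/i, "")
    .replace(/\s*```$/, "");
  return JSON.parse(cleaned);
}
```

Using `parseModelJson` in place of a bare `JSON.parse(content)` turns a malformed reply into a thrown error you can route to review instead of a silent crash deep in the pipeline.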

3) Add lending-specific validation before downstream use

This is where most teams get burned. The model can extract a plausible income figure that still fails underwriting rules because it came from the wrong page or a stale statement.

function validateExtraction(doc: ExtractedBorrowerDoc) {
  const errors: string[] = [];

  if (!doc.fullName) errors.push("Missing fullName");
  if (!doc.documentDate) errors.push("Missing documentDate");
  if (doc.confidence < 0.8) errors.push("Low confidence");
  
  if (doc.documentType === "bank_statement" && doc.accountBalance == null) {
    errors.push("Missing accountBalance for bank statement");
  }

  if (doc.documentType === "payslip" && doc.monthlyIncome == null) {
    errors.push("Missing monthlyIncome for payslip");
  }

  return {
    ok: errors.length === 0,
    errors,
  };
}

async function processDocument(ocrText: string) {
  const extracted = await extractFromOcrText(ocrText);
  const validation = validateExtraction(extracted);

  return {
    extracted,
    validation,
    routeToHumanReview: !validation.ok || extracted.confidence < 0.9,
  };
}
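The `0.9` cutoff above is a single global threshold. A per-document-type map is usually safer, since doc classes carry different risk; the numbers below are illustrative and should come from your own per-class quality measurements.

```typescript
// Per-document-type review thresholds. The values are illustrative
// assumptions, not recommendations; calibrate them against measured
// precision and recall for each doc class.
const REVIEW_THRESHOLDS: Record<string, number> = {
  bank_statement: 0.9,
  payslip: 0.9,
  tax_return: 0.85,
  id: 0.95, // identity documents carry the highest fraud risk
};

function needsReview(documentType: string, confidence: number): boolean {
  // Unknown document types default to the strictest threshold.
  const threshold = REVIEW_THRESHOLDS[documentType] ?? 0.95;
  return confidence < threshold;
}
```

`routeToHumanReview` can then call `needsReview(extracted.documentType, extracted.confidence)` instead of comparing against a hardcoded constant.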

4) Persist evidence for audit and compliance

Lending workflows need traceability. Store what was sent to the model, what came back, who reviewed it, and which rule approved or rejected it.

type AuditRecord = {
  applicationId: string;
  documentHash: string;
  extracted: ExtractedBorrowerDoc;
  validationErrors: string[];
  reviewedBy?: string;
  reviewedAt?: string;
};

async function saveAuditRecord(record: AuditRecord) {
  // Replace with your DB write
  console.log(JSON.stringify(record, null, 2));
}
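To populate `documentHash`, hash the raw document bytes at intake so the audit record can prove which exact input produced a given extraction. SHA-256 via Node's built-in `crypto` module is a common choice, not a mandate:

```typescript
import { createHash } from "node:crypto";

// Hash the raw document bytes (PDF, image, or scan) before any OCR or
// normalization, so the audit trail is anchored to the original upload.
function hashDocument(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}
```

Hashing before OCR matters: if you hash the normalized text instead, a change in your OCR pipeline silently breaks the link between stored records and the borrower's original upload.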

Production Considerations

  • Keep data residency explicit

    • Route documents to region-specific OCR and model endpoints.
    • Do not move borrower PII across regions unless your legal team has signed off.
  • Log evidence, not just outputs

    • Store field-level quotes and page references.
  • Add guardrails before underwriting

    • Reject extraction results that violate policy thresholds:

      • expired IDs
      • inconsistent borrower names across documents
      • income outside expected ranges without evidence
      • missing mandatory pages
  • Monitor extraction quality by document type

    • Track precision and recall separately for bank statements, payslips, tax returns, and IDs.
    • A single aggregate accuracy number hides failures on one doc class that can break approvals.
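Two of the guardrails above, expired IDs and mismatched names, can be sketched as pure functions. The field names are assumptions for this sketch; wire them to whatever your extraction schema actually carries.

```typescript
// Reject identity documents whose expiry date has passed.
// expiryDateIso is an assumed ISO-8601 date string, e.g. "2027-03-15".
function isIdExpired(expiryDateIso: string, now: Date = new Date()): boolean {
  return new Date(expiryDateIso).getTime() < now.getTime();
}

// Flag inconsistent borrower names across documents, tolerating
// case and whitespace differences but nothing semantic.
function namesMatch(a: string, b: string): boolean {
  const norm = (s: string) => s.trim().toLowerCase().replace(/\s+/g, " ");
  return norm(a) === norm(b);
}
```

Note that `namesMatch` is deliberately conservative: "J. Doe" vs "Jane Doe" fails and goes to human review, which is the right default for a lending flow.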

Common Pitfalls

  1. Letting the agent free-form its answer

    • Fix this by forcing JSON-only output with a strict schema and rejecting anything else.
    • In lending, free-form answers create silent downstream failures.
  2. Skipping OCR/page provenance

    • Fix this by preserving page numbers in your OCR pipeline and passing them into the agent context.
    • Without provenance you cannot explain why a field was accepted during audit or dispute handling.
  3. Using one confidence threshold for every document

    • Fix this by setting different thresholds per doc type and risk tier.
    • A low-risk address proof can tolerate different uncertainty than a high-value income verification file.
  4. Ignoring compliance controls

    • Fix this by masking unnecessary PII before sending text to the agent and storing only what you need.
    • For lending systems this includes retention policies, access control, audit logs, and regional storage rules.
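For pitfall 1, "rejecting anything else" means a runtime check before any extraction result reaches underwriting. A production system would typically use a schema validator such as Zod; treat this hand-rolled guard as a minimal sketch of the idea.

```typescript
// Minimal runtime guard enforcing the JSON-only contract: unknown document
// types, missing confidence, or missing evidence all fail closed.
function isExtractedDoc(value: unknown): boolean {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const validTypes = ["bank_statement", "payslip", "tax_return", "id"];
  return (
    typeof v.documentType === "string" &&
    validTypes.includes(v.documentType) &&
    typeof v.confidence === "number" &&
    Array.isArray(v.evidence)
  );
}
```

Run this guard on the parsed model output and route failures straight to human review; never coerce or repair a malformed payload inline.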

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.
