How to Build a Document Extraction Agent Using AutoGen in TypeScript for Banking

By Cyprian Aarons · Updated 2026-04-21
Tags: document-extraction, autogen, typescript, banking

A document extraction agent in banking reads incoming PDFs, scans, and images, then pulls out structured fields like customer name, account number, invoice totals, dates, and signatures. It matters because banks process high volumes of KYC forms, loan packets, statements, and claims documents, and manual extraction is slow, inconsistent, and expensive.

Architecture

  • Document ingress

    • Receives files from S3, SharePoint, email ingestion, or a case management system.
    • Normalizes inputs into text plus page-level metadata.
  • Extraction agent

    • Uses an AutoGen AssistantAgent to convert document text into a strict JSON schema.
    • Handles field extraction, confidence notes, and missing-field flags.
  • Validation layer

    • Checks extracted values against banking rules.
    • Examples: account number length, date formats, currency normalization, mandatory KYC fields.
  • Human review queue

    • Routes low-confidence or policy-sensitive documents to an operations analyst.
    • Keeps auditability for regulated workflows.
  • Persistence layer

    • Stores extracted JSON in a database or downstream workflow system.
    • Preserves source document references for traceability.
  • Audit and observability

    • Logs prompts, model outputs, validation errors, and reviewer actions.
    • Needed for compliance review and incident reconstruction.
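The ingress stage's "normalize inputs into text plus page-level metadata" step can be sketched as follows. This is an illustrative helper, not an AutoGen API; it assumes the upstream OCR engine separates pages with form-feed characters, which is a common but not universal convention:

```typescript
// Hypothetical ingress normalizer: splits raw OCR output into page-level
// records plus a joined full-text view for the extraction agent.
interface PageRecord {
  page: number;
  text: string;
}

interface NormalizedDocument {
  id: string;
  pages: PageRecord[];
  fullText: string;
}

function normalizeIngress(id: string, rawOcr: string): NormalizedDocument {
  // Assumption: the OCR engine emits "\f" (form feed) between pages.
  const pages = rawOcr
    .split("\f")
    .map((text, i) => ({ page: i + 1, text: text.trim() }));
  return { id, pages, fullText: pages.map((p) => p.text).join("\n\n") };
}
```

Keeping page numbers alongside the text makes it possible to trace an extracted field back to the page it came from, which matters for the audit layer below.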

Implementation

1) Install dependencies and define the schema

For TypeScript with AutoGen, use the OpenAI-backed agent classes from @autogen/agentchat and keep the output shape strict. In banking workflows, loose parsing is how bad data gets into core systems.

npm install @autogen/agentchat zod

import { AssistantAgent } from "@autogen/agentchat";
import { z } from "zod";

const ExtractedDocumentSchema = z.object({
  documentType: z.enum(["kyc_form", "bank_statement", "invoice", "loan_application"]),
  customerName: z.string().optional(),
  accountNumber: z.string().optional(),
  invoiceTotal: z.string().optional(),
  currency: z.string().optional(),
  issueDate: z.string().optional(),
  confidence: z.number().min(0).max(1),
  missingFields: z.array(z.string()),
});

type ExtractedDocument = z.infer<typeof ExtractedDocumentSchema>;
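For illustration, here is a hand-rolled runtime guard covering the schema's core checks (required fields only). In practice the Zod version above is less error-prone, but the guard shows concretely what "strict" means here: unknown document types, out-of-range confidence, and malformed field lists are all rejected before anything downstream sees them.

```typescript
// Illustrative equivalent of the Zod schema's required-field checks.
const DOCUMENT_TYPES = [
  "kyc_form",
  "bank_statement",
  "invoice",
  "loan_application",
] as const;
type DocumentType = (typeof DOCUMENT_TYPES)[number];

function isExtractedDocument(value: unknown): boolean {
  if (typeof value !== "object" || value === null) return false;
  const v = value as {
    documentType?: unknown;
    confidence?: unknown;
    missingFields?: unknown;
  };
  // documentType must be one of the known enum values.
  if (!DOCUMENT_TYPES.includes(v.documentType as DocumentType)) return false;
  // confidence must be a number in [0, 1].
  if (typeof v.confidence !== "number" || v.confidence < 0 || v.confidence > 1) {
    return false;
  }
  // missingFields must be an array of strings.
  if (!Array.isArray(v.missingFields)) return false;
  return v.missingFields.every((f) => typeof f === "string");
}
```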

2) Create the extraction agent

Use AssistantAgent with a system message that forces structured extraction. The key pattern is to make the agent return only JSON and explicitly forbid invention of missing values.

const extractor = new AssistantAgent({
  name: "document_extractor",
  systemMessage: `
You extract structured data from banking documents.
Return ONLY valid JSON matching this shape:
{
  "documentType": "kyc_form" | "bank_statement" | "invoice" | "loan_application",
  "customerName": string | null,
  "accountNumber": string | null,
  "invoiceTotal": string | null,
  "currency": string | null,
  "issueDate": string | null,
  "confidence": number,
  "missingFields": string[]
}

Rules:
- Do not guess missing values.
- If a field is not present or unreadable, set it to null.
- Keep account numbers exactly as written.
- Normalize dates to YYYY-MM-DD when possible.
- Return confidence between 0 and 1.
`,
});

3) Run extraction against OCR text and validate output

In production you usually feed OCR text into the agent rather than raw PDFs. The example below shows the full flow: send the text to the agent with send(), read the latest response from getMessages(), then validate with Zod before writing anything downstream.

async function extractFromText(documentText: string): Promise<ExtractedDocument> {
  await extractor.send({
    role: "user",
    content: `Extract fields from this banking document:\n\n${documentText}`,
  });

  const messages = extractor.getMessages();
  const lastMessage = messages[messages.length - 1];

  if (!lastMessage?.content || typeof lastMessage.content !== "string") {
    throw new Error("No textual response returned by extractor");
  }

  const parsed = JSON.parse(lastMessage.content);

  const result = ExtractedDocumentSchema.safeParse(parsed);
  if (!result.success) {
    throw new Error(`Invalid extraction payload: ${result.error.message}`);
  }

  return result.data;
}

That pattern is intentionally simple. In a real service you would wrap it in retry logic for transient model failures and add a separate parser for multi-part responses if your model returns tool-call style output.
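A minimal sketch of such a retry wrapper, assuming the failures are transient and the call is safe to repeat; the function name, attempt count, and backoff values are illustrative, not AutoGen APIs:

```typescript
// Retry an async operation with exponential backoff between attempts.
// Assumes the operation is idempotent (safe to call again after a failure).
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off before the next attempt: 500ms, 1000ms, 2000ms, ...
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}

// Usage: const doc = await withRetry(() => extractFromText(ocrText));
```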

4) Add banking-specific validation before persistence

Extraction is not enough. Banking systems need rule checks before data lands in case management or downstream decisioning.

function validateBankingRules(doc: ExtractedDocument): string[] {
  const errors: string[] = [];

  if (doc.documentType === "kyc_form" && !doc.customerName) {
    errors.push("customerName is required for KYC forms");
  }

  if (doc.accountNumber && !/^\d{8,20}$/.test(doc.accountNumber.replace(/\s+/g, ""))) {
    errors.push("accountNumber format is invalid");
  }

  if (doc.invoiceTotal && !/^\d+(\.\d{2})?$/.test(doc.invoiceTotal)) {
    errors.push("invoiceTotal must be numeric, with exactly two decimals if present");
  }

  if (doc.confidence < 0.85) {
    errors.push("confidence below review threshold");
  }

  return errors;
}

A practical routing rule looks like this:

  • If validation passes and confidence is high enough, persist automatically.
  • If confidence is low or fields are missing, send to human review.
  • If the document touches sanctions screening or identity verification, require analyst approval regardless of confidence.
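Those three rules can be sketched as a small routing function. The types, names, and the 0.85 threshold are illustrative assumptions for this article, not part of AutoGen:

```typescript
// Possible routing outcomes for an extracted document.
type Route = "auto_persist" | "human_review" | "analyst_approval";

interface RoutingInput {
  validationErrors: string[];          // output of the banking-rule checks
  confidence: number;                  // 0..1, reported by the extractor
  touchesSanctionsOrIdentity: boolean; // policy flag from document classification
}

function routeDocument(input: RoutingInput, threshold = 0.85): Route {
  // Sanctions screening or identity verification always needs an analyst,
  // regardless of how confident the model is.
  if (input.touchesSanctionsOrIdentity) return "analyst_approval";

  // Clean validation plus high confidence: persist automatically.
  if (input.validationErrors.length === 0 && input.confidence >= threshold) {
    return "auto_persist";
  }

  // Everything else goes to the human review queue.
  return "human_review";
}
```

Making the policy flag an explicit input, rather than inferring it from the model output, keeps the most sensitive routing decision out of the model's hands.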

Production Considerations

  • Data residency

    • Keep OCR text and extracted payloads in-region if your bank has jurisdictional restrictions.
    • Do not ship raw customer documents across regions just because the model endpoint is elsewhere.
  • Audit logging

    • Log input document IDs, prompt versions, model version, output JSON hash, validation results, and reviewer decisions.
    • Store enough context to reconstruct why a field was accepted or rejected.
  • Guardrails

    • Enforce schema validation before persistence.
    • Reject outputs that contain unsupported fields or inferred values not present in the source document.
    • Add redaction for PANs, SSNs/NINs, passport numbers before sending content to any model if policy requires it.
  • Operational monitoring

    • Track extraction failure rate, validation error rate, and review queue depth.
    • Watch for confidence score drift; it often signals a new document layout or a model change.

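The redaction guardrail above can be sketched with simple pattern matching. These patterns are deliberately naive illustrations; production redaction should use validated detectors (for example, Luhn checks for PANs) and follow your bank's PII policy:

```typescript
// Illustrative pre-send redaction for PAN-like and SSN-shaped values.
// NOT production-grade: real detectors validate checksums and context.
function redactSensitive(text: string): string {
  return text
    // Runs of 13-19 digits, allowing spaces or dashes between groups
    // (the shape of a card number / PAN).
    .replace(/\b(?:\d[ -]?){13,19}\b/g, "[PAN REDACTED]")
    // US SSN shape: 123-45-6789.
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN REDACTED]");
}
```

Run this on OCR text before it leaves your trust boundary, and log that redaction happened so the audit trail explains why the model never saw the raw values.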
By Cyprian Aarons, AI Consultant at Topiax.