# How to Build a Document Extraction Agent Using AutoGen in TypeScript for Fintech
A document extraction agent takes PDFs, scans, bank statements, invoices, KYC forms, and trade confirmations, then turns them into structured JSON your downstream systems can trust. In fintech, that matters because manual extraction is slow, expensive, and error-prone, and every bad field can turn into a compliance issue, a rejected payment, or a broken onboarding flow.
## Architecture
- **Document ingress**
  - Accept PDFs, images, or text from object storage, SFTP drops, or an upload API.
  - Keep the raw document immutable for audit and replay.
- **OCR / text normalization**
  - Convert scans to machine-readable text before the LLM sees it.
  - Preserve page numbers and bounding context when available.
- **Extraction agent**
  - Use an AutoGen `AssistantAgent` to extract fields into a strict schema.
  - Force structured output so the model does not free-form its way through the task.
- **Validation layer**
  - Run deterministic checks on dates, currency formats, account numbers, totals, and required fields.
  - Reject or flag low-confidence results for human review.
- **Audit and trace store**
  - Persist prompts, model outputs, document hashes, and validation results.
  - This is non-negotiable in regulated workflows.
- **Human review queue**
  - Route exceptions to ops or compliance when confidence drops below threshold.
  - Keep the reviewer’s correction as training data for later prompt refinement.
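The stages above can be modeled as a small typed pipeline. This is only a sketch: the `Stage` interface, the `compose` helper, and the stand-in OCR step are illustrative assumptions, not AutoGen APIs.

```typescript
// Illustrative stage contract for the pipeline above; every name here is an
// assumption, not part of AutoGen.
interface Stage<In, Out> {
  run(input: In): Promise<Out>;
}

// Composing stages keeps each failure domain (OCR, extraction, validation)
// isolated and independently testable.
function compose<A, B, C>(first: Stage<A, B>, second: Stage<B, C>): Stage<A, C> {
  return { run: async (input) => second.run(await first.run(input)) };
}

// Stand-in for a real OCR service: bytes in, normalized text out.
const ocr: Stage<Uint8Array, string> = {
  run: async (bytes) => new TextDecoder().decode(bytes),
};
```

Because each stage is just `run(input) -> Promise<output>`, you can swap the OCR vendor or the extraction agent without touching the rest of the pipeline.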
## Implementation
### 1) Set up AutoGen and define the extraction contract
In TypeScript, use AutoGen’s AssistantAgent with a schema-first prompt. For fintech work, do not ask for “best effort” extraction; define the exact fields you need and what to do when data is missing.
```typescript
import { AssistantAgent } from "@microsoft/autogen";
import * as fs from "node:fs";

type ExtractedDocument = {
  documentType: "bank_statement" | "invoice" | "kyc_form" | "trade_confirmation";
  entityName: string;
  issueDate: string;
  currency: string;
  totalAmount?: number;
  accountNumberLast4?: string;
  iban?: string;
  confidence: number;
};

const extractor = new AssistantAgent({
  name: "document_extractor",
  systemMessage: [
    "You extract structured data from financial documents.",
    "Return only valid JSON matching the requested schema.",
    "If a field is missing, use null.",
    "Do not invent values.",
    "Preserve dates in ISO-8601 format."
  ].join(" "),
});
```
### 2) Load the document text and call the agent
If you already have OCR output from Textract, Azure Document Intelligence, or Tesseract, feed that text directly. If not, keep OCR outside the LLM layer so your pipeline stays deterministic.
```typescript
async function extractFromText(documentText: string): Promise<ExtractedDocument> {
  const prompt = `
Extract these fields from the document:
- documentType
- entityName
- issueDate
- currency
- totalAmount
- accountNumberLast4
- iban
- confidence

Document:
${documentText}
`;

  const result = await extractor.generateReply([{ role: "user", content: prompt }]);
  const content = typeof result === "string" ? result : result.content;
  return JSON.parse(content) as ExtractedDocument;
}

async function main() {
  const docText = fs.readFileSync("./sample-document.txt", "utf8");
  const extracted = await extractFromText(docText);
  console.log(extracted);
}

main().catch(console.error);
```
This pattern works because AutoGen handles the conversational interface while your code keeps control over input/output boundaries. In production, wrap JSON.parse in a validator like zod so malformed responses fail closed.
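As a sketch of that fail-closed step, here is a hand-rolled validator over a trimmed-down field set; in production a schema library like zod expresses the same checks declaratively. The `parseExtraction` helper name and the `Extracted` type are assumptions for illustration.

```typescript
// Hand-rolled fail-closed parser: any malformed model output throws instead
// of flowing downstream. A schema library like zod replaces this boilerplate.
type Extracted = {
  documentType: string;
  entityName: string;
  issueDate: string;
  currency: string;
  confidence: number;
};

function parseExtraction(raw: string): Extracted {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error("model output is not valid JSON");
  }
  const obj = parsed as Record<string, unknown>;
  const isNonEmptyString = (v: unknown): v is string =>
    typeof v === "string" && v.length > 0;
  if (
    !isNonEmptyString(obj.documentType) ||
    !isNonEmptyString(obj.entityName) ||
    !isNonEmptyString(obj.issueDate) ||
    !isNonEmptyString(obj.currency) ||
    typeof obj.confidence !== "number"
  ) {
    throw new Error("model output does not match the extraction schema");
  }
  return obj as Extracted;
}
```

The point is the failure mode: a missing or mistyped field raises before anything is written to a core system, rather than silently passing `undefined` along.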
### 3) Add validation and exception routing
Fintech extraction is not done when the model returns JSON. It is done when the data passes business rules.
```typescript
function validateExtraction(doc: ExtractedDocument): string[] {
  const errors: string[] = [];
  if (!doc.entityName) errors.push("entityName is required");
  if (!doc.issueDate || Number.isNaN(Date.parse(doc.issueDate))) {
    errors.push("issueDate must be ISO date");
  }
  if (!doc.currency || doc.currency.length !== 3) {
    errors.push("currency must be ISO-4217 code");
  }
  if (doc.confidence < 0.85) errors.push("confidence below threshold");
  return errors;
}

async function processDocument(text: string) {
  const extracted = await extractFromText(text);
  const errors = validateExtraction(extracted);
  if (errors.length > 0) {
    return {
      status: "needs_review",
      extracted,
      errors,
    };
  }
  return {
    status: "approved",
    extracted,
  };
}
```
Tune that threshold per document type: a KYC form can tolerate more residual uncertainty than a payment instruction or a sanctions-related record.
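One way to encode per-type thresholds is a lookup table. The numbers below are illustrative assumptions to tune against your own human-review outcomes, not recommendations.

```typescript
// Illustrative confidence thresholds per document type; calibrate these
// against field-level accuracy measured on reviewed documents.
type DocumentType = "bank_statement" | "invoice" | "kyc_form" | "trade_confirmation";

const confidenceThreshold: Record<DocumentType, number> = {
  invoice: 0.85,
  bank_statement: 0.9,
  kyc_form: 0.92,
  trade_confirmation: 0.95, // payment-adjacent documents get the strictest bar
};

function needsReview(docType: DocumentType, confidence: number): boolean {
  return confidence < confidenceThreshold[docType];
}
```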
### 4) Store traces for auditability
For regulated workflows, persist enough metadata to reconstruct every decision. At minimum store a hash of the source file, model name/version, prompt version, extracted payload, validation outcome, and reviewer override if one exists.
```typescript
import crypto from "node:crypto";

function sha256(input: Buffer | string): string {
  return crypto.createHash("sha256").update(input).digest("hex");
}

type AuditRecord = {
  documentHash: string;
  modelVersion: string;
  promptVersion: string;
  extracted: ExtractedDocument;
  validationErrors: string[];
  reviewerOverride?: ExtractedDocument;
  createdAt: string;
};
```
Keep raw documents in region-bound storage that matches your residency requirements. If you operate in multiple jurisdictions, partition by tenant and region before any LLM call happens.
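A minimal sketch of that partitioning, assuming two regions and path-based bucket routing; the bucket names and region list are made up for illustration:

```typescript
// Region-pinned storage routing decided before any LLM call. Bucket names
// and regions here are assumptions; substitute your own topology.
type Region = "eu-west-1" | "us-east-1";

const bucketByRegion: Record<Region, string> = {
  "eu-west-1": "docs-eu",
  "us-east-1": "docs-us",
};

function storageKey(tenantId: string, region: Region, documentHash: string): string {
  // Partitioning by tenant and region means residency is enforced by the
  // storage path itself, not by downstream discipline.
  return `${bucketByRegion[region]}/${tenantId}/${documentHash}`;
}
```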
## Production Considerations
- **Deployment**
  - Run OCR and extraction as separate services so failures are isolated.
  - Keep tenant-specific queues if you handle multiple banks or legal entities.
- **Monitoring**
  - Track parse failure rate, human-review rate, field-level accuracy by document type, and latency per page.
  - Alert on sudden shifts; they often mean template drift or upstream OCR regression.
- **Guardrails**
  - Enforce schema validation before writing to core systems.
  - Block unsupported document types instead of letting the model guess.
  - Redact PII where possible before logs or analytics capture it.
- **Compliance and residency**
  - Pin workloads to approved regions.
  - Log access to source docs and outputs.
  - Retain artifacts according to policy; delete when retention expires.
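For the PII guardrail, a minimal log-redaction pass might look like this. The two patterns are illustrative, not an exhaustive PII policy.

```typescript
// Minimal log-redaction sketch: mask IBAN-like tokens and long digit runs
// before anything reaches logs or analytics. Patterns are illustrative only.
function redactForLogs(text: string): string {
  return text
    // IBAN-like tokens: two letters, two check digits, 11-30 alphanumerics.
    .replace(/\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b/g, "[IBAN]")
    // Long digit runs (account or card numbers), keeping the last 4 digits.
    .replace(/\b\d{8,}(\d{4})\b/g, "****$1");
}
```

Run redaction at the logging boundary (e.g. a logger wrapper), so no call site can forget it.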
## Common Pitfalls
- **Letting the model “interpret” instead of extract**
  - Fix this by using strict schemas and rejecting invented values.
  - If a field is absent in the source doc, return null.
- **Skipping deterministic validation**
  - The LLM should not decide whether an IBAN format is valid or whether totals reconcile.
  - Put those checks in code every time.
- **Ignoring audit requirements**
  - If you cannot reproduce why a field was accepted or rejected, you do not have a production fintech workflow.
  - Store prompts, outputs, hashes, timestamps, and reviewer actions with each run.
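For example, the IBAN format check is fully deterministic via the ISO 13616 mod-97 rule, with no model involved:

```typescript
// Deterministic IBAN checksum validation (ISO 13616 mod-97), kept in code
// rather than delegated to the model.
function isValidIban(iban: string): boolean {
  const s = iban.replace(/\s+/g, "").toUpperCase();
  // Country code, two check digits, then 11-30 alphanumeric BBAN characters.
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(s)) return false;
  // Move the first four characters to the end, then map letters A-Z to 10-35.
  const rearranged = s.slice(4) + s.slice(0, 4);
  const numeric = rearranged.replace(/[A-Z]/g, (c) => String(c.charCodeAt(0) - 55));
  // A valid IBAN leaves a remainder of 1 modulo 97.
  return BigInt(numeric) % 97n === 1n;
}
```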
## Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit, a PDF checklist plus starter code
- Work with me: I build AI for banks and insurance companies

By Cyprian Aarons, AI Consultant at Topiax.