How to Build a Document Extraction Agent Using AutoGen in TypeScript for Investment Banking

By Cyprian Aarons · Updated 2026-04-21
Tags: document-extraction, autogen, typescript, investment-banking

A document extraction agent in investment banking takes messy deal documents — pitch decks, CIMs, credit agreements, term sheets, diligence packs — and turns them into structured fields you can route into downstream systems. That matters because bankers spend too much time retyping data, and mistakes in extracted covenants, dates, parties, or financials create real operational and compliance risk.

Architecture

A production agent for this use case needs a narrow, controlled shape:

  • Document ingestion layer

    • Accept PDF, DOCX, and image-based scans.
    • Normalize files before sending them to the model.
  • OCR / text extraction service

    • Use OCR for scanned pages.
    • Preserve page numbers and bounding context for auditability.
  • AutoGen extraction agent

    • An AssistantAgent that converts raw text into a strict JSON schema.
    • Keep the prompt focused on extraction only.
  • Validation and reconciliation layer

    • Validate output against a TypeScript schema.
    • Compare extracted values against source snippets for traceability.
  • Human review queue

    • Route low-confidence or high-risk fields to analysts.
    • Never auto-post material terms without review.
  • Audit log and storage

    • Persist inputs, outputs, prompts, model version, timestamps, and reviewer actions.
    • Required for compliance and post-trade / deal review evidence.
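The layers above can be sketched as a typed pipeline. The stage names and ordering check below are illustrative assumptions, not part of AutoGen:

```typescript
// Illustrative pipeline stages for the architecture above.
// Stage names are assumptions, not AutoGen concepts.
type Stage =
  | "ingestion"
  | "ocr"
  | "extraction"
  | "validation"
  | "review"
  | "audit";

interface PipelineEvent {
  stage: Stage;
  documentId: string;
  timestamp: string; // ISO 8601, kept for the audit trail
}

// Every document should pass through each stage in order, so a
// simple ordering check can guard against skipped steps.
const STAGE_ORDER: Stage[] = [
  "ingestion",
  "ocr",
  "extraction",
  "validation",
  "review",
  "audit"
];

function isValidTransition(from: Stage, to: Stage): boolean {
  return STAGE_ORDER.indexOf(to) === STAGE_ORDER.indexOf(from) + 1;
}
```

Encoding the stage order as data makes it easy to reject a document that tries to jump straight from OCR to the review queue without validation.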

Implementation

1) Install AutoGen and define the extraction schema

Use AutoGen’s TypeScript packages plus a validator like Zod. For investment banking, keep the output schema tight: parties, dates, amounts, covenants, governing law, and source references.

npm install @autogenai/agent zod

import { z } from "zod";

export const ExtractionSchema = z.object({
  documentType: z.string(),
  counterparty: z.string().nullable(),
  effectiveDate: z.string().nullable(),
  maturityDate: z.string().nullable(),
  currency: z.string().nullable(),
  principalAmount: z.number().nullable(),
  governingLaw: z.string().nullable(),
  covenants: z.array(z.string()),
  sourceReferences: z.array(
    z.object({
      field: z.string(),
      page: z.number(),
      quote: z.string()
    })
  )
});

export type ExtractionResult = z.infer<typeof ExtractionSchema>;

2) Build the AutoGen assistant with a strict system prompt

The key pattern is to make the agent extract only from provided text. Don’t let it infer missing values. In banking workflows, hallucinated legal or financial terms are unacceptable.

import { AssistantAgent } from "@autogenai/agent";

export const extractor = new AssistantAgent({
  name: "document_extractor",
  systemMessage: [
    "You extract structured data from investment banking documents.",
    "Return only valid JSON matching the requested schema.",
    "Do not infer missing values.",
    "If a field is absent or unclear, use null.",
    "Every non-null field must include a source reference with page number and exact quote."
  ].join(" ")
});

3) Send the document text to the agent and validate the result

This example assumes you already ran OCR or text extraction upstream. The important part is that AssistantAgent.send gets a single task with explicit instructions and source text. Then validate the returned JSON before it enters any workflow.

import { ExtractionSchema } from "./schema";
import { extractor } from "./agent";

async function extractFromDocument(documentText: string) {
  const task = `
Extract deal metadata from the following investment banking document.

Return JSON with:
- documentType
- counterparty
- effectiveDate
- maturityDate
- currency
- principalAmount
- governingLaw
- covenants
- sourceReferences

Document text:
${documentText}
`;

  const response = await extractor.send(task);

  const content =
    typeof response === "string"
      ? response
      : response?.content ?? JSON.stringify(response);

  // Models sometimes wrap JSON in markdown fences; strip them before parsing.
  const cleaned = content.replace(/^```(?:json)?\s*|\s*```$/g, "").trim();

  const parsed = JSON.parse(cleaned); // throws on malformed output
  return ExtractionSchema.parse(parsed); // throws ZodError on schema mismatch
}
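Both the JSON parse and the schema check above throw on bad model output. A small retry helper, sketched here as a generic wrapper (the attempt count is an illustrative default, not an AutoGen feature), keeps transient failures from surfacing as hard errors:

```typescript
// Generic retry wrapper for flaky extraction calls.
// The default attempt count is an illustrative choice.
async function withRetries<T>(
  fn: () => Promise<T>,
  attempts = 3
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // e.g. a JSON.parse error or a ZodError
    }
  }
  throw lastError;
}
```

Usage would look like `withRetries(() => extractFromDocument(text))`, so a single malformed completion triggers a re-ask instead of failing the whole document.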

4) Add routing for human review on risky fields

In practice you do not want every extraction to go straight through. If the agent finds a principal amount but no source reference, or if legal language is ambiguous, route it to a reviewer. That gives you control over compliance-sensitive fields.

function needsReview(result: ExtractionResult): boolean {
  // Arrays are always truthy, so count populated scalar fields explicitly
  // rather than using Object.values(result).filter(Boolean).
  const scalars = [result.counterparty, result.effectiveDate, result.maturityDate,
    result.currency, result.principalAmount, result.governingLaw];
  const populated = scalars.filter((v) => v !== null).length;
  return (
    result.principalAmount === null ||
    result.governingLaw === null ||
    result.sourceReferences.length < populated // one reference per populated field
  );
}

async function processDocument(documentText: string) {
  const result = await extractFromDocument(documentText);

  if (needsReview(result)) {
    return {
      status: "needs_review",
      payload: result
    };
  }

  return {
    status: "approved",
    payload: result
  };
}
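The two statuses returned by processDocument need somewhere to go. A minimal routing sketch, using in-memory queues purely for illustration (a real system would use a durable store), looks like this:

```typescript
// Statuses match the shape returned by processDocument above.
type Status = "approved" | "needs_review";

interface QueueItem<T> {
  status: Status;
  payload: T;
}

// In-memory queues for illustration only; production systems
// should persist these in a durable, access-controlled store.
const reviewQueue: QueueItem<unknown>[] = [];
const approvedQueue: QueueItem<unknown>[] = [];

function route<T>(item: QueueItem<T>): void {
  if (item.status === "needs_review") {
    reviewQueue.push(item);
  } else {
    approvedQueue.push(item);
  }
}
```

The point of the split is that nothing in the approved path should ever pull from the review queue without a recorded analyst decision.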

Production Considerations

  • Keep data residency explicit

    • If documents contain MNPI or client-confidential data, pin processing to approved regions.
    • Do not send deal docs to unmanaged endpoints or consumer-grade model APIs.
  • Log everything needed for audit

    • Store prompt version, model name, document hash, extracted output, reviewer decision, and timestamps.
    • In banking, you need reproducibility when Legal or Compliance asks why a field was populated.
  • Use confidence gating by field type

    • Treat covenant extraction differently from descriptive metadata.
    • High-risk fields such as leverage ratios, maturity dates, and change-of-control clauses should always have review thresholds.
  • Redact before downstream distribution

    • Only expose extracted fields required by the next system.
    • Strip client names or confidential terms from analytics pipelines unless there is a clear business need.
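The audit fields listed above can be captured in a small record type. The field names here are illustrative assumptions; the document hash uses Node's built-in crypto module:

```typescript
import { createHash } from "node:crypto";

// Illustrative audit record; field names are assumptions, not a standard.
interface AuditRecord {
  documentHash: string;    // SHA-256 of the raw document text
  promptVersion: string;
  modelName: string;
  extractedOutput: string; // serialized ExtractionResult
  reviewerDecision: "approved" | "needs_review" | null;
  timestamp: string;       // ISO 8601
}

function hashDocument(raw: string): string {
  return createHash("sha256").update(raw).digest("hex");
}

function buildAuditRecord(
  raw: string,
  output: string,
  promptVersion: string,
  modelName: string
): AuditRecord {
  return {
    documentHash: hashDocument(raw),
    promptVersion,
    modelName,
    extractedOutput: output,
    reviewerDecision: null, // filled in once a reviewer acts
    timestamp: new Date().toISOString()
  };
}
```

Hashing the raw input rather than storing the full document in the log lets you prove which bytes were processed without duplicating confidential content into the audit store.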

Common Pitfalls

  1. Letting the model infer missing facts

    • Bad pattern: “Fill in likely maturity date if not stated.”
    • Fix: force null for absent fields and require exact quotes for every populated value.
  2. Skipping source references

    • Without page-level evidence you cannot defend an extraction in an audit or client dispute.
    • Fix: make sourceReferences mandatory for every non-null field.
  3. Treating all documents as equal

    • A teaser is not a credit agreement. A scanned term sheet is not a clean PDF.
    • Fix: classify document type first and apply different prompts, validators, and review rules per document class.
  4. Ignoring operational controls

    • An extraction agent that works in dev but has no logging, region control, or access boundaries will fail in production.
    • Fix: wrap AutoGen behind your own service layer with authz checks, encrypted storage, retention policies, and human approval gates where required.
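Pitfall 3 calls for classification before extraction. The keyword heuristic below is a deliberately simple sketch with made-up keyword lists; a production system would more likely use a model-based classifier, but even this cheap first pass lets you pick a prompt and validator per document class:

```typescript
type DocClass = "credit_agreement" | "term_sheet" | "teaser" | "unknown";

// Naive keyword classifier; keyword lists here are illustrative.
const KEYWORDS: Record<Exclude<DocClass, "unknown">, string[]> = {
  credit_agreement: ["credit agreement", "covenant", "event of default"],
  term_sheet: ["term sheet", "indicative terms"],
  teaser: ["teaser", "investment opportunity"]
};

function classifyDocument(text: string): DocClass {
  const lower = text.toLowerCase();
  let best: DocClass = "unknown";
  let bestHits = 0;
  for (const [cls, words] of Object.entries(KEYWORDS)) {
    const hits = words.filter((w) => lower.includes(w)).length;
    if (hits > bestHits) {
      bestHits = hits;
      best = cls as DocClass;
    }
  }
  return best;
}
```

Anything classified as "unknown" is a natural candidate for the human review queue rather than automated extraction.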

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.
