How to Build a Document Extraction Agent Using CrewAI in TypeScript for Investment Banking

By Cyprian Aarons · Updated 2026-04-21
Tags: document-extraction, crewai, typescript, investment-banking

A document extraction agent for investment banking reads deal documents, pulls out structured fields, and hands them to downstream systems with traceability. That matters because bankers spend too much time copying data from CIMs, term sheets, pitch decks, and credit memos into models and trackers, and mistakes here become compliance issues, pricing errors, or bad IC materials.

Architecture

Build this agent as a pipeline, not a single prompt.

  • Document intake layer

    • Accept PDFs, DOCX files, and scanned images from controlled sources like SharePoint, S3, or an internal DMS.
    • Enforce file provenance and document IDs before any extraction starts.
  • Text extraction layer

    • Use OCR for scanned pages and native text extraction for digital documents.
    • Preserve page numbers, section headings, and table boundaries because bankers need citations.
  • CrewAI agent layer

    • Use a focused extraction agent with a strict schema.
    • Add a validation agent that checks completeness, confidence, and policy violations.
  • Normalization layer

    • Map extracted values into a canonical deal schema.
    • Normalize currencies, dates, company names, covenant ratios, and jurisdiction fields.
  • Audit and storage layer

    • Store raw text, extracted JSON, confidence scores, and source page references.
    • Keep immutable logs for review by compliance or operations.

Implementation

1) Install the TypeScript stack

Use CrewAI’s TypeScript SDK with an LLM provider and a document parser/OCR library. The exact parser depends on your input format; the agent code below assumes you already have extracted text plus metadata.

npm install @crewai/crewai zod dotenv

Set your environment variables:

CREWAI_API_KEY=...
OPENAI_API_KEY=...
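It helps to fail fast at startup when a key is missing rather than letting an agent call fail mid-pipeline with a less obvious error. A minimal guard, using the variable names from above:

```typescript
// Fail fast on missing credentials before any agent work starts.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Call once at startup, before constructing any agents:
//   const crewaiKey = requireEnv("CREWAI_API_KEY");
//   const openaiKey = requireEnv("OPENAI_API_KEY");
```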

2) Define the extraction schema

For investment banking, do not ask for free-form output. Force the model into a typed structure so downstream systems can validate it.

import { z } from "zod";

export const DealExtractionSchema = z.object({
  dealName: z.string(),
  targetCompany: z.string(),
  sponsor: z.string().optional(),
  transactionType: z.enum(["M&A", "LBO", "IPO", "Debt Financing", "Refinancing", "Other"]),
  currency: z.string(),
  enterpriseValue: z.number().optional(),
  equityValue: z.number().optional(),
  closingDate: z.string().optional(),
  jurisdictions: z.array(z.string()).default([]),
  keyTerms: z.array(
    z.object({
      label: z.string(),
      value: z.string(),
      sourcePage: z.number().optional()
    })
  ).default([]),
  citations: z.array(
    z.object({
      field: z.string(),
      page: z.number(),
      quote: z.string()
    })
  ).default([])
});

export type DealExtraction = z.infer<typeof DealExtractionSchema>;

3) Build the CrewAI agents and task

Use one agent to extract facts and another to verify them. In banking workflows this separation matters because it creates a reviewable control point.

import "dotenv/config";
import { Agent, Task, Crew } from "@crewai/crewai";

const extractor = new Agent({
  role: "Investment Banking Document Extractor",
  goal:
    "Extract structured deal data from banking documents with page-level citations and no invented values.",
  backstory:
    "You work on M&A and financing teams. You only return facts supported by the source text.",
});

const verifier = new Agent({
  role: "Investment Banking Data Validator",
  goal:
    "Check extracted deal data for completeness, consistency, and compliance issues.",
  backstory:
    "You validate against internal controls. You flag missing citations, ambiguous terms, and policy risks.",
});

const extractTask = new Task({
  description: `
Extract the following fields from the provided investment banking document:
dealName, targetCompany, sponsor, transactionType,
currency, enterpriseValue, equityValue, closingDate,
jurisdictions, keyTerms, citations.

Rules:
- Every non-obvious field must have a citation.
- Use only information present in the document.
- If a field is missing, omit it or return an empty array; do not guess.
- Preserve page numbers for every citation.
`,
  expectedOutput: "Valid JSON matching the DealExtractionSchema.",
  agent: extractor,
});

const validateTask = new Task({
  description:
    "Review the extracted JSON for completeness and control issues. Return only corrections or flags.",
  expectedOutput:
    "A list of validation issues with severity and suggested fix.",
  agent: verifier,
});

4) Run the crew and enforce schema validation

This is the pattern you want in production: generate output from the crew, parse it through Zod, then reject anything that does not pass.

import { DealExtractionSchema } from "./schema"; // adjust to wherever the step 2 file lives

async function main() {
  const crew = new Crew({
    agents: [extractor, verifier],
    tasks: [extractTask, validateTask],
    verbose: true,
  });

  const result = await crew.kickoff({
    inputs: {
      documentText: `
        Page 1
        Confidential Information Memorandum
        Target Company: Northbridge Logistics Ltd.
        Transaction Type: Sale process
        Enterprise Value: USD $420 million
        Expected Closing Date: Q4 2025
        Page 2
        Key Terms include change-of-control consent and minimum liquidity covenant.
      `,
      documentId: "doc_2025_001",
      sourceSystem: "sharepoint",
      region: "us-east-1",
    },
  });

  const parsed = DealExtractionSchema.safeParse(JSON.parse(String(result)));
  if (!parsed.success) {
    throw new Error(`Schema validation failed: ${parsed.error.message}`);
  }
  console.log(parsed.data);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

In practice you will wrap JSON.parse in a stricter response handler because LLMs sometimes emit extra text. The important part is that CrewAI handles orchestration while Zod enforces contract validity before anything hits your deal database.
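One such handler strips markdown fences and surrounding prose before parsing. This is a sketch under the assumption that the model emits exactly one JSON object; adjust it to whatever your model actually produces:

```typescript
// Extract the first JSON object from raw LLM output, tolerating
// markdown code fences and surrounding commentary. A sketch, not
// a full parser: assumes a single top-level object in the output.
function extractJsonObject(raw: string): unknown {
  // Drop markdown code fences if present (``` or ```json).
  const cleaned = raw.replace(/`{3}(?:json)?/g, "");
  // Take the substring between the first "{" and the last "}".
  const start = cleaned.indexOf("{");
  const end = cleaned.lastIndexOf("}");
  if (start === -1 || end === -1 || end <= start) {
    throw new Error("No JSON object found in model output");
  }
  return JSON.parse(cleaned.slice(start, end + 1));
}
```

Then the validation step becomes `DealExtractionSchema.safeParse(extractJsonObject(String(result)))`, so fenced or chatty output no longer crashes `JSON.parse`.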

Production Considerations

  • Compliance controls

    • Log every input document ID, user requestor ID, model version, prompt version, and output hash.
    • Keep an audit trail that legal/compliance can replay during reviews or disputes.
  • Data residency

    • Pin processing to approved regions such as us-east-1 or an EU region depending on desk policy.
    • Do not send client-confidential materials across borders without explicit approval.
  • Monitoring

    • Track extraction accuracy by field type; EV/date/company name errors are not equal.
    • Alert on low-confidence outputs, missing citations, or sudden drift after model upgrades.
  • Guardrails

| Control | Why it matters | Implementation |
| --- | --- | --- |
| Schema validation | Prevents malformed JSON entering downstream systems | Zod or equivalent runtime validation |
| Citation requirement | Supports auditability | Reject fields without page references |
| Human review threshold | Reduces risk on material terms | Route high-value deals to analyst approval |
| PII/redaction filters | Protects sensitive client data | Mask personal data before logging |
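The citation requirement and human review threshold can be combined into a single gate before anything is written downstream. A sketch; the USD 100M threshold and the shape of the extraction object are illustrative assumptions:

```typescript
// Illustrative guardrail gate. The review threshold and the subset of
// fields checked here are assumptions, not firm desk policy.
interface GuardrailResult {
  passed: boolean;
  requiresHumanReview: boolean;
  issues: string[];
}

const REVIEW_THRESHOLD_USD = 100_000_000;

function applyGuardrails(extraction: {
  enterpriseValue?: number;
  citations: { field: string; page: number }[];
}): GuardrailResult {
  const issues: string[] = [];
  const citedFields = new Set(extraction.citations.map((c) => c.field));

  // Citation requirement: a stated enterprise value must trace to a page.
  if (extraction.enterpriseValue !== undefined && !citedFields.has("enterpriseValue")) {
    issues.push("enterpriseValue has no citation");
  }

  // Human review threshold: large deals always get analyst sign-off.
  const requiresHumanReview =
    (extraction.enterpriseValue ?? 0) >= REVIEW_THRESHOLD_USD;

  return { passed: issues.length === 0, requiresHumanReview, issues };
}
```

Running the gate after schema validation keeps the two controls separately testable: a deal can pass the schema but still be blocked for a missing citation or routed to an analyst.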

Common Pitfalls

  • Using one prompt for everything

Bad pattern. Extraction of deal terms is not the same as risk review. Split extraction from validation so failures are easier to isolate.

  • Ignoring page-level provenance

If you cannot show where enterpriseValue came from on page X of the CIM or term sheet, your output is not production-grade. Always store citations alongside structured fields.

  • Letting free-form output into downstream systems

Never write raw LLM text directly into CRM or deal trackers. Parse it into a typed schema first and reject anything that fails validation.

A good investment banking extraction agent is boring in the right ways. It is deterministic at the boundaries, auditable end-to-end, and strict about what it will accept as truth.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
