How to Build a Document Extraction Agent Using CrewAI in TypeScript for Investment Banking
A document extraction agent for investment banking reads deal documents, pulls out structured fields, and hands them to downstream systems with traceability. That matters because bankers spend too much time copying data from CIMs, term sheets, pitch decks, and credit memos into models and trackers, and mistakes here become compliance issues, pricing errors, or bad IC materials.
## Architecture
Build this agent as a pipeline, not a single prompt.
- **Document intake layer**
  - Accept PDFs, DOCX files, and scanned images from controlled sources like SharePoint, S3, or an internal DMS.
  - Enforce file provenance and document IDs before any extraction starts.
- **Text extraction layer**
  - Use OCR for scanned pages and native text extraction for digital documents.
  - Preserve page numbers, section headings, and table boundaries because bankers need citations.
- **CrewAI agent layer**
  - Use a focused extraction agent with a strict schema.
  - Add a validation agent that checks completeness, confidence, and policy violations.
- **Normalization layer**
  - Map extracted values into a canonical deal schema.
  - Normalize currencies, dates, company names, covenant ratios, and jurisdiction fields.
- **Audit and storage layer**
  - Store raw text, extracted JSON, confidence scores, and source page references.
  - Keep immutable logs for review by compliance or operations.
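The layers above can be sketched as typed stages, each a function from the previous stage's output to the next. This is a minimal illustration; the interface and field names (`IntakeRecord`, `PageText`, `ExtractionResult`) are assumptions for this sketch, not part of any SDK.

```typescript
// Illustrative types for the pipeline stages; names are assumptions, not SDK types.
interface IntakeRecord {
  documentId: string;   // enforced before extraction starts
  sourceSystem: string; // e.g. "sharepoint", "s3"
  checksum: string;     // provenance: hash of the raw file
}

interface PageText {
  page: number;
  heading?: string; // preserved so citations can reference sections
  text: string;
}

interface ExtractionResult {
  documentId: string;
  fields: Record<string, unknown>; // later validated against the Zod schema
  confidence: number;              // 0..1, drives human-review routing
  citations: { field: string; page: number; quote: string }[];
}

// Each layer is a plain function, which keeps every stage testable in isolation.
type TextExtractor = (doc: IntakeRecord, raw: Uint8Array) => PageText[];
type Extractor = (doc: IntakeRecord, pages: PageText[]) => Promise<ExtractionResult>;
```

Keeping the stages as separate functions means a failure in OCR, extraction, or validation can be isolated and replayed independently.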
## Implementation
1) Install the TypeScript stack
Use CrewAI’s TypeScript SDK with an LLM provider and a document parser/OCR library. The exact parser depends on your input format; the agent code below assumes you already have extracted text plus metadata.
```shell
npm install @crewai/crewai zod dotenv
```
Set your environment variables:
```
CREWAI_API_KEY=...
OPENAI_API_KEY=...
```
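It is worth failing fast at startup if a key is missing, rather than letting a crew run die mid-pipeline. A minimal sketch in plain Node/TypeScript (the helper name `requireEnv` is illustrative):

```typescript
// Fail fast on missing configuration instead of failing mid-extraction.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Validate at startup, before any document is accepted:
// const crewaiKey = requireEnv("CREWAI_API_KEY");
// const openaiKey = requireEnv("OPENAI_API_KEY");
```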
2) Define the extraction schema
For investment banking, do not ask for free-form output. Force the model into a typed structure so downstream systems can validate it.
```typescript
import { z } from "zod";

export const DealExtractionSchema = z.object({
  dealName: z.string(),
  targetCompany: z.string(),
  sponsor: z.string().optional(),
  transactionType: z.enum(["M&A", "LBO", "IPO", "Debt Financing", "Refinancing", "Other"]),
  currency: z.string(),
  enterpriseValue: z.number().optional(),
  equityValue: z.number().optional(),
  closingDate: z.string().optional(),
  jurisdictions: z.array(z.string()).default([]),
  keyTerms: z
    .array(
      z.object({
        label: z.string(),
        value: z.string(),
        sourcePage: z.number().optional(),
      })
    )
    .default([]),
  citations: z
    .array(
      z.object({
        field: z.string(),
        page: z.number(),
        quote: z.string(),
      })
    )
    .default([]),
});

export type DealExtraction = z.infer<typeof DealExtractionSchema>;
```
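The normalization layer described in the architecture can start small. The helper below is a hedged sketch, not a production parser: the regex and scale table are illustrative and far from exhaustive, but they show the shape of turning a string like "USD $420 million" into a currency code plus numeric amount.

```typescript
// Illustrative normalizer for amounts like "USD $420 million" or "EUR 1.5bn".
// Real documents need a much richer rule set; this only shows the shape.
const SCALE: Record<string, number> = {
  thousand: 1e3, k: 1e3,
  million: 1e6, mm: 1e6, m: 1e6,
  billion: 1e9, bn: 1e9, b: 1e9,
};

function normalizeAmount(raw: string): { currency: string; amount: number } | null {
  const match = raw.match(
    /([A-Z]{3})?\s*\$?\s*([\d,.]+)\s*(thousand|million|billion|mm|bn|k|m|b)?/i
  );
  if (!match || !match[2]) return null;
  const currency = (match[1] ?? "USD").toUpperCase(); // default is an assumption
  const base = parseFloat(match[2].replace(/,/g, ""));
  const scale = match[3] ? SCALE[match[3].toLowerCase()] ?? 1 : 1;
  return { currency, amount: base * scale };
}
```

Running the sample text from later in this article through it, `normalizeAmount("USD $420 million")` yields a canonical `{ currency: "USD", amount: 420000000 }` that downstream models can consume without re-parsing prose.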
3) Build the CrewAI agents and task
Use one agent to extract facts and another to verify them. In banking workflows this separation matters because it creates a reviewable control point.
```typescript
import "dotenv/config";
import { Agent, Task, Crew } from "@crewai/crewai";

const extractor = new Agent({
  role: "Investment Banking Document Extractor",
  goal:
    "Extract structured deal data from banking documents with page-level citations and no invented values.",
  backstory:
    "You work on M&A and financing teams. You only return facts supported by the source text.",
});

const verifier = new Agent({
  role: "Investment Banking Data Validator",
  goal:
    "Check extracted deal data for completeness, consistency, and compliance issues.",
  backstory:
    "You validate against internal controls. You flag missing citations, ambiguous terms, and policy risks.",
});

const extractTask = new Task({
  description: `
    Extract the following fields from the provided investment banking document:
    dealName, targetCompany, sponsor, transactionType,
    currency, enterpriseValue, equityValue, closingDate,
    jurisdictions, keyTerms, citations.

    Rules:
    - Every non-obvious field must have a citation.
    - Use only information present in the document.
    - If a field is missing, return null or an empty array.
    - Preserve page numbers for every citation.
  `,
  expectedOutput: "Valid JSON matching the DealExtractionSchema.",
  agent: extractor,
});

const validateTask = new Task({
  description:
    "Review the extracted JSON for completeness and control issues. Return only corrections or flags.",
  expectedOutput:
    "A list of validation issues with severity and suggested fix.",
  agent: verifier,
});
```
4) Run the crew and enforce schema validation
This is the pattern you want in production: generate output from the crew, parse it through Zod, then reject anything that does not pass.
```typescript
// DealExtractionSchema comes from step 2; adjust the import path to your layout.
import { DealExtractionSchema } from "./schema";

async function main() {
  const crew = new Crew({
    agents: [extractor, verifier],
    tasks: [extractTask, validateTask],
    verbose: true,
  });

  const result = await crew.kickoff({
    inputs: {
      documentText: `
        Page 1
        Confidential Information Memorandum
        Target Company: Northbridge Logistics Ltd.
        Transaction Type: Sale process
        Enterprise Value: USD $420 million
        Expected Closing Date: Q4 2025

        Page 2
        Key Terms include change-of-control consent and minimum liquidity covenant.
      `,
      documentId: "doc_2025_001",
      sourceSystem: "sharepoint",
      region: "us-east-1",
    },
  });

  // Reject anything that does not match the contract before it reaches storage.
  const parsed = DealExtractionSchema.safeParse(JSON.parse(String(result)));
  if (!parsed.success) {
    throw new Error(`Extraction failed validation: ${parsed.error.message}`);
  }
  console.log(parsed.data);
}

main().catch(console.error);
```
In practice you will wrap JSON.parse in a stricter response handler because LLMs sometimes emit extra text. The important part is that CrewAI handles orchestration while Zod enforces contract validity before anything hits your deal database.
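One common approach to that stricter handler (a sketch, not a CrewAI API) is to slice out the first complete JSON object before parsing, so surrounding prose or markdown fences do not break `JSON.parse`:

```typescript
// Extract the first complete JSON object from LLM output that may contain
// extra prose or markdown fences. A sketch; production code should also cap
// input size and log the raw text for audit.
function extractJson(raw: string): unknown {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in model output");
  }
  return JSON.parse(raw.slice(start, end + 1));
}
```

Feed the result of `extractJson` into `DealExtractionSchema.safeParse` instead of calling `JSON.parse(String(result))` directly.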
## Production Considerations
- **Compliance controls**
  - Log every input document ID, user requestor ID, model version, prompt version, and output hash.
  - Keep an audit trail that legal/compliance can replay during reviews or disputes.
- **Data residency**
  - Pin processing to approved regions, such as us-east-1 or an EU region, depending on desk policy.
  - Do not send client-confidential materials across borders without explicit approval.
- **Monitoring**
  - Track extraction accuracy by field type; EV/date/company name errors are not equal.
  - Alert on low-confidence outputs, missing citations, or sudden drift after model upgrades.

**Guardrails**
| Control | Why it matters | Implementation |
|---|---|---|
| Schema validation | Prevents malformed JSON entering downstream systems | Zod or equivalent runtime validation |
| Citation requirement | Supports auditability | Reject fields without page references |
| Human review threshold | Reduces risk on material terms | Route high-value deals to analyst approval |
| PII/redaction filters | Protects sensitive client data | Mask personal data before logging |
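The citation requirement from the table can be enforced mechanically before anything is written downstream. A minimal sketch; the field names mirror the schema defined earlier, and the list of "material" fields is an illustrative policy choice, not a standard:

```typescript
// Reject extractions where material fields lack a page-level citation.
// MATERIAL_FIELDS is an illustrative policy choice; tune it per desk.
interface Citation {
  field: string;
  page: number;
  quote: string;
}

const MATERIAL_FIELDS = ["enterpriseValue", "equityValue", "closingDate"];

function missingCitations(
  extracted: Record<string, unknown>,
  citations: Citation[]
): string[] {
  const cited = new Set(citations.map((c) => c.field));
  // A field is a problem only if it was extracted but never cited.
  return MATERIAL_FIELDS.filter(
    (field) => extracted[field] != null && !cited.has(field)
  );
}
```

Anything returned by `missingCitations` should be routed to human review rather than written to the deal database.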
## Common Pitfalls

- **Using one prompt for everything.** Bad pattern. Extraction of deal terms is not the same as risk review. Split extraction from validation so failures are easier to isolate.
- **Ignoring page-level provenance.** If you cannot show where `enterpriseValue` came from on page X of the CIM or term sheet, your output is not production-grade. Always store citations alongside structured fields.
- **Letting free-form output into downstream systems.** Never write raw LLM text directly into a CRM or deal tracker. Parse it into a typed schema first and reject anything that fails validation.
A good investment banking extraction agent is boring in the right ways. It is deterministic at the boundaries, auditable end-to-end, and strict about what it will accept as truth.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.