How to Build a Document Extraction Agent Using CrewAI in TypeScript for Banking
A document extraction agent for banking reads incoming PDFs, scans, statements, KYC forms, and loan packets, then turns them into structured data your downstream systems can trust. It matters because banks lose time and introduce risk when humans manually key in data that should be extracted, validated, and audited automatically.
Architecture
- Document ingestion layer
  - Accepts PDFs, images, or text files from S3, SharePoint, SFTP, or internal case-management systems.
  - Normalizes file metadata like customer ID, document type, and source system.
- Extraction agent
  - Uses a CrewAI `Agent` with a strict role: extract fields only, do not infer missing values.
  - Produces structured JSON for account numbers, names, addresses, dates, amounts, and signatures.
- Task orchestration
  - Uses CrewAI `Task` objects to split work into classification, extraction, validation, and escalation.
  - Keeps each step auditable and deterministic.
- Validation layer
  - Checks extracted values against banking rules: IBAN format, routing number length, date consistency, currency normalization.
  - Flags low-confidence fields for human review.
- Audit and persistence
  - Stores raw document hashes, extracted output, model version, prompt version, and reviewer actions.
  - Supports regulatory review and internal controls.
- Security boundary
  - Redacts PII where possible.
  - Enforces data residency and tenant isolation before any model call.
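The validation layer's banking rules are deterministic and belong outside the model. As a sketch, here is what two of those checks could look like; the helper names are illustrative, not CrewAI APIs. ABA routing numbers use a 3-7-1 weighted checksum, and IBANs use the standard mod-97 check-digit test.

```typescript
// Deterministic format checks kept outside the model (illustrative helpers).

// ABA routing numbers are 9 digits whose 3-7-1 weighted sum is divisible by 10.
function isValidRoutingNumber(rn: string): boolean {
  if (!/^\d{9}$/.test(rn)) return false;
  const weights = [3, 7, 1, 3, 7, 1, 3, 7, 1];
  const sum = [...rn].reduce((acc, d, i) => acc + Number(d) * weights[i], 0);
  return sum % 10 === 0;
}

// IBAN: shape test, then the mod-97 test (move the first four characters to the
// end, map A-Z to 10-35, and the remainder mod 97 must equal 1).
function isValidIban(iban: string): boolean {
  const s = iban.replace(/\s+/g, "").toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{1,30}$/.test(s)) return false;
  const rearranged = s.slice(4) + s.slice(0, 4);
  let rem = 0;
  for (const ch of rearranged) {
    const digits = ch >= "A" && ch <= "Z" ? String(ch.charCodeAt(0) - 55) : ch;
    for (const d of digits) rem = (rem * 10 + Number(d)) % 97;
  }
  return rem === 1;
}
```

Running these before any model call means a bad routing number fails fast, regardless of how confident the extraction agent sounded.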
Implementation
1) Install the TypeScript dependencies
Use a CrewAI JS/TS package plus a schema validator. For banking workflows, keep validation outside the model.
npm install crewai zod dotenv
npm install -D typescript tsx @types/node
Set your environment variables:
CREWAI_API_KEY=your_key_here
OPENAI_API_KEY=your_llm_key_here
2) Define the extraction schema and agents
The key pattern is: one agent extracts, another validates. Do not ask one model call to do both if you need auditability.
import "dotenv/config";
import { z } from "zod";
import { Agent, Task, Crew } from "crewai";
const BankDocumentSchema = z.object({
documentType: z.enum(["bank_statement", "kyc_form", "loan_application", "utility_bill"]),
customerName: z.string().optional(),
accountNumber: z.string().optional(),
iban: z.string().optional(),
routingNumber: z.string().optional(),
address: z.string().optional(),
statementDate: z.string().optional(),
totalAmount: z.number().optional(),
currency: z.string().optional(),
confidenceNotes: z.array(z.string()).default([]),
});
type BankDocument = z.infer<typeof BankDocumentSchema>;
const extractor = new Agent({
role: "Bank Document Extraction Specialist",
goal: "Extract structured banking fields from documents without guessing missing values.",
backstory:
"You extract only explicitly present fields from bank documents. You never invent data.",
});
const validator = new Agent({
role: "Bank Data Validator",
goal: "Validate extracted fields against banking rules and flag issues for human review.",
backstory:
"You check format correctness, completeness, and compliance risks. You do not modify source facts.",
});
3) Create tasks for extraction and validation
CrewAI’s task boundary is useful in regulated workflows because it gives you a clear trail of what was asked at each step.
const extractTask = new Task({
description: `
Extract the following from the provided bank document text:
- documentType
- customerName
- accountNumber
- iban
- routingNumber
- address
- statementDate
- totalAmount
- currency
Rules:
- Only extract values explicitly present in the text.
- If a field is missing or unclear, omit it.
- Return JSON only.
`,
expectedOutput: "A JSON object matching the bank document schema.",
agent: extractor,
});
const validateTask = new Task({
description: `
Validate the extracted JSON:
- Check required banking formats where present.
- Flag suspicious or incomplete fields.
- Add notes for manual review if confidence is low or data conflicts exist.
Return JSON only.
`,
expectedOutput: "Validated JSON with confidence notes.",
agent: validator,
});
4) Run the crew and validate output before persistence
This is the production pattern I use most often: model output is treated as untrusted input until it passes schema validation.
async function runExtraction(documentText: string): Promise<BankDocument> {
const crew = new Crew({
agents: [extractor, validator],
tasks: [extractTask, validateTask],
verbose: true,
process: "sequential",
});
const result = await crew.kickoff({
inputs: {
documentText,
jurisdiction: "US",
dataResidency: "us-east-1",
},
});
const parsed = typeof result === "string" ? JSON.parse(result) : result;
return BankDocumentSchema.parse(parsed);
}
const sampleDoc = `
Customer Name: Jane Doe
Account Number: 123456789
Routing Number: 021000021
Statement Date: 2024-10-31
Currency: USD
Total Amount Due: $1,245.88
`;
runExtraction(sampleDoc)
.then((data) => {
console.log("Validated extraction:", data);
})
.catch((err) => {
console.error("Extraction failed validation:", err);
});
Production Considerations
- Deployment
  - Keep the agent behind a private service boundary in your VPC.
  - Route documents to region-specific workers to satisfy data residency requirements.
  - Do not send raw documents across regions just to simplify ops.
- Monitoring
  - Log prompt version, model version, task IDs, confidence notes, and final reviewer decisions.
  - Track field-level accuracy by document type.
  - Alert on spikes in missing account numbers or invalid IBANs; that usually means template drift or OCR failure.
- Guardrails
  - Block free-form generation by forcing JSON-only outputs and schema validation.
  - Redact SSNs, PANs, and passport numbers before storage if they are not needed downstream.
  - Use human-in-the-loop review for low-confidence extractions or any KYC-related mismatch.
- Compliance
  - Store immutable audit records for every extraction run.
  - Retain source document hashes so investigators can prove what was processed.
  - Make sure your retention policy matches AML/KYC obligations and local privacy law.
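The audit and monitoring points above can be sketched as a record builder. The `AuditRecord` shape and its field names are illustrative assumptions, not a CrewAI or regulatory standard; the SHA-256 hash is what lets investigators later prove exactly which document was processed.

```typescript
import { createHash } from "node:crypto";

// Hypothetical audit-record shape; field names are illustrative.
interface AuditRecord {
  documentSha256: string;
  modelVersion: string;
  promptVersion: string;
  taskId: string;
  extractedAt: string; // ISO-8601 timestamp
}

// Hash the raw document and attach run metadata before persisting.
function buildAuditRecord(
  rawDocument: string | Buffer,
  meta: { modelVersion: string; promptVersion: string; taskId: string }
): AuditRecord {
  return {
    documentSha256: createHash("sha256").update(rawDocument).digest("hex"),
    extractedAt: new Date().toISOString(),
    ...meta,
  };
}
```

Persist this record alongside the extracted output in append-only storage so the trail cannot be edited after the fact.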
Common Pitfalls
- Letting the model infer missing values
  - If an account number is partially visible or a date is ambiguous, do not guess.
  - Fix this by enforcing "extract only what is explicit" in the task prompt and rejecting inferred fields in Zod validation.
- Skipping schema validation
  - Raw LLM output will eventually break your downstream pipeline.
  - Always parse into a strict TypeScript schema before saving anything to your case system or core banking integration.
- Ignoring auditability
  - Banks need traceability on who processed what and when.
  - Persist the original input hash, task outputs, reviewer actions, and model metadata alongside the final extracted record.
- Mixing jurisdictions without controls
  - A document uploaded in one region should not silently get processed elsewhere.
  - Pin execution to approved regions and keep residency policy checks outside the agent itself.
Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist and starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.