How to Build a Document Extraction Agent Using CrewAI in TypeScript for Lending
A document extraction agent for lending reads borrower documents, pulls out the fields your underwriting system needs, and turns unstructured PDFs into structured loan data. It matters because lending ops teams spend a lot of time chasing pay stubs, bank statements, IDs, tax returns, and employment letters; every manual touch slows approval times and increases error rates.
Architecture
- Document intake layer
  - Accepts PDFs, images, and email attachments from your LOS or case management system.
  - Stores raw files in a controlled bucket with retention rules.
- OCR and text normalization
  - Converts scanned documents into text.
  - Handles rotation, low-quality scans, and multi-page statements.
- Extraction agent
  - Uses a CrewAI Agent to identify document type and extract fields like income, employer name, account balances, and dates.
  - Produces structured JSON that downstream systems can validate.
- Validation and policy checks
  - Applies lending rules like minimum income evidence, statement recency, and ID expiration checks.
  - Flags missing or inconsistent data for human review.
- Audit trail service
  - Records source document hashes, extracted fields, confidence scores, and reviewer overrides (see the sketch after this list).
  - Supports compliance reviews and adverse action traceability.
- Routing layer
  - Sends clean outputs to underwriting APIs.
  - Sends low-confidence cases to a human queue.
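The audit trail service is worth sketching early, because it dictates what every other layer must emit. Here is a minimal record shape, assuming SHA-256 hashing with Node's built-in crypto module; the interface and computeDocHash helper are illustrative names, not part of CrewAI:

import { createHash } from 'node:crypto';

// Illustrative audit record; adjust the fields to your compliance requirements.
interface ExtractionAuditRecord {
  documentHash: string;              // SHA-256 of the raw file
  documentType: string;              // e.g. 'paystub'
  extractedFields: Record<string, unknown>;
  confidence: number;                // 0..1 from the extraction agent
  modelVersion: string;              // model that produced the extraction
  promptVersion: string;             // prompt/template version used
  extractedAt: string;               // ISO timestamp
  reviewerOverride?: {
    reviewerId: string;
    changedFields: string[];
    reviewedAt: string;
  };
}

// Hash the raw bytes so every extracted field can be traced back to the exact source file.
function computeDocHash(fileBuffer: Buffer): string {
  return createHash('sha256').update(fileBuffer).digest('hex');
}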
Implementation
1) Install dependencies and set up the project
Use the TypeScript CrewAI package plus a PDF/text extraction utility. In lending systems I keep OCR outside the agent so the agent receives normalized text instead of raw files.
npm install @crew-ai/crewai zod dotenv
npm install pdf-parse
npm install -D typescript ts-node @types/node
Create a simple environment file:
OPENAI_API_KEY=your_key
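If you are starting from an empty folder, a minimal tsconfig like the one below works with ts-node and the CommonJS pdf-parse package. These settings are just a typical starting point, not something CrewAI requires:

{
  "compilerOptions": {
    "target": "ES2022",
    "module": "CommonJS",
    "moduleResolution": "node",
    "esModuleInterop": true,
    "strict": true,
    "skipLibCheck": true
  },
  "include": ["*.ts"]
}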
2) Define the extraction schema and the CrewAI agents
The key pattern is: one agent extracts, another validates. That keeps your underwriting logic separate from your parsing logic.
import 'dotenv/config';
import fs from 'node:fs';
import pdf from 'pdf-parse';
import { z } from 'zod';
import { Agent, Task, Crew } from '@crew-ai/crewai';
const LoanDocSchema = z.object({
  documentType: z.enum(['paystub', 'bank_statement', 'w2', 'tax_return', 'id']),
  borrowerName: z.string().optional(),
  employerName: z.string().optional(),
  statementDate: z.string().optional(),
  grossIncome: z.number().optional(),
  netIncome: z.number().optional(),
  accountBalance: z.number().optional(),
  idNumberMasked: z.string().optional(),
  confidence: z.number().min(0).max(1),
  flags: z.array(z.string()).default([]),
});

type LoanDoc = z.infer<typeof LoanDocSchema>;

const extractor = new Agent({
  role: 'Loan Document Extractor',
  goal: 'Extract structured lending fields from borrower documents with high accuracy',
  backstory:
    'You process income, identity, and asset documents for lending workflows. You return only fields supported by the source text.',
});

const validator = new Agent({
  role: 'Lending Policy Validator',
  goal: 'Check extracted data against lending rules and mark exceptions',
});
3) Build tasks that extract first, then validate
This is the part most teams get wrong. They ask one model to both parse the document and decide policy outcomes; that makes debugging painful. Split those responsibilities.
async function readPdfText(filePath: string): Promise<string> {
  const buffer = fs.readFileSync(filePath);
  const data = await pdf(buffer);
  return data.text;
}
async function runExtraction(filePath: string): Promise<LoanDoc> {
  const text = await readPdfText(filePath);

  const extractionTask = new Task({
    description: `
      Extract lending-relevant fields from this document text.
      Return valid JSON with these fields:
      - documentType: one of paystub, bank_statement, w2, tax_return, id
      - borrowerName, employerName, statementDate: optional strings
      - grossIncome, netIncome, accountBalance: optional numbers
      - idNumberMasked: optional masked string
      - confidence: number between 0 and 1
      - flags: array of strings

      Document text:
      ${text}
    `,
    expectedOutput: 'Structured JSON with documentType and confidence',
    agent: extractor,
    asyncExecution: false,
  });

  const validationTask = new Task({
    description: `
      Validate the extracted JSON for lending use.
      Rules:
      - paystubs must include either grossIncome or netIncome
      - bank statements must include accountBalance or transaction evidence
      - IDs must have masked idNumberMasked only; never full ID numbers
      - confidence below 0.8 must be flagged for human review
    `,
    expectedOutput: 'Validated JSON with flags',
    agent: validator,
    context: [extractionTask],
    asyncExecution: false,
  });

  const crew = new Crew({
    agents: [extractor, validator],
    tasks: [extractionTask, validationTask],
    verbose: true,
    process: 'sequential',
  });

  const result = await crew.kickoff();
  const parsed = JSON.parse(String(result));
  return LoanDocSchema.parse(parsed);
}
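One caveat before wiring this into an entry point: the JSON.parse call assumes the crew returns bare JSON, but model output often arrives wrapped in markdown fences or with extra commentary. A small tolerant parser makes the Zod validation step less brittle; the extractJson helper below is a hypothetical addition, not a CrewAI API:

function extractJson(raw: string): unknown {
  // Strip markdown code fences the model may have wrapped around the JSON.
  const cleaned = raw.replace(/```(?:json)?/g, '').trim();
  // Fall back to the first {...} span if prose surrounds the payload.
  const start = cleaned.indexOf('{');
  const end = cleaned.lastIndexOf('}');
  if (start === -1 || end === -1) {
    throw new Error('No JSON object found in crew output');
  }
  return JSON.parse(cleaned.slice(start, end + 1));
}

// Drop-in replacement for the JSON.parse line in runExtraction:
// const parsed = extractJson(String(result));
// return LoanDocSchema.parse(parsed);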
4) Add a lending-safe entry point
In production you need guardrails before anything reaches underwriting. Reject unsupported file types early, redact sensitive values where required, and persist an audit record.
async function main() {
  const filePath = process.argv[2];
  if (!filePath) throw new Error('Usage: ts-node app.ts <pdf-file>');

  const doc = await runExtraction(filePath);

  if (doc.confidence < 0.8 || doc.flags.length > 0) {
    console.log(JSON.stringify({
      status: 'review_required',
      documentType: doc.documentType,
      flags: doc.flags,
      confidence: doc.confidence,
    }, null, 2));
    return;
  }

  console.log(JSON.stringify({
    status: 'approved_for_underwriting',
    data: doc,
  }, null, 2));
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
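The main() above only gates on confidence and flags. Here is a minimal sketch of the "reject unsupported file types early" guardrail mentioned in step 4; the allow-list and the assertSupportedFile name are assumptions, so tune them to your intake channel:

import path from 'node:path'; // fs is already imported at the top of the file

const ALLOWED_EXTENSIONS = new Set(['.pdf']); // widen once image OCR is wired in

function assertSupportedFile(filePath: string): void {
  const ext = path.extname(filePath).toLowerCase();
  if (!ALLOWED_EXTENSIONS.has(ext)) {
    throw new Error(`Unsupported file type: ${ext || 'none'}`);
  }
  if (!fs.existsSync(filePath)) {
    throw new Error(`File not found: ${filePath}`);
  }
}

// Call it before extraction in main():
// assertSupportedFile(filePath);
// const doc = await runExtraction(filePath);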
Production Considerations
- Data residency
  - Keep raw borrower documents in-region if your lending program has jurisdictional storage requirements.
  - If you use hosted models, verify where prompts and outputs are processed and retained.
- Auditability
  - Store document hashes, extraction timestamps, model version, prompt version, confidence score, and reviewer actions.
- Compliance controls
  - Redact SSNs, full account numbers, and full ID numbers before logging (a masking sketch follows this list).
- Operational monitoring
  - Track extraction accuracy by document type.
  - Alert on spikes in low-confidence extractions or fallback-to-human rates.
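For the redaction point under compliance controls, a small masking helper applied to anything bound for logs keeps raw identifiers out of your observability stack. The patterns below are illustrative, not exhaustive:

// Illustrative patterns; extend them for your document mix and locale.
const SSN_PATTERN = /\b\d{3}-?\d{2}-?\d{4}\b/g;
const ACCOUNT_PATTERN = /\b\d{8,17}\b/g;

function redactForLogging(value: string): string {
  return value
    .replace(SSN_PATTERN, '***-**-****')
    .replace(ACCOUNT_PATTERN, (match) => `****${match.slice(-4)}`);
}

// redactForLogging('SSN 123-45-6789, account 000123456789')
// -> 'SSN ***-**-****, account ****6789'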
Common Pitfalls
- Trying to do OCR inside the agent
  Don’t feed raw scans directly to the LLM unless you have no choice. Use OCR first so the agent works on normalized text and your results are reproducible.
- Mixing extraction with underwriting decisions
  Keep field extraction separate from credit policy decisions. The agent should say “this paystub shows gross income of X,” not “approve this loan.”
- Logging sensitive payloads
  Never dump full documents or unmasked identifiers into application logs. Use structured logs with redaction and keep full payloads in restricted storage only.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.