How to Build a Document Extraction Agent Using CrewAI in TypeScript for Pension Funds
A document extraction agent for pension funds reads statements, contribution schedules, beneficiary forms, trustee packs, and regulatory filings, then turns them into structured data your downstream systems can trust. It matters because pension operations are document-heavy, highly regulated, and expensive to process manually; the agent reduces turnaround time while keeping an audit trail for compliance and review.
Architecture
- Document intake layer
  - Pulls files from S3, SharePoint, SFTP, or an internal DMS.
  - Normalizes PDFs, scans, DOCX, and email attachments into a single input format.
- OCR and text extraction
  - Uses OCR for scanned pension documents.
  - Preserves page numbers and bounding context for traceability.
- CrewAI agent layer
  - One agent extracts fields.
  - A second agent validates against pension-specific rules like member ID format, contribution totals, and date ranges.
- Structured output schema
  - Converts extracted content into JSON with fixed fields such as member name, scheme number, employer contribution, employee contribution, effective date, and document type.
- Human review queue
  - Routes low-confidence or policy-sensitive documents to operations staff.
  - Keeps exceptions out of straight-through processing.
- Audit and storage
  - Stores source document hash, extraction result, validation notes, model version, and reviewer actions.
  - Supports regulator-ready traceability.
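The layers above can be sketched as plain TypeScript contracts before any SDK is involved. Every name below is illustrative, not a CrewAI API; the point is that the routing decision between straight-through processing and human review is ordinary deterministic code:

```typescript
// Illustrative contracts for the pipeline layers described above.
// These names are assumptions for this sketch, not CrewAI types.

interface IntakeDocument {
  sourceUri: string; // S3 key, SharePoint path, SFTP path, etc.
  mimeType: string;  // normalized input format
  sha256: string;    // source hash, kept for the audit layer
}

interface ExtractedPage {
  pageNumber: number; // preserved for traceability
  text: string;
}

interface ExtractionResult {
  fields: Record<string, unknown>; // schema-shaped output
  confidence: number;              // 0..1, drives the review queue
  modelVersion: string;            // stored for audit
}

// Routing decision made after validation; threshold is a policy choice.
type Disposition = "straight_through" | "human_review";

function route(result: ExtractionResult, threshold = 0.85): Disposition {
  return result.confidence >= threshold ? "straight_through" : "human_review";
}

const sample: ExtractionResult = {
  fields: { schemeName: "Acme Pension Scheme" },
  confidence: 0.6,
  modelVersion: "gpt-4o-mini",
};
console.log(route(sample)); // "human_review" (confidence below threshold)
```

Keeping this decision out of the prompt means the threshold can be tuned, logged, and defended in an audit without re-testing the model.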
Implementation
1) Install CrewAI for TypeScript and set up the project
Use the TypeScript SDK and a parser for local file ingestion. Keep the extraction logic separate from transport so you can swap S3 or SharePoint later without rewriting the agent.
```bash
npm install @crewai/crewai zod dotenv
npm install pdf-parse
```
Create a minimal environment file with your model credentials and any internal routing settings.
```
OPENAI_API_KEY=your_key
```
2) Define the extraction schema for pension documents
For pension funds, schema design matters more than prompt wording. If you do not lock down fields up front, you will end up with inconsistent outputs that are hard to reconcile in operations.
```typescript
import { z } from "zod";

export const PensionExtractionSchema = z.object({
  documentType: z.enum([
    "member_statement",
    "contribution_schedule",
    "beneficiary_form",
    "trustee_pack",
    "regulatory_filing",
    "unknown",
  ]),
  schemeName: z.string().min(1),
  memberId: z.string().optional(),
  employerName: z.string().optional(),
  memberName: z.string().optional(),
  effectiveDate: z.string().optional(),
  currency: z.string().optional(),
  employeeContribution: z.number().optional(),
  employerContribution: z.number().optional(),
  totalContribution: z.number().optional(),
  confidence: z.number().min(0).max(1),
  notes: z.array(z.string()).default([]),
});
```
3) Build the CrewAI agent and task in TypeScript
This pattern uses a dedicated extractor agent plus a validation task. The extractor focuses on reading the document; the validator checks pension rules before anything lands in your core admin system.
```typescript
import "dotenv/config";
import fs from "node:fs/promises";
import pdfParse from "pdf-parse";
import { Agent, Task, Crew } from "@crewai/crewai";
import { PensionExtractionSchema } from "./schema";

async function loadPdfText(path: string) {
  const buffer = await fs.readFile(path);
  const parsed = await pdfParse(buffer);
  return parsed.text;
}

async function main() {
  const text = await loadPdfText("./input/pension-statement.pdf");

  const extractor = new Agent({
    role: "Pension Document Extractor",
    goal: "Extract structured fields from pension documents with high fidelity",
    backstory:
      "You work on regulated pension operations. Preserve exact values and flag ambiguity.",
    verbose: true,
    allowDelegation: false,
    memory: false,
    llm: "gpt-4o-mini",
  });

  const task = new Task({
    description: `
Extract structured data from this pension document.
Return only data that matches the schema.
Flag missing or ambiguous values in notes.

Document text:
${text}
`,
    expectedOutput:
      "A JSON object matching the pension extraction schema with confidence score.",
    agent: extractor,
    outputJsonSchema: PensionExtractionSchema,
  });

  const crew = new Crew({
    agents: [extractor],
    tasks: [task],
    verbose: true,
    process: "sequential",
  });

  const result = await crew.kickoff();
  console.log(JSON.stringify(result.raw ?? result.output ?? result, null, 2));
}

// Surface failures instead of leaving an unhandled promise rejection.
main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```
4) Add a validation pass for compliance controls
This is where pension-specific logic belongs. Do not rely on the model alone to decide whether a contribution schedule is valid or whether a beneficiary form is complete enough to process.
```typescript
import { z } from "zod";
import { PensionExtractionSchema } from "./schema";

type PensionRecord = z.infer<typeof PensionExtractionSchema>;

function validatePensionRecord(record: PensionRecord) {
  const issues: string[] = [];

  if (!record.schemeName) issues.push("Missing scheme name");
  if (record.confidence < 0.85) issues.push("Low extraction confidence");

  if (
    record.employeeContribution !== undefined &&
    record.employerContribution !== undefined &&
    record.totalContribution !== undefined
  ) {
    const sum = record.employeeContribution + record.employerContribution;
    // Allow a one-penny tolerance for rounding in source documents.
    if (Math.abs(sum - record.totalContribution) > 0.01) {
      issues.push("Contribution total does not match component values");
    }
  }

  return {
    approvedForStraightThroughProcessing: issues.length === 0,
    issues,
  };
}
```
In production, send rejected records to a human queue with the original PDF, extracted JSON, model version, timestamp, and validation errors attached.
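A minimal sketch of that hand-off, assuming a simple in-memory queue on your side of the fence (swap in SQS, a database table, or your case-management tool; none of these names are CrewAI APIs):

```typescript
import crypto from "node:crypto";

// Everything attached to a rejected record, per the audit requirements above.
interface ReviewItem {
  documentHash: string;      // sha256 of the original PDF bytes
  extracted: unknown;        // the raw JSON from the agent
  modelVersion: string;
  timestamp: string;
  validationErrors: string[];
}

// Hypothetical queue for illustration only.
const reviewQueue: ReviewItem[] = [];

function sendToReview(
  pdfBytes: Buffer,
  extracted: unknown,
  modelVersion: string,
  validationErrors: string[],
): ReviewItem {
  const item: ReviewItem = {
    documentHash: crypto.createHash("sha256").update(pdfBytes).digest("hex"),
    extracted,
    modelVersion,
    timestamp: new Date().toISOString(),
    validationErrors,
  };
  reviewQueue.push(item);
  return item;
}

const item = sendToReview(
  Buffer.from("fake pdf bytes"),
  { schemeName: "" },
  "gpt-4o-mini",
  ["Missing scheme name"],
);
console.log(item.documentHash.length); // 64 hex characters
```

Hashing the original bytes, not the extracted text, is what lets a reviewer or regulator later confirm which exact document produced a given decision.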
Production Considerations
- Data residency
  - Keep documents and model calls inside approved regions.
  - For pension funds handling member PII, avoid sending raw files across jurisdictions unless legal review has signed off.
- Auditability
  - Store document hashes, prompts or task descriptions used at runtime, output payloads, reviewer decisions, and model identifiers.
  - Regulators care about how you got an answer as much as the answer itself.
- Guardrails
  - Enforce schema validation before persistence.
  - Use deterministic post-processing for dates, currency normalization, and contribution math instead of asking the model to "be accurate."
- Monitoring
  - Track extraction accuracy by document type.
  - Watch for drift on recurring forms like annual statements or trustee packs after template changes.
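The deterministic post-processing mentioned under guardrails can look like this. The formats are assumptions (UK-style dates, pound amounts with thousands separators); adjust the patterns to your schemes:

```typescript
// Deterministic normalizers: no model involved, so results are reproducible.

// Normalize "1,234.50" or "£1,234.50" style amounts to a number.
function normalizeAmount(raw: string): number {
  const cleaned = raw.replace(/[£$€,\s]/g, "");
  const value = Number(cleaned);
  if (Number.isNaN(value)) throw new Error(`Unparseable amount: ${raw}`);
  return value;
}

// Normalize UK-style "31/01/2025" to ISO "2025-01-31".
function normalizeDateUk(raw: string): string {
  const m = raw.match(/^(\d{2})\/(\d{2})\/(\d{4})$/);
  if (!m) throw new Error(`Unparseable date: ${raw}`);
  return `${m[3]}-${m[2]}-${m[1]}`;
}

// Recompute the total instead of trusting the extracted one,
// rounding to pennies to avoid floating-point noise.
function contributionTotal(employee: number, employer: number): number {
  return Math.round((employee + employer) * 100) / 100;
}

console.log(normalizeAmount("£1,234.50"));     // 1234.5
console.log(normalizeDateUk("31/01/2025"));    // 2025-01-31
console.log(contributionTotal(123.45, 246.9)); // 370.35
```

Throwing on unparseable input is deliberate: a value the normalizer cannot handle should land in the review queue, not pass through half-cleaned.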
Common Pitfalls
- Treating OCR output as final truth. Scanned pension documents often produce broken lines and merged columns. Always preserve page context and keep a link back to source text so reviewers can verify suspicious fields quickly.
- Skipping validation because the model returned JSON. Valid JSON is not valid business data. Check required fields, numeric totals, date formats, and scheme-specific rules before writing into your admin platform.
- Ignoring exception handling for incomplete forms. Beneficiary nominations and transfer requests are often partially filled. Route missing signatures, mismatched IDs, or low-confidence extractions into manual review instead of auto-processing them.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.