How to Build a Document Extraction Agent Using AutoGen in TypeScript for Pension Funds
A document extraction agent for pension funds reads incoming statements, contribution schedules, transfer forms, member letters, and benefit documents, then turns them into structured data your downstream systems can trust. It matters because pension operations are full of repetitive, regulated paperwork, and every extraction error can turn into a compliance issue, a delayed benefit payment, or a bad member experience.
Architecture
Build this agent as a small pipeline, not a single prompt:
- Document ingress
  - Pull PDFs, scans, and emails from S3, SharePoint, or an internal DMS.
  - Keep the raw file immutable for audit.
- OCR / text normalization
  - Convert scanned pages to text before sending them to the model.
  - Preserve page numbers and bounding-box metadata where possible.
- Extraction agent
  - Use AutoGen to orchestrate a specialist assistant that extracts fields into a strict schema.
  - Keep the output deterministic with JSON-only responses.
- Validation layer
  - Validate extracted data against pension rules: member ID format, contribution totals, dates, fund codes (see the validation sketch after this list).
  - Reject or flag low-confidence records for human review.
- Audit store
  - Persist input hash, model output, validation results, and reviewer actions.
  - This is non-negotiable for pension compliance and dispute handling.
- Human escalation queue
  - Route ambiguous cases to operations staff.
  - Never auto-post high-risk changes like beneficiary updates without review.
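To make the validation layer concrete, here is a minimal sketch of the kind of rule checks I mean. It assumes the PensionExtraction type defined in step 1 below; the member ID pattern and the reconciliation tolerance are placeholders rather than real scheme rules, so swap in whatever your administrator actually mandates.

import { PensionExtraction } from "./schema";

// Placeholder format: replace with your scheme's real member ID rules.
const MEMBER_ID_PATTERN = /^[A-Z]{2}\d{8}$/;

export function validatePensionRules(record: PensionExtraction): string[] {
  const issues: string[] = [];

  // Member IDs must match the scheme's expected format.
  if (!MEMBER_ID_PATTERN.test(record.memberId)) {
    issues.push(`memberId "${record.memberId}" does not match the expected format`);
  }

  // Employer + employee contributions should reconcile with the stated total.
  if (
    record.totalContribution !== undefined &&
    record.employerContribution !== undefined &&
    record.employeeContribution !== undefined
  ) {
    const sum = record.employerContribution + record.employeeContribution;
    if (Math.abs(sum - record.totalContribution) > 0.01) {
      issues.push(`contributions sum to ${sum}, statement total is ${record.totalContribution}`);
    }
  }

  // Effective dates must parse and must not sit in the future.
  if (record.effectiveDate !== undefined) {
    const parsed = new Date(record.effectiveDate);
    if (Number.isNaN(parsed.getTime()) || parsed.getTime() > Date.now()) {
      issues.push(`effectiveDate "${record.effectiveDate}" is invalid or in the future`);
    }
  }

  return issues;
}

Anything this function returns goes onto the record's issues list and pushes it to the human review queue.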
Implementation
1) Install AutoGen and define your extraction schema
Use the AutoGen TypeScript package and Zod to enforce structure. Pension funds need strict outputs because “best effort” extraction is not acceptable when you are handling member money and regulated records.
npm install @autogenai/autogen zod
import { z } from "zod";

// Strict schema for every field the extractor is allowed to return.
export const PensionExtractionSchema = z.object({
  documentType: z.enum([
    "contribution_statement",
    "benefit_statement",
    "transfer_form",
    "member_letter",
    "claim_form"
  ]),
  memberId: z.string().min(1),
  schemeName: z.string().min(1),
  reportingPeriod: z.string().optional(),
  currency: z.string().length(3), // ISO 4217 code, e.g. "GBP"
  totalContribution: z.number().optional(),
  employerContribution: z.number().optional(),
  employeeContribution: z.number().optional(),
  effectiveDate: z.string().optional(),
  confidence: z.number().min(0).max(1), // model's self-reported confidence
  issues: z.array(z.string()).default([]) // anything the model flagged as unclear
});

export type PensionExtraction = z.infer<typeof PensionExtractionSchema>;
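Before wiring the schema into the agent, it is worth sanity-checking that it actually rejects bad records. The sample below is made-up data with an invalid document type and a two-letter currency code; safeParse should fail and name both fields.

import { PensionExtractionSchema } from "./schema";

// Deliberately broken sample: "payslip" is not an allowed documentType
// and "GB" is not a 3-letter currency code.
const suspect = {
  documentType: "payslip",
  memberId: "AB12345678",
  schemeName: "Example Master Trust",
  currency: "GB",
  confidence: 0.9
};

const result = PensionExtractionSchema.safeParse(suspect);
if (!result.success) {
  // Prints one line per failed field, e.g. "documentType: Invalid enum value ..."
  console.log(result.error.issues.map(i => `${i.path.join(".")}: ${i.message}`));
}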
2) Create an AutoGen assistant that only extracts structured data
The pattern here is simple: one assistant gets the document text and returns JSON. In production I keep this agent narrow on purpose; the more it reasons outside extraction, the more drift you get.
import { AssistantAgent } from "@autogenai/autogen";

export const extractor = new AssistantAgent({
  name: "pension_extractor",
  systemMessage: `
You extract pension document data into JSON only.
Return only valid JSON matching the requested schema.
If a field is missing or unclear, omit it rather than guessing.
Never invent values.
`
});
3) Run the extraction and validate before writing downstream
This is the core flow. You feed normalized text into extractor.generateReply(), parse the response, then validate with Zod before any CRM or pension admin update happens.
import { PensionExtractionSchema } from "./schema";
import { extractor } from "./agent";

export async function extractPensionDocument(documentText: string) {
  const prompt = `
Extract fields from this pension document and return JSON only.

Document:
${documentText}
`;

  const reply = await extractor.generateReply([
    { role: "user", content: prompt }
  ]);
  const raw = typeof reply === "string" ? reply : reply.content;

  // The model can still return malformed JSON; treat that as a review case
  // instead of letting the parse error take the document down.
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { status: "needs_review" as const, reason: "invalid_json", raw };
  }

  const validated = PensionExtractionSchema.safeParse(parsed);
  if (!validated.success) {
    return {
      status: "needs_review" as const,
      errors: validated.error.flatten(),
      raw
    };
  }

  // Anything below the confidence threshold goes to a human before posting.
  if (validated.data.confidence < 0.85) {
    return {
      status: "needs_review" as const,
      data: validated.data,
      reason: "low_confidence"
    };
  }

  return {
    status: "approved" as const,
    data: validated.data
  };
}
4) Add a second agent for verification on high-risk fields
For transfer forms and beneficiary changes, use a second AutoGen agent as a verifier. This gives you an explicit check before anything touches production records.
import { AssistantAgent } from "@autogenai/autogen";

// Narrow verifier: it only compares the extracted fields against the source text.
const verifier = new AssistantAgent({
  name: "pension_verifier",
  systemMessage: `
You verify extracted pension fields against source text.
Respond with JSON:
{
  "isConsistent": boolean,
  "findings": string[]
}
`
});

export async function verifyExtraction(sourceText: string, extractedJson: string) {
  const response = await verifier.generateReply([
    {
      role: "user",
      content: `Source text:\n${sourceText}\n\nExtracted JSON:\n${extractedJson}`
    }
  ]);
  const raw = typeof response === "string" ? response : response.content;
  return JSON.parse(raw);
}
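Putting steps 3 and 4 together, the flow for a high-risk document such as a transfer form looks roughly like this. The file paths and the processTransferForm wrapper are mine for illustration, not AutoGen APIs; route the outputs to whatever review queue and admin system you actually run.

import { extractPensionDocument } from "./extract";
import { verifyExtraction } from "./verify";

// Illustrative wrapper: anything not approved by both agents goes to a human.
export async function processTransferForm(documentText: string) {
  const extraction = await extractPensionDocument(documentText);

  // Schema failures, parse failures, and low confidence all stop here.
  if (extraction.status !== "approved") {
    return { route: "human_review", extraction };
  }

  // Transfer forms are high risk, so run the verifier before posting anything.
  const verification = await verifyExtraction(
    documentText,
    JSON.stringify(extraction.data)
  );

  if (!verification.isConsistent) {
    return { route: "human_review", extraction, findings: verification.findings };
  }

  return { route: "auto_post", extraction };
}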
Production Considerations
- Data residency
  - Keep OCR output, prompts, and model responses in-region if your pension administrator operates under local residency rules.
  - If documents contain national IDs or health-related retirement claims data, treat them as sensitive personal data.
- Auditability
  - Store document hash, model version, prompt version, extracted JSON, validation outcome, and reviewer ID (see the audit record sketch after this list).
  - Regulators will ask how a value was produced; “the model said so” is not an answer.
- Guardrails
  - Block auto-posting for beneficiary changes, bank detail changes, transfer instructions, and any record that affects benefit payments.
- Monitoring
  - Track extraction accuracy, low-confidence rates, and review-queue volume so you can spot drift after model or prompt changes.
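For the audit store, I keep one record per extraction attempt. The AuditRecord shape and buildAuditRecord helper below are a sketch, not a library API; persist them wherever you keep append-only compliance data.

import { createHash } from "node:crypto";

// Sketch of an audit record; adjust field names to your own store.
export interface AuditRecord {
  documentSha256: string;
  modelVersion: string;
  promptVersion: string;
  extractedJson: string;
  validationOutcome: "approved" | "needs_review" | "rejected";
  reviewerId?: string;
  createdAt: string;
}

export function buildAuditRecord(
  rawDocument: Buffer,
  extractedJson: string,
  validationOutcome: AuditRecord["validationOutcome"],
  modelVersion: string,
  promptVersion: string,
  reviewerId?: string
): AuditRecord {
  return {
    // Hash the immutable raw file so a regulator can tie this record back to it.
    documentSha256: createHash("sha256").update(rawDocument).digest("hex"),
    modelVersion,
    promptVersion,
    extractedJson,
    validationOutcome,
    reviewerId,
    createdAt: new Date().toISOString()
  };
}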
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit