How to Build a Document Extraction Agent for Pension Funds Using LangChain in TypeScript
A document extraction agent for pension funds reads member forms, transfer requests, benefit statements, contribution schedules, and trustee packs, then turns them into structured data your downstream systems can trust. It matters because pension operations are full of high-volume, high-stakes documents where errors create compliance risk, delayed member servicing, and bad financial decisions.
Architecture
- **Document ingestion layer**
  - Pull PDFs, scans, emails, and Office files from approved sources.
  - Enforce tenant isolation and data residency before any LLM call.
- **Text extraction and normalization**
  - Use OCR for scanned forms.
  - Normalize page order, headers, footers, and duplicated content.
- **Schema-driven extraction chain**
  - Define a strict output schema for fields like member name, NI number, scheme ID, contribution amount, effective date, and employer reference.
  - Force the model to return structured JSON only.
- **Validation and business rules engine**
  - Check extracted values against pension-specific rules.
  - Reject impossible dates, invalid contribution ranges, or mismatched scheme identifiers.
- **Audit trail store**
  - Persist the original document hash, extracted payload, model version, prompt version, and human override history.
  - This is non-negotiable for trustee oversight and regulatory review.
- **Human review queue**
  - Route low-confidence or policy-flagged extractions to operations staff.
  - Keep an explicit approval step for member-impacting changes.
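The ingestion and audit layers above can be sketched with a minimal record type plus a content-hash helper. The `AuditRecord` shape and field names below are illustrative assumptions, not a LangChain API:

```typescript
import { createHash } from "node:crypto";

// Illustrative audit record; the field names are assumptions for this sketch.
interface AuditRecord {
  documentSha256: string;   // hash of the raw bytes, for tamper evidence
  modelVersion: string;     // e.g. the exact model snapshot used
  promptVersion: string;    // version tag of the prompt template
  extractedPayload: unknown;
  reviewedBy?: string;      // set when a human approves or overrides
}

// Hash the raw document bytes so the audit trail can prove
// exactly which file produced a given extraction.
export function hashDocument(bytes: Buffer): string {
  return createHash("sha256").update(bytes).digest("hex");
}
```

Storing the hash rather than the document itself also keeps the audit store small and avoids duplicating sensitive member data.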
Implementation
1) Install the LangChain packages you actually need
For TypeScript, keep the stack small: document loaders, chat model wrapper, and schema validation. If you are processing PDFs or scans in production, pair this with OCR upstream rather than trying to make the LLM do image reading directly.
```bash
npm install langchain @langchain/openai zod
```
2) Define a strict pension extraction schema
Do not let the model invent fields. Pension operations need deterministic outputs that can be validated against scheme rules and audited later.
```typescript
import { z } from "zod";

export const PensionExtractionSchema = z.object({
  memberFullName: z.string().min(1),
  // Simplified NI number format check; full HMRC rules also exclude certain prefixes.
  nationalInsuranceNumber: z.string().regex(/^[A-CEGHJ-NPR-TW-Z]{2}\d{6}[A-D]$/i),
  schemeId: z.string().min(1),
  employerReference: z.string().optional(),
  documentType: z.enum([
    "transfer_request",
    "contribution_schedule",
    "benefit_statement",
    "member_change_form",
    "trustee_pack",
  ]),
  effectiveDate: z.string().datetime(),
  monetaryAmountGBP: z.number().nonnegative().optional(),
  confidenceNotes: z.array(z.string()).default([]),
});

export type PensionExtraction = z.infer<typeof PensionExtractionSchema>;
```
3) Build the extraction chain with ChatOpenAI and StructuredOutputParser
This pattern gives you structured output without hand-parsing free-form text. It is simple enough to maintain and strict enough for regulated workflows.
```typescript
import { ChatOpenAI } from "@langchain/openai";
import { StructuredOutputParser } from "langchain/output_parsers";
import { PromptTemplate } from "@langchain/core/prompts";
import { PensionExtractionSchema } from "./schema.js";

const parser = StructuredOutputParser.fromZodSchema(PensionExtractionSchema);

const prompt = PromptTemplate.fromTemplate(`
You are extracting structured data from a pension fund document.

Rules:
- Return only valid JSON matching the schema.
- Do not guess missing values.
- If a field is not present, omit it or use an empty array where required.
- Preserve exact identifiers such as scheme IDs and NI numbers.
- Flag uncertainty in confidenceNotes.

Schema:
{format_instructions}

Document text:
{text}
`);

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0, // deterministic output for regulated workflows
});

export async function extractPensionDocument(text: string) {
  const formattedPrompt = await prompt.format({
    text,
    format_instructions: parser.getFormatInstructions(),
  });
  const response = await llm.invoke(formattedPrompt);
  // parse() validates against the Zod schema and throws on malformed output.
  return parser.parse(response.content.toString());
}
```
4) Add post-processing rules before writing to your core system
This is where pension-fund-specific control lives. The model extracts; your code decides whether the record is safe to commit.
```typescript
import { PensionExtractionSchema } from "./schema.js";

export function validatePensionExtraction(raw: unknown) {
  const result = PensionExtractionSchema.safeParse(raw);
  if (!result.success) {
    return {
      status: "review_required" as const,
      reasons: result.error.issues.map((i) => `${i.path.join(".")}: ${i.message}`),
    };
  }

  const doc = result.data;
  const issues: string[] = [];

  if (doc.documentType === "contribution_schedule" && doc.monetaryAmountGBP === undefined) {
    issues.push("Contribution schedule missing monetary amount.");
  }
  if (doc.confidenceNotes.length > 0) {
    issues.push(...doc.confidenceNotes);
  }

  return issues.length > 0
    ? { status: "review_required" as const, reasons: issues }
    : { status: "approved" as const, data: doc };
}
```
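Some of the deterministic checks belong in plain code with no dependencies at all. A minimal sketch of contribution-level rules, where the amount ceiling and date policy are illustrative assumptions rather than real scheme policy:

```typescript
// Illustrative business rules; the range and the future-date policy are
// assumptions for this sketch -- source real limits from scheme configuration.
export function checkContribution(amountGBP: number, effectiveDate: string): string[] {
  const issues: string[] = [];
  if (amountGBP < 0 || amountGBP > 1_000_000) {
    issues.push("Contribution outside plausible range.");
  }
  const parsed = Date.parse(effectiveDate);
  if (Number.isNaN(parsed)) {
    issues.push("Effective date is not a valid date.");
  } else if (parsed > Date.now()) {
    // Future-dated contributions typically need explicit approval.
    issues.push("Effective date is in the future.");
  }
  return issues;
}
```

Rules like these run after schema validation, so they only ever see well-typed values and can stay small and testable.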
Production Considerations
- **Data residency**
  - Keep document storage and model inference inside approved regions.
  - For pension schemes with UK-only processing requirements, do not route payloads through unapproved jurisdictions.
- **Auditability**
  - Log the source document hash, prompt version, model name, extracted JSON, validation outcome, and reviewer identity.
  - You need a full trace when trustees ask why a member record changed.
- **Monitoring**
  - Track extraction accuracy by document type, rejection rate by rule, human-review rate, and average turnaround time.
  - Watch for drift after template changes from employers or administrators.
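The human-review rate per document type is the single most useful drift signal. A minimal aggregation sketch, where the `ExtractionOutcome` shape is an assumption for illustration:

```typescript
// Minimal monitoring sketch: human-review rate per document type.
// The outcome record shape is an illustrative assumption.
interface ExtractionOutcome {
  documentType: string;
  status: "approved" | "review_required";
}

export function reviewRateByType(outcomes: ExtractionOutcome[]): Map<string, number> {
  const totals = new Map<string, { total: number; review: number }>();
  for (const o of outcomes) {
    const t = totals.get(o.documentType) ?? { total: 0, review: 0 };
    t.total += 1;
    if (o.status === "review_required") t.review += 1;
    totals.set(o.documentType, t);
  }
  // Convert counts to rates; a rising rate for one type often means
  // an employer or administrator changed their template.
  const rates = new Map<string, number>();
  for (const [type, t] of totals) rates.set(type, t.review / t.total);
  return rates;
}
```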
- **Guardrails**
  - Block writes on low-confidence outputs for member-impacting fields like contribution amounts or effective dates.
  - Require deterministic validation for NI numbers, dates of birth where applicable, scheme IDs, and bank details.
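The deterministic NI number check can live entirely outside the LLM path. A minimal sketch using the same simplified format rule as the schema; full HMRC validation adds further exclusions beyond this regex:

```typescript
// Simplified NI number format check, mirroring the schema's regex.
// Full HMRC validation also rejects specific invalid prefixes
// (e.g. BG, GB, NK, KN, TN, NT, ZZ), which this sketch omits.
const NI_FORMAT = /^[A-CEGHJ-NPR-TW-Z]{2}\d{6}[A-D]$/i;

export function isPlausibleNiNumber(value: string): boolean {
  // Members often write NI numbers with spaces ("AB 12 34 56 C"),
  // so strip whitespace before matching.
  return NI_FORMAT.test(value.replace(/\s+/g, ""));
}
```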
Common Pitfalls
- **Using free-form generation instead of strict schemas**
  - The model will occasionally invent labels or merge fields across pages.
  - Fix it by using `StructuredOutputParser.fromZodSchema()` or a comparable schema-first pattern.
- **Skipping OCR normalization on scanned documents**
  - Pension forms often arrive as scans with stamps, signatures, rotated pages, and poor contrast.
  - Fix it upstream with OCR plus text cleanup before handing content to LangChain.
- **Writing extracted data straight into core admin systems**
  - That creates silent corruption when the model misreads one digit in an NI number or contribution amount.
  - Fix it with a review queue plus business-rule validation before persistence.
- **Ignoring compliance metadata**
  - If you cannot prove which model processed which document under which prompt version, you will struggle in audits.
  - Fix it by storing hashes and versioned metadata alongside every extraction result.
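The OCR text cleanup mentioned above can start very simply: collapse whitespace and drop lines that repeat across most pages, which are usually headers and footers. A minimal sketch; the 80% repeat threshold is an illustrative assumption:

```typescript
// Minimal post-OCR cleanup: normalize whitespace and drop lines that
// repeat on most pages (typically headers/footers). The threshold is
// an illustrative assumption, not a tuned value.
export function cleanOcrText(pages: string[]): string {
  const normalize = (l: string) => l.trim().replace(/\s+/g, " ");
  const lineCounts = new Map<string, number>();
  for (const page of pages) {
    // Count each distinct line once per page.
    for (const line of new Set(page.split("\n").map(normalize))) {
      if (line) lineCounts.set(line, (lineCounts.get(line) ?? 0) + 1);
    }
  }
  // A line appearing on 80%+ of pages is likely a header or footer.
  const repeatThreshold = Math.max(2, Math.ceil(pages.length * 0.8));
  return pages
    .map((page) =>
      page
        .split("\n")
        .map(normalize)
        .filter((l) => l && (lineCounts.get(l) ?? 0) < repeatThreshold)
        .join("\n")
    )
    .join("\n\n");
}
```

Running this before the prompt both removes noise the model might echo into extracted fields and cuts token cost on multi-page packs.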
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.