How to Build a Document Extraction Agent Using LlamaIndex in TypeScript for Pension Funds

By Cyprian Aarons · Updated 2026-04-21
document-extraction · llamaindex · typescript · pension-funds

A document extraction agent for pension funds reads incoming PDFs, scans, statements, contribution schedules, and benefit forms, then turns them into structured data your downstream systems can trust. It matters because pension operations are document-heavy, regulated, and audit-sensitive; if you misread a beneficiary name, contribution amount, or policy date, you create compliance risk and operational rework.

Architecture

  • Document ingestion layer

    • Pulls files from S3, SharePoint, SFTP, or an internal file drop.
    • Normalizes PDFs and text files into Document objects.
  • Extraction orchestrator

    • Uses LlamaIndex to chunk documents and route them through the extraction pipeline.
    • Keeps extraction logic separate from storage and validation.
  • Structured output model

    • Defines the exact fields pension teams need:
      • member ID
      • employer name
      • contribution period
      • amount
      • policy number
      • effective date
      • fund name
  • Validation and rules engine

    • Checks extracted values against business rules.
    • Flags missing fields, invalid dates, negative amounts, and mismatched identifiers.
  • Audit trail store

    • Persists source document metadata, extracted JSON, confidence notes, and review status.
    • Required for compliance reviews and dispute handling.
  • Human review queue

    • Sends low-confidence or high-risk documents to an operations analyst.
    • Prevents silent failures on regulated records.
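Before wiring anything up, it helps to pin the architecture down as types. A minimal sketch of the structured output model and audit trail entry described above, as plain TypeScript (the ReviewStatus values and AuditEntry shape are illustrative, not from any library):

```typescript
// Structured output model: the exact fields pension teams need.
interface PensionRecord {
  memberId: string;
  employerName: string;
  contributionPeriod: string;
  amount: number;
  policyNumber: string;
  effectiveDate: string; // ISO 8601, e.g. "2026-04-01"
  fundName: string;
}

// Review outcomes for the human review queue (names are illustrative).
type ReviewStatus = "auto_approved" | "needs_review" | "rejected";

// One audit trail entry per extraction, persisted for compliance reviews.
interface AuditEntry {
  sourceId: string;
  extracted: PensionRecord;
  issues: string[]; // output of the validation and rules engine
  status: ReviewStatus;
}
```

Keeping these types in a shared module lets the extraction orchestrator, validator, and audit store agree on one schema instead of passing loose objects around.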

Implementation

1. Install the LlamaIndex TypeScript packages

Use the TypeScript runtime packages that expose Document, VectorStoreIndex, Settings, OpenAIEmbedding, and OpenAI.

npm install llamaindex zod dotenv

Set your environment variables:

export OPENAI_API_KEY="your-key"

2. Load pension documents into LlamaIndex

For production, you usually load PDFs from object storage or a secure file share. The key point is that each file becomes a Document with metadata you can audit later.

import "dotenv/config";
import { Document } from "llamaindex";

type PensionFile = {
  id: string;
  filename: string;
  text: string;
};

const files: PensionFile[] = [
  {
    id: "doc-001",
    filename: "member-contribution-april.pdf",
    text: "Member ID: PNS-44821\nEmployer: Northwind Metals\nContribution Period: April 2026\nAmount: 1250.00\nPolicy Number: POL-77881\nEffective Date: 2026-04-01",
  },
];

const documents = files.map(
  (file) =>
    new Document({
      text: file.text,
      metadata: {
        sourceId: file.id,
        filename: file.filename,
        documentType: "contribution_statement",
        jurisdiction: "ZA",
      },
    }),
);
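Since the audit trail later needs to tie extracted JSON back to the exact bytes that were read, it is worth stamping each Document's metadata with a content hash at ingestion time. A minimal sketch using Node's built-in crypto module (the contentHash helper is our own, not a LlamaIndex API):

```typescript
import { createHash } from "node:crypto";

// SHA-256 hex digest of the raw document text; store it alongside
// sourceId so audits can prove which version of a file was extracted.
function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Example: merge the hash into the metadata built in step 2.
const metadataWithHash = {
  sourceId: "doc-001",
  documentHash: contentHash("Member ID: PNS-44821"),
};
```

If a member later disputes a contribution record, the hash lets you confirm the stored extraction really came from the document on file.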

3. Build an index and extract structured fields

For extraction workloads, I use a retrieval-backed pattern even when the source is a single document. It gives you a clean path to scale from one statement to thousands of pages across multiple repositories.

import {
  VectorStoreIndex,
  Settings,
  OpenAIEmbedding,
  OpenAI,
} from "llamaindex";
import { z } from "zod";

Settings.embedModel = new OpenAIEmbedding({
  model: "text-embedding-3-small",
});
Settings.llm = new OpenAI({
  model: "gpt-4o-mini",
});

const PensionExtractionSchema = z.object({
  memberId: z.string(),
  employerName: z.string(),
  contributionPeriod: z.string(),
  amount: z.number(),
  policyNumber: z.string(),
  effectiveDate: z.string(),
});

async function extractPensionData() {
  const index = await VectorStoreIndex.fromDocuments(documents);
  // keep the model anchored to the source text
  const queryEngine = index.asQueryEngine({
    retriever: index.asRetriever({ similarityTopK: 3 }),
  });

  const response = await queryEngine.query({
    query:
      "Extract memberId, employerName, contributionPeriod, amount, policyNumber, and effectiveDate as JSON only.",
  });

  const rawText =
    typeof response.response === "string"
      ? response.response
      : String(response.response);

  return PensionExtractionSchema.parse(JSON.parse(rawText));
}

extractPensionData().then(console.log);

This pattern works because LlamaIndex handles retrieval over the indexed content while your schema enforces structure at the edge. In pension workflows, that schema boundary is where you stop free-form model output from leaking into finance systems.
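One practical wrinkle at that schema boundary: even when prompted for "JSON only," chat models sometimes wrap the payload in markdown fences, which makes JSON.parse throw. A small defensive helper (our own, not part of LlamaIndex) that strips fences before parsing:

```typescript
// Strip optional ```json ... ``` fences and surrounding whitespace
// so the model's payload parses cleanly; pass through plain JSON as-is.
function extractJsonPayload(raw: string): string {
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)\s*```/);
  return (fenced ? fenced[1] : raw).trim();
}
```

Run rawText through this helper before JSON.parse, and let Zod's parse remain the final gate on structure.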

4. Add validation and routing for review

Never auto-post extracted pension data straight into the core admin system. Validate first, then route exceptions to humans.

function validatePensionRecord(record: z.infer<typeof PensionExtractionSchema>) {
  const issues: string[] = [];

  if (!record.memberId.startsWith("PNS-")) issues.push("Invalid member ID format");
  if (record.amount <= 0) issues.push("Contribution amount must be positive");
  if (Number.isNaN(Date.parse(record.effectiveDate))) issues.push("Invalid effective date");

  return { valid: issues.length === 0, issues };
}

async function runPipeline() {
  const record = await extractPensionData();
  const { valid, issues } = validatePensionRecord(record);

  if (valid) {
    // safe to hand off to the admin system's intake queue
    console.log("Validated:", record);
  } else {
    // route to the human review queue instead of auto-posting
    console.warn("Needs review:", record.memberId, issues);
  }
}

In practice, complete this step by storing:

  • source metadata from Document.metadata
  • parsed JSON payload
  • validation results
  • reviewer outcome

That gives you traceability for audits and disputes.
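A minimal shape for that stored record might look like the sketch below (the buildAuditRecord helper and its field names are assumptions for illustration, wiring together the metadata, payload, validation results, and reviewer outcome listed above):

```typescript
type AuditRecord = {
  sourceId: string;
  filename: string;
  payload: unknown; // the parsed JSON from the extractor
  issues: string[]; // validation results
  reviewerOutcome: "pending" | "approved" | "overridden";
  recordedAt: string;
};

function buildAuditRecord(
  meta: { sourceId: string; filename: string },
  payload: unknown,
  issues: string[],
): AuditRecord {
  return {
    ...meta,
    payload,
    issues,
    // clean records can be auto-approved; flagged ones wait for a reviewer
    reviewerOutcome: issues.length === 0 ? "approved" : "pending",
    recordedAt: new Date().toISOString(),
  };
}
```

Persist one of these per extraction, and a dispute becomes a lookup rather than a reconstruction exercise.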

Production Considerations

  • Data residency

    • Keep embeddings, vector stores, and logs in-region if your pension fund operates under local residency rules.
    • Don’t ship member documents to unmanaged third-party services without a legal basis.
  • Auditability

    • Persist every extraction with:
      • document hash
      • source filename
      • model version
      • prompt version
      • extracted JSON
      • reviewer override history
  • Guardrails

    • Block auto-processing when critical fields are missing or confidence is low.
    • Require human sign-off for beneficiary changes, retirement claims, or bank detail updates.
  • Monitoring

    • Track schema parse failures, validation flag rates, and review-queue volume.
    • Alert when those rates shift after a model or prompt version change.

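The guardrail rules above can be sketched as one pure function that decides whether a record may be auto-processed (the GuardrailInput shape and the 0.9 confidence threshold are assumptions for illustration):

```typescript
type GuardrailInput = {
  missingCriticalFields: string[];
  confidence: number; // 0..1, from your extraction scoring
  riskFlags: Array<"beneficiary_change" | "retirement_claim" | "bank_detail_update">;
};

// Returns true only when the record is safe to auto-process;
// anything else is routed to the human review queue.
function canAutoProcess(input: GuardrailInput): boolean {
  if (input.missingCriticalFields.length > 0) return false;
  if (input.confidence < 0.9) return false; // threshold is illustrative
  if (input.riskFlags.length > 0) return false; // always needs human sign-off
  return true;
}
```

Keeping this as a pure function makes the guardrail policy easy to unit-test and to review with compliance before it ever touches production data.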
By Cyprian Aarons, AI Consultant at Topiax.
