How to Build a Document Extraction Agent Using LangGraph in TypeScript for Pension Funds

By Cyprian Aarons · Updated 2026-04-21
document-extraction · langgraph · typescript · pension-funds

A document extraction agent for pension funds reads incoming PDFs, scans, statements, contribution schedules, and benefit letters, then turns them into structured data your downstream systems can trust. That matters because pension operations are full of repetitive document handling, strict audit requirements, and high-cost errors when member data, contribution amounts, or beneficiary details are extracted incorrectly.

Architecture

Build this agent with a small set of components that each do one job well:

  • Document ingestion layer

    • Accept PDFs, images, and email attachments.
    • Normalize files into text plus metadata like source system, upload time, and member ID.
  • OCR / text extraction service

    • Use OCR for scanned documents.
    • Preserve page numbers and bounding context for audit trails.
  • LangGraph orchestration layer

    • Route documents through classify → extract → validate → review.
    • Keep the workflow explicit so every step is observable and replayable.
  • LLM extraction node

    • Convert unstructured text into a typed schema.
    • Extract fields like member name, scheme number, contribution period, employer name, and amounts.
  • Validation and policy checks

    • Verify required fields, date formats, currency values, and cross-field consistency.
    • Flag anything that violates pension fund rules or looks incomplete.
  • Human review queue

    • Send low-confidence or policy-sensitive cases to operations staff.
    • Store reviewer decisions for audit and model improvement.
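The ingestion layer's output can be sketched as one normalized shape that every downstream node consumes. The field names here (sourceSystem, memberId, pages, and so on) are illustrative, not a fixed contract:

```typescript
import { randomUUID } from "node:crypto";

// Normalized document handed from ingestion to the graph.
// Page-level text is preserved so later nodes can cite provenance.
interface IngestedDocument {
  id: string;                  // stable ID for audit correlation
  sourceSystem: string;        // e.g. "member-portal", "employer-sftp"
  uploadedAt: string;          // ISO 8601 timestamp
  memberId?: string;           // set when the upload is already linked to a member
  mimeType: string;            // "application/pdf", "image/tiff", ...
  rawText: string;             // concatenated OCR / native text output
  pages: { pageNumber: number; text: string }[];
}

function normalizeUpload(
  sourceSystem: string,
  mimeType: string,
  pages: { pageNumber: number; text: string }[],
  memberId?: string
): IngestedDocument {
  return {
    id: randomUUID(),
    sourceSystem,
    uploadedAt: new Date().toISOString(),
    memberId,
    mimeType,
    rawText: pages.map((p) => p.text).join("\n"),
    pages,
  };
}
```

Keeping `pages` alongside the flattened `rawText` costs little and makes the audit-trail requirements later in this guide much easier to meet.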

Implementation

1) Define the schema and graph state

For pension funds, your schema should be strict. Loose JSON is how you end up with broken downstream postings and bad audit evidence.

import { z } from "zod";
import { StateGraph, Annotation, START, END } from "@langchain/langgraph";

const ExtractionSchema = z.object({
  documentType: z.enum(["contribution_statement", "benefit_letter", "member_form", "other"]),
  memberName: z.string().optional(),
  schemeNumber: z.string().optional(),
  employerName: z.string().optional(),
  contributionPeriod: z.string().optional(),
  currency: z.string().optional(),
  totalContribution: z.number().optional(),
  confidence: z.number().min(0).max(1),
});

const GraphState = Annotation.Root({
  rawText: Annotation<string>(),
  documentType: Annotation<string>(),
  extracted: Annotation<z.infer<typeof ExtractionSchema> | null>(),
  needsReview: Annotation<boolean>(),
});

2) Add classification and extraction nodes

Use a classifier first. Pension fund documents vary enough that routing early reduces bad extractions and keeps prompts tighter.

import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

async function classifyDocument(state: typeof GraphState.State) {
  const result = await llm.invoke([
    {
      role: "system",
      content:
        "Classify the pension document into one of: contribution_statement, benefit_letter, member_form, other.",
    },
    { role: "user", content: state.rawText },
  ]);

  return { documentType: result.content.toString().trim() };
}

async function extractFields(state: typeof GraphState.State) {
  const structured = await llm.withStructuredOutput(ExtractionSchema).invoke([
    {
      role: "system",
      content:
        "Extract pension fund document fields exactly. Return null/empty when absent. Do not infer values.",
    },
    {
      role: "user",
      content: `Document type: ${state.documentType}\n\nText:\n${state.rawText}`,
    },
  ]);

  return { extracted: structured };
}

3) Validate results and route to review when needed

This is where production systems differ from demos. You need deterministic checks before anything hits a ledger or case management system.

function validateExtraction(state: typeof GraphState.State) {
  const e = state.extracted;
  if (!e) return { needsReview: true };

  const missingCritical =
    !e.memberName || !e.schemeNumber || e.confidence < 0.85;

  const invalidAmount =
    typeof e.totalContribution === "number" && e.totalContribution < 0;

  return {
    needsReview:
      missingCritical ||
      invalidAmount ||
      (e.documentType === "contribution_statement" && !e.contributionPeriod),
  };
}

async function humanReviewQueue(state: typeof GraphState.State) {
  // Persist to your case management system here.
  // Include rawText hash, source metadata, reviewer notes, and model version.
  return { needsReview: true };
}

4) Assemble the LangGraph workflow

This is the actual orchestration pattern you want in TypeScript.

const graph = new StateGraph(GraphState)
  .addNode("classifyDocument", classifyDocument)
  .addNode("extractFields", extractFields)
  .addNode("validateExtraction", validateExtraction)
  .addNode("humanReviewQueue", humanReviewQueue)
  .addEdge(START, "classifyDocument")
  .addEdge("classifyDocument", "extractFields")
  .addEdge("extractFields", "validateExtraction")
  .addConditionalEdges("validateExtraction", (state) =>
    state.needsReview ? "humanReviewQueue" : END
  )
  .addEdge("humanReviewQueue", END);

const app = graph.compile();

const result = await app.invoke({
  rawText:
    "Member Name: Jane Doe\nScheme Number: PF-10291\nEmployer: Northwind Ltd\nContribution Period: March 2026\nTotal Contribution: GBP 1245.50",
  documentType: "",
  extracted: null,
  needsReview: false,
});

console.log(result);

Production Considerations

  • Data residency

    • Keep OCR output, prompts, and extracted payloads in-region if your pension fund operates under local residency rules.
    • Avoid sending raw documents to external services unless contracts explicitly cover retention and processing location.
  • Auditability

    • Store the original document hash, model version, prompt version, extraction result, validation outcome, and reviewer action.
    • Pension administrators need traceability when members dispute contributions or benefit calculations.
  • Guardrails

    • Reject outputs that fail schema validation or contain inferred values not present in the source.
    • Add deterministic checks for currency codes, dates, scheme numbers, and contribution totals before persistence.
  • Monitoring

    • Track extraction accuracy by document type and by upstream source system.
    • Alert on spikes in review rate; that usually means OCR quality dropped or a template changed.
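For the auditability bullet, a minimal record sketch: the field names are assumptions to adapt to your case management system, but hashing the original bytes and pinning model and prompt versions is the core of it.

```typescript
import { createHash } from "node:crypto";

// Hypothetical audit record persisted per extraction attempt.
interface AuditRecord {
  documentSha256: string;   // hash of the original bytes, not the OCR text
  modelVersion: string;     // e.g. "gpt-4o-mini-2026-01"
  promptVersion: string;    // version your prompt templates under source control
  extraction: unknown;      // the structured output as returned
  validationPassed: boolean;
  reviewerAction?: string;  // filled in later by the human review queue
  recordedAt: string;       // ISO 8601 timestamp
}

function buildAuditRecord(
  rawDocument: Buffer,
  extraction: unknown,
  validationPassed: boolean,
  modelVersion: string,
  promptVersion: string
): AuditRecord {
  return {
    documentSha256: createHash("sha256").update(rawDocument).digest("hex"),
    modelVersion,
    promptVersion,
    extraction,
    validationPassed,
    recordedAt: new Date().toISOString(),
  };
}
```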
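The guardrails bullet can be made concrete with a deterministic pre-persistence check. The currency whitelist and the `PF-#####` scheme-number pattern below are assumptions drawn from this guide's sample data; substitute your fund's real formats.

```typescript
// Deterministic checks run before anything is persisted.
// Whitelist and regex are illustrative, not regulatory requirements.
const KNOWN_CURRENCIES = new Set(["GBP", "EUR", "USD"]);
const SCHEME_NUMBER = /^PF-\d{5}$/; // assumed format, e.g. "PF-10291"

function passesGuardrails(e: {
  currency?: string;
  schemeNumber?: string;
  contributionPeriod?: string;
  totalContribution?: number;
}): boolean {
  if (e.currency && !KNOWN_CURRENCIES.has(e.currency)) return false;
  if (e.schemeNumber && !SCHEME_NUMBER.test(e.schemeNumber)) return false;
  if (e.contributionPeriod && Number.isNaN(Date.parse(e.contributionPeriod)))
    return false;
  if (
    typeof e.totalContribution === "number" &&
    (e.totalContribution < 0 || !Number.isFinite(e.totalContribution))
  )
    return false;
  return true;
}
```

Because these checks are deterministic, a failure here is a hard reject: the record goes to the review queue rather than the ledger.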
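The review-rate alert from the monitoring bullet needs nothing more than a rolling window. A minimal sketch, with window size and threshold as illustrative defaults:

```typescript
// Rolling review-rate monitor: alerts when the share of documents
// routed to human review exceeds a threshold over the last N documents.
class ReviewRateMonitor {
  private outcomes: boolean[] = [];

  constructor(
    private windowSize = 200,
    private alertThreshold = 0.3
  ) {}

  // Record one outcome; returns true when the current rate breaches the threshold.
  record(needsReview: boolean): boolean {
    this.outcomes.push(needsReview);
    if (this.outcomes.length > this.windowSize) this.outcomes.shift();
    return this.reviewRate() > this.alertThreshold;
  }

  reviewRate(): number {
    if (this.outcomes.length === 0) return 0;
    return this.outcomes.filter(Boolean).length / this.outcomes.length;
  }
}
```

Segment one monitor per document type and per source system, since a spike usually localizes to one upstream template or scanner.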

Common Pitfalls

  1. Using one prompt for every document type

    • Contribution statements and benefit letters do not have the same structure.
    • Split by class first or your extraction quality will drift fast.
  2. Skipping validation because the LLM returned valid JSON

    • Valid JSON is not valid business data.
    • Always check required fields, numeric ranges, date formats, and cross-field consistency.
  3. Not preserving provenance

    • If you cannot show where a field came from on the source document, you will struggle in audits.
    • Persist page number references or at least a document hash plus extraction trace for every record.
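Field-level provenance from pitfall 3 can be approximated by locating each extracted value verbatim in the preserved page text. A hypothetical helper, assuming the page structure from the ingestion layer:

```typescript
// Where on the source document an extracted value was found.
interface FieldProvenance {
  field: string;
  value: string;
  pageNumber: number;
  snippet: string; // surrounding text on the source page
}

function locateField(
  pages: { pageNumber: number; text: string }[],
  field: string,
  value: string
): FieldProvenance | null {
  for (const page of pages) {
    const idx = page.text.indexOf(value);
    if (idx >= 0) {
      return {
        field,
        value,
        pageNumber: page.pageNumber,
        snippet: page.text.slice(Math.max(0, idx - 40), idx + value.length + 40),
      };
    }
  }
  // Value not found verbatim: treat it as inferred and send to review.
  return null;
}
```

Exact-match lookup is deliberately strict; a null here is a useful signal that the model inferred a value rather than read it.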

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
