How to Build a Document Extraction Agent Using LangChain in TypeScript for Insurance

By Cyprian Aarons · Updated 2026-04-21

Tags: document-extraction, langchain, typescript, insurance

A document extraction agent for insurance takes messy inputs like claim forms, ACORD PDFs, medical reports, repair invoices, and policy schedules, then turns them into structured data your downstream systems can trust. That matters because insurance ops live or die on turnaround time, auditability, and accuracy: one bad extraction can delay a claim, misprice a risk, or create a compliance issue.

Architecture

  • Document ingestion layer

    • Accept PDFs, scanned images, email attachments, and office docs.
    • Normalize them into text plus metadata like filename, source system, claimant ID, and ingestion timestamp.
  • OCR / text extraction layer

    • Use OCR for scanned forms and image-based PDFs.
    • Keep page-level provenance so you can trace every extracted field back to the source.
  • LangChain extraction chain

    • Use ChatOpenAI with structured output via withStructuredOutput().
    • Define a strict schema for insurance fields like policy number, loss date, insured name, claim amount, and coverage type.
  • Validation and enrichment layer

    • Validate required fields, date formats, currency values, and cross-field consistency.
    • Add business rules like “loss date cannot be after submission date.”
  • Persistence and audit layer

    • Store the raw document hash, extracted JSON, model version, prompt version, and confidence signals.
    • This is what makes the system defensible during audits and disputes.
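The ingestion and audit layers above can be sketched as a small normalized record that carries a content hash alongside the text. This is a minimal sketch: the `NormalizedDocument` and `ingestDocument` names are illustrative, not part of LangChain, and a real pipeline would add claimant ID, page-level provenance, and OCR metadata.

```typescript
import { createHash } from "node:crypto";

// Illustrative shape for the ingestion layer's output. Every downstream
// record keeps this metadata so extracted fields stay traceable.
interface NormalizedDocument {
  text: string;
  sourceFile: string;
  sourceSystem: string;
  ingestedAt: string; // ISO timestamp
  contentHash: string; // SHA-256 of the raw bytes, for the audit layer
}

function ingestDocument(
  rawBytes: Buffer,
  text: string,
  sourceFile: string,
  sourceSystem: string
): NormalizedDocument {
  return {
    text,
    sourceFile,
    sourceSystem,
    ingestedAt: new Date().toISOString(),
    contentHash: createHash("sha256").update(rawBytes).digest("hex"),
  };
}

const doc = ingestDocument(
  Buffer.from("ACME INSURANCE CLAIM FORM ..."),
  "ACME INSURANCE CLAIM FORM ...",
  "claim-form.pdf",
  "email-gateway"
);
console.log(doc.contentHash.length); // SHA-256 hex digest: 64 characters
```

Hashing the raw bytes (not the OCR text) means you can later prove exactly which document a given extraction came from, even if the OCR vendor changes.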

Implementation

1) Install dependencies

Use LangChain’s TypeScript packages plus a PDF loader. For production insurance workflows, keep the ingestion pipeline separate from the extraction chain so you can swap OCR vendors later without touching your prompt logic.

npm install langchain @langchain/openai @langchain/core zod pdf-parse

Set your environment variable:

export OPENAI_API_KEY="your-key"

2) Define the extraction schema

Insurance extraction should be typed from the start. If you let the model return free-form JSON without validation, you will spend your time cleaning up garbage instead of processing claims.

import { z } from "zod";

export const InsuranceClaimSchema = z.object({
  policyNumber: z.string().min(1),
  claimNumber: z.string().optional(),
  insuredName: z.string().min(1),
  claimantName: z.string().optional(),
  lossDate: z.string().min(1), // ISO date string expected
  submissionDate: z.string().optional(),
  coverageType: z.enum(["auto", "property", "health", "life", "liability", "other"]),
  claimAmount: z.number().nonnegative().optional(),
  currency: z.string().default("USD"),
  adjusterNotes: z.string().optional(),
});

export type InsuranceClaim = z.infer<typeof InsuranceClaimSchema>;

3) Build the LangChain extractor

This pattern uses ChatOpenAI and withStructuredOutput() so the model returns validated data directly into your schema. It’s cleaner than parsing arbitrary text with regexes or post-processing brittle JSON blobs.

import { ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { InsuranceClaimSchema } from "./schema";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const extractor = llm.withStructuredOutput(InsuranceClaimSchema);

export async function extractInsuranceClaim(docText: string) {
  const result = await extractor.invoke([
    {
      role: "system",
      content:
        "You extract insurance claim data from documents. Return only fields supported by the schema. If a field is missing, omit it rather than guessing.",
    },
    {
      role: "user",
      content: `
Extract structured claim data from this document.

Document:
${docText}
`,
    },
  ]);

  return result;
}

async function main() {
  const sampleDoc = new Document({
    pageContent: `
ACME INSURANCE CLAIM FORM
Policy Number: POL-10482
Claim Number: CLM-77821
Insured Name: Jordan Smith
Loss Date: 2025-02-14
Coverage Type: auto
Estimated Claim Amount: $4,250.00
Currency: USD
Adjuster Notes: Front bumper damage after collision.
`,
    metadata: {
      sourceFile: "claim-form.pdf",
      tenantId: "insurer-east",
    },
  });

  const extracted = await extractInsuranceClaim(sampleDoc.pageContent);
  console.log(JSON.stringify(extracted, null, 2));
}

main().catch(console.error);

4) Add validation and audit logging

Don’t stop at model output. In insurance workflows you need deterministic checks before data enters claims systems or underwriting queues.

import { InsuranceClaimSchema } from "./schema";

export function validateAndAudit(rawResult: unknown) {
  const parsed = InsuranceClaimSchema.safeParse(rawResult);

  if (!parsed.success) {
    return {
      ok: false,
      errors: parsed.error.flatten(),
    };
  }

  const claim = parsed.data;

  // Example business rule check. ISO-8601 date strings order correctly
  // under lexicographic comparison, so string comparison is safe here.
  if (claim.submissionDate && claim.lossDate > claim.submissionDate) {
    return {
      ok: false,
      errors: {
        businessRule:
          "lossDate cannot be after submissionDate for this workflow",
      },
    };
  }

  return {
    ok: true,
    data: claim,
    auditRecord: {
      extractedAt: new Date().toISOString(),
      modelProvider: "openai",
      modelName: "gpt-4o-mini",
      schemaVersion: "v1",
    },
  };
}

Production Considerations

  • Data residency

    • Insurance data often has jurisdictional constraints.
    • Route documents to region-specific deployments and avoid sending regulated content to endpoints that violate residency requirements.
  • Auditability

    • Persist the original document hash, extracted JSON, prompt version, model name, and validation outcome.
    • If an adjuster challenges a decision later, you need evidence of what was seen and what was returned.
  • Guardrails

    • Never allow the model to invent missing policy numbers or dates.
    • Use schema validation plus business rules before writing to claims or policy admin systems.
  • Monitoring

    • Track field-level accuracy on sampled documents.
    • Watch policyNumber accuracy, lossDate parse failures, manual review rate, and extraction latency p95.
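The latency side of that monitoring can be sketched with a simple nearest-rank percentile over recorded per-document extraction times. This is an illustrative helper, assuming you already capture latencies in milliseconds; in production you would feed these into your metrics system rather than compute them inline.

```typescript
// Nearest-rank percentile over recorded extraction latencies (ms).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples recorded");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(rank, sorted.length) - 1];
}

// Hypothetical per-document extraction times from a sampled batch.
const latenciesMs = [820, 950, 1100, 1240, 4300, 990, 1010, 870, 930, 1500];
console.log(`extraction latency p95: ${percentile(latenciesMs, 95)} ms`);
```

A single slow outlier (the 4300 ms document above) dominates the p95, which is exactly why percentiles beat averages for spotting OCR-heavy documents that stall the queue.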

Common Pitfalls

  1. Using free-form text output instead of structured output

    • This leads to brittle parsing logic and silent failures.
    • Use withStructuredOutput() with a Zod schema so invalid outputs fail fast.
  2. Skipping provenance

    • If you don’t store which page or document produced each field, audits become painful.
    • Keep source metadata alongside every extracted record.
  3. Trusting OCR blindly

    • Scanned insurance documents often contain skewed text, merged lines, and missing characters.
    • Run OCR confidence thresholds and send low-confidence pages to manual review.
  4. Ignoring business rules

    • A valid JSON object can still be wrong for insurance operations.
    • Add checks for impossible dates, negative amounts where they don’t make sense, and mismatched policy/claim identifiers.
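Pitfall 3 above can be handled with a small routing step between OCR and extraction. This is a sketch under assumptions: the `OcrPage` shape and `routePagesByConfidence` helper are illustrative, and real OCR vendors expose confidence scores in their own formats.

```typescript
// Illustrative page shape; most OCR APIs expose some 0–1 confidence score.
interface OcrPage {
  pageNumber: number;
  text: string;
  confidence: number;
}

// Split pages into an auto-extraction queue and a manual-review queue.
function routePagesByConfidence(pages: OcrPage[], threshold = 0.85) {
  const auto: OcrPage[] = [];
  const manual: OcrPage[] = [];
  for (const page of pages) {
    (page.confidence >= threshold ? auto : manual).push(page);
  }
  return { auto, manual };
}

const { auto, manual } = routePagesByConfidence([
  { pageNumber: 1, text: "Policy Number: POL-10482", confidence: 0.97 },
  { pageNumber: 2, text: "Lo?s D?te: 2O25-02-l4", confidence: 0.41 },
]);
console.log(auto.length, manual.length); // 1 1
```

Tune the threshold against your own sampled accuracy data: too low and garbled pages reach the extractor, too high and the manual review rate climbs.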

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
