How to Build a KYC Verification Agent Using LlamaIndex in TypeScript for Insurance

By Cyprian Aarons · Updated 2026-04-21
Tags: kyc-verification, llamaindex, typescript, insurance

A KYC verification agent for insurance checks customer identity documents, extracts key fields, compares them against policy application data, and flags mismatches for human review. It matters because insurers need faster onboarding without weakening compliance, auditability, or fraud controls.

Architecture

  • Document ingestion layer

    • Accepts PDFs, scans, and image-derived text from passports, driver’s licenses, utility bills, and proof-of-address documents.
    • Normalizes input into text chunks that LlamaIndex can index.
  • KYC extraction agent

    • Uses an LLM-backed query engine to pull structured fields like full name, DOB, address, document number, and expiry date.
    • Produces JSON-like output that downstream systems can validate.
  • Policy application matcher

    • Compares extracted KYC data against insurer CRM or policy intake records.
    • Detects mismatches in spelling, address formatting, and document validity.
  • Risk and compliance rules layer

    • Enforces insurer-specific rules: sanctioned geography checks, age thresholds, document freshness, and missing-field policies.
    • Keeps deterministic checks outside the LLM.
  • Audit logging layer

    • Stores prompts, retrieval context, model outputs, timestamps, and reviewer decisions.
    • Supports internal audit and regulator inquiries.
  • Human review queue

    • Escalates low-confidence cases to ops or compliance staff.
    • Prevents the agent from auto-approving edge cases.
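
The data flowing between these layers can be sketched with a few TypeScript types. These shapes are illustrative assumptions for this article, not part of LlamaIndex or any insurer system:

```typescript
// Illustrative types for the pipeline; field names are assumptions.
interface KycExtraction {
  fullName: string;
  dateOfBirth: string; // ISO 8601 date
  address: string;
  documentType: string;
  documentNumber: string;
  expiryDate: string; // ISO 8601 date
}

interface ValidationResult {
  status: "approved" | "review";
  mismatches: string[]; // field names that failed comparison
}

// One audit entry per case, so compliance can reconstruct the decision.
interface AuditEntry {
  caseId: string;
  prompt: string;
  retrievedContextIds: string[];
  extraction: KycExtraction;
  validation: ValidationResult;
  timestamp: string;
  reviewerDecision?: "approve" | "reject";
}

const example: AuditEntry = {
  caseId: "case_001",
  prompt: "Extract KYC fields as JSON.",
  retrievedContextIds: ["chunk_0"],
  extraction: {
    fullName: "Jane A. Doe",
    dateOfBirth: "1990-04-12",
    address: "18 King Street, London SW1A 1AA",
    documentType: "Passport",
    documentNumber: "X12345678",
    expiryDate: "2030-08-15",
  },
  validation: { status: "approved", mismatches: [] },
  timestamp: new Date().toISOString(),
};
```

Keeping these shapes explicit makes the audit logging layer trivial: every case produces one `AuditEntry`, and nothing is approved without one.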

Implementation

1) Install the LlamaIndex TypeScript packages

Use the core package plus an OpenAI-compatible LLM client. In production insurance workflows, keep the model choice explicit so you can control residency and retention policies.

npm install llamaindex @llamaindex/openai dotenv zod

Set your environment variables:

OPENAI_API_KEY=your_key

2) Load KYC documents into a vector index

The pattern here is simple: ingest the document text once, index it with VectorStoreIndex, then query specific KYC fields with a structured prompt. This works well for insurance because most onboarding packets are small enough to index per case.

import "dotenv/config";
import {
  Document,
  VectorStoreIndex,
  Settings,
} from "llamaindex";
import { OpenAI } from "@llamaindex/openai";

Settings.llm = new OpenAI({
  model: "gpt-4o-mini",
});

const kycText = `
Customer: Jane A. Doe
Date of Birth: 1990-04-12
Address: 18 King Street, London SW1A 1AA
Document Type: Passport
Document Number: X12345678
Expiry Date: 2030-08-15
`;

async function main() {
  const docs = [
    new Document({
      text: kycText,
      metadata: { caseId: "case_001", source: "uploaded_passport" },
    }),
  ];

  const index = await VectorStoreIndex.fromDocuments(docs);
  const queryEngine = index.asQueryEngine();

  const response = await queryEngine.query({
    query: `
Extract these KYC fields as JSON:
fullName, dateOfBirth, address, documentType, documentNumber, expiryDate.
Return only valid JSON.
`,
  });

  console.log(response.toString());
}

main();

3) Add deterministic validation around the LLM output

Do not trust the model output blindly. Parse it with zod, then compare it to application data using exact rules you control. That keeps the decision logic auditable.

import { z } from "zod";

const KycSchema = z.object({
  fullName: z.string(),
  dateOfBirth: z.string(),
  address: z.string(),
  documentType: z.string(),
  documentNumber: z.string(),
  expiryDate: z.string(),
});

type KycRecord = z.infer<typeof KycSchema>;

function validateKyc(extracted: unknown, applicationData: KycRecord) {
  // Schema-check the LLM output before any comparison.
  const parsed = KycSchema.parse(extracted);

  const mismatches: string[] = [];
  if (parsed.fullName.toLowerCase() !== applicationData.fullName.toLowerCase()) {
    mismatches.push("fullName");
  }
  if (parsed.dateOfBirth !== applicationData.dateOfBirth) {
    mismatches.push("dateOfBirth");
  }

  return {
    status: mismatches.length === 0 ? "approved" : "review",
    mismatches,
    extracted: parsed,
  };
}
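
Exact string comparison will escalate cases over harmless formatting differences ("Jane  A. Doe" vs. "jane a doe"). A small normalization pass before comparing reduces false escalations. This helper is a sketch, not part of any library:

```typescript
// Normalize a name or address for comparison: lowercase, strip
// common punctuation, collapse runs of whitespace. Illustrative only.
function normalizeField(value: string): string {
  return value
    .toLowerCase()
    .replace(/[.,'-]/g, " ") // drop common punctuation
    .replace(/\s+/g, " ")    // collapse whitespace
    .trim();
}

// Compare two fields after normalization.
function fieldsMatch(a: string, b: string): boolean {
  return normalizeField(a) === normalizeField(b);
}
```

Keep the normalization rules in code, next to the other deterministic checks, so auditors can see exactly what counted as a match.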

4) Put retrieval and review behind a case workflow

For insurance operations, each case should produce a review packet with evidence. Store the raw extraction result plus the source metadata so compliance can reconstruct why a case was approved or escalated.

import { QueryEngineTool } from "llamaindex";

async function buildReviewPacket(index: VectorStoreIndex) {
  const queryEngine = index.asQueryEngine();

  const tool = QueryEngineTool.fromDefaults({
    queryEngine,
    name: "kyc_document_lookup",
    description:
      "Extract identity fields from customer KYC documents for insurance onboarding.",
  });

  const response = await tool.call({
    input:
      "Return full name, DOB, address, document type, document number and expiry date as JSON.",
  });

  return {
    evidenceText: response.toString(),
    reviewedAt: new Date().toISOString(),
  };
}

Production Considerations

  • Keep PII in-region

    • If your insurer has data residency requirements, host storage and model endpoints in the approved region.
    • Do not ship passport images or raw extracts to unapproved third-party services.
  • Log every decision path

    • Persist prompt text, retrieved context IDs, extraction output, validation result, and human override reason.
    • Auditors care about traceability more than model confidence scores.
  • Use confidence-based routing

    • Auto-pass only when deterministic checks succeed and extraction is complete.
    • Route partial matches or low-quality OCR to manual review.
  • Separate policy rules from model behavior

    • Keep sanction screening thresholds, age rules, and doc-expiry logic in code or rules engines.
    • The LLM should extract facts; it should not decide compliance outcomes alone.
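
A minimal sketch of what "rules in code" can look like, assuming ISO 8601 date strings. The minimum-age threshold and rule names here are illustrative; real values come from the insurer's compliance policy:

```typescript
// Deterministic compliance rules, kept outside the LLM.
// MIN_AGE_YEARS is an illustrative threshold, not a regulatory constant.
const MIN_AGE_YEARS = 18;

// Whole years between an ISO date and a reference Date, using UTC fields.
function yearsBetween(fromIso: string, to: Date): number {
  const from = new Date(fromIso);
  const years = to.getUTCFullYear() - from.getUTCFullYear();
  const beforeBirthday =
    to.getUTCMonth() < from.getUTCMonth() ||
    (to.getUTCMonth() === from.getUTCMonth() && to.getUTCDate() < from.getUTCDate());
  return beforeBirthday ? years - 1 : years;
}

function checkRules(
  record: { dateOfBirth: string; expiryDate: string },
  now: Date = new Date(),
) {
  const failures: string[] = [];
  if (yearsBetween(record.dateOfBirth, now) < MIN_AGE_YEARS) {
    failures.push("underMinimumAge");
  }
  if (new Date(record.expiryDate) <= now) {
    failures.push("documentExpired");
  }
  return { passed: failures.length === 0, failures };
}
```

Because these checks are plain code, the failure list can go straight into the audit log, and changing a threshold is a code review rather than a prompt change.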

Common Pitfalls

  1. Treating the LLM as the source of truth

    • The model extracts data; your validator decides whether it matches policy intake records.
    • Avoid this by enforcing schema parsing plus deterministic comparison before any approval.
  2. Skipping audit artifacts

    • If you only store the final answer, you lose defensibility when compliance asks why a case passed.
    • Avoid this by storing source metadata, extracted fields, timestamps, and reviewer actions per case.
  3. Ignoring OCR noise in scanned documents

    • Bad scans produce swapped digits and broken addresses that look plausible to an LLM.
    • Avoid this by pre-processing with OCR quality checks and sending low-quality cases straight to human review.
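
One cheap pre-check before the LLM ever sees the text: score the OCR output with simple heuristics and route low-scoring cases straight to review. The scoring rule and threshold below are made-up illustrations; a real pipeline would prefer the OCR engine's own per-character confidence scores where available:

```typescript
// Heuristic OCR quality score: fraction of characters that are
// ordinary letters, digits, whitespace, or common punctuation.
// Replacement characters and control garbage drag the score down.
function ocrQualityScore(text: string): number {
  if (text.length === 0) return 0;
  const good = text.match(/[a-zA-Z0-9\s.,:\/-]/g)?.length ?? 0;
  return good / text.length;
}

// Illustrative threshold; tune against your own document corpus.
function routeByOcrQuality(text: string, threshold = 0.95): "extract" | "human_review" {
  return ocrQualityScore(text) >= threshold ? "extract" : "human_review";
}
```

The point is not the specific heuristic but the routing decision happening before extraction, so a plausible-looking hallucination over a bad scan never reaches auto-approval.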

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
