How to Build a Document Extraction Agent Using LlamaIndex in TypeScript for Insurance
A document extraction agent for insurance reads inbound PDFs, scans, emails, and claim packets, then turns them into structured fields your downstream systems can trust. For insurance teams, that means faster FNOL intake, less manual data entry, cleaner underwriting workflows, and a better audit trail when regulators ask how a field was extracted.
Architecture
- Document ingestion layer
  - Pulls files from S3, SharePoint, email attachments, or an internal upload API.
  - Normalizes everything into text plus metadata like source, tenant, policy number, and jurisdiction.
- Document parsing layer
  - Uses LlamaIndex readers to load PDFs and other supported formats.
  - Handles page-level text extraction before any LLM call.
- Extraction schema
  - Defines the exact fields you need: claimant name, policy number, loss date, coverage type, reserve amount, and so on.
  - Keeps the output stable for downstream claims or underwriting systems.
- LLM extraction engine
  - Uses LlamaIndex's structured prediction APIs to map unstructured text into typed objects.
  - Returns JSON that matches your business schema instead of free-form prose.
- Validation and guardrails
  - Verifies required fields, date formats, currency values, and jurisdiction-specific rules.
  - Blocks incomplete or low-confidence outputs from reaching core systems.
- Audit and persistence
  - Stores input document references, extracted payloads, model version, prompt version, and timestamps.
  - Gives compliance teams a traceable extraction record.
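In practice the audit record can be as simple as one typed row per extraction. A minimal sketch of that shape (the `AuditRecord` name and exact fields are illustrative, not a fixed standard):

```ts
// Illustrative shape for one extraction's audit trail.
// Adapt field names to your own persistence layer.
interface AuditRecord {
  documentUri: string;      // reference to the raw file in object storage
  extractedPayload: string; // JSON string of the validated extraction
  modelVersion: string;     // the pinned model identifier, e.g. "gpt-4o-mini"
  promptVersion: string;    // version tag for the prompt template
  extractedAt: string;      // ISO-8601 timestamp
}
```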
Implementation
1) Install dependencies and set up the model
Use the TypeScript package from LlamaIndex and point it at an OpenAI-compatible model. In production insurance systems, keep the model configuration explicit so you can pin versions for auditability.
```bash
npm install llamaindex zod
```

```ts
import { OpenAI } from "llamaindex";

// Pin the exact model and keep temperature at 0 so extractions are reproducible.
const llm = new OpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});
```
2) Define the insurance extraction schema
Keep the schema narrow. If you ask for too much in one pass, you get brittle results and more validation failures.
```ts
import { z } from "zod";

const ClaimExtractionSchema = z.object({
  claimNumber: z.string().min(1),
  policyNumber: z.string().min(1),
  claimantName: z.string().min(1),
  lossDate: z.string().min(1), // parse downstream into a real date
  lossLocation: z.string().optional(),
  coverageType: z.enum(["auto", "property", "liability", "life", "health", "other"]),
  estimatedLossAmount: z.number().optional(),
});

type ClaimExtraction = z.infer<typeof ClaimExtractionSchema>;
```
This is where insurance teams usually get disciplined. A typed schema forces consistency across claims intake, underwriting submissions, and broker documents.
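To see the schema earn its keep, run a malformed payload through it. A quick check like this (the sample values are made up) fails fast instead of letting a bad record through:

```ts
// A payload with a missing claim number and an invalid coverage type.
const suspect = {
  policyNumber: "POL-2219",
  claimantName: "Jane Doe",
  lossDate: "2024-03-14",
  coverageType: "marine", // not in the allowed enum
};

const check = ClaimExtractionSchema.safeParse(suspect);
if (!check.success) {
  // check.error.issues lists each failing field: claimNumber and coverageType here.
  console.error(check.error.issues);
}
```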
3) Load the document and extract structured data with LlamaIndex
Use SimpleDirectoryReader for local files during development. For PDFs in production you'll usually wrap this behind your own ingestion service that writes the raw file to object storage first. The sketch below asks the model to return JSON and then validates it against the Zod schema; if your llamaindex version ships a structured-output helper, you can swap that in for the manual JSON.parse step.
```ts
import { OpenAI, SimpleDirectoryReader } from "llamaindex";
import { z } from "zod";

const llm = new OpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const ClaimExtractionSchema = z.object({
  claimNumber: z.string().min(1),
  policyNumber: z.string().min(1),
  claimantName: z.string().min(1),
  lossDate: z.string().min(1),
  lossLocation: z.string().optional(),
  coverageType: z.enum(["auto", "property", "liability", "life", "health", "other"]),
  estimatedLossAmount: z.number().optional(),
});
type ClaimExtraction = z.infer<typeof ClaimExtractionSchema>;

async function main() {
  // Load every file in ./claims and concatenate the extracted page text.
  const reader = new SimpleDirectoryReader();
  const docs = await reader.loadData({ directoryPath: "./claims" });
  const text = docs.map((d) => d.text).join("\n\n");

  // Ask the model for JSON only. Keep prompts deterministic for repeatable audits.
  const response = await llm.complete({
    prompt:
      "Extract claim fields from this insurance document. " +
      "Respond with a single JSON object using the keys claimNumber, policyNumber, " +
      "claimantName, lossDate, lossLocation, coverageType, estimatedLossAmount, " +
      "and nothing else:\n\n" + text,
  });

  // Zod enforces the business schema before anything is persisted.
  // Note: some models wrap JSON in markdown fences; strip them before parsing if needed.
  const result: ClaimExtraction = ClaimExtractionSchema.parse(
    JSON.parse(response.text)
  );
  console.log(JSON.stringify(result, null, 2));
}

main().catch(console.error);
```
That pattern is the core of the agent. You are not asking the model to “summarize” a claim packet. You are forcing it to emit a typed object that your system can validate before persistence.
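You also want a fallback for when the model's output does not parse or fails schema validation, so low-confidence extractions land in a review queue instead of a claims system. A minimal sketch, reusing the schema from step 2 (routeToManualReview is a placeholder for your own queue integration):

```ts
// Placeholder: wire this to your own review-queue or case-management system.
function routeToManualReview(documentUri: string, rawOutput: string, reason: string) {
  console.warn(`Manual review needed for ${documentUri}: ${reason}`);
}

function parseOrEscalate(documentUri: string, rawOutput: string): ClaimExtraction | null {
  try {
    const parsed = ClaimExtractionSchema.safeParse(JSON.parse(rawOutput));
    if (parsed.success) return parsed.data;
    routeToManualReview(documentUri, rawOutput, parsed.error.message);
  } catch {
    routeToManualReview(documentUri, rawOutput, "model output was not valid JSON");
  }
  return null; // nothing reaches persistence unless it validated
}
```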
4) Add post-processing validation before writing to claims systems
Do not trust raw model output just because it parsed. Validate business rules separately so bad data never reaches Guidewire-style backends or internal case management tools.
```ts
function validateClaim(extraction: ClaimExtraction) {
  // Hard business rule: internal claim numbers use a CLM- prefix.
  if (!extraction.claimNumber.startsWith("CLM-")) {
    throw new Error("Invalid claim number format");
  }
}
```
A better pattern is to run domain checks after Zod validation, as sketched below:
- Loss date cannot be in the future.
- Policy number must exist in your policy admin system.
- Estimated loss amount must be non-negative.
- Coverage type must match the product line allowed in that jurisdiction.
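A sketch of those checks, assuming a policyExists lookup against your policy admin system and an allowedCoverage table keyed by jurisdiction (both are hypothetical stubs here, and ClaimExtraction comes from step 2):

```ts
// Hypothetical stubs: wire these to your policy admin system and product rules.
async function policyExists(policyNumber: string): Promise<boolean> {
  return policyNumber.startsWith("POL-"); // placeholder check
}
const allowedCoverage: Record<string, Set<string>> = {
  CA: new Set(["auto", "property", "liability"]),
};

async function runDomainChecks(extraction: ClaimExtraction, jurisdiction: string) {
  const errors: string[] = [];

  if (new Date(extraction.lossDate) > new Date()) {
    errors.push("Loss date is in the future");
  }
  if (!(await policyExists(extraction.policyNumber))) {
    errors.push("Policy number not found in policy admin system");
  }
  if (extraction.estimatedLossAmount !== undefined && extraction.estimatedLossAmount < 0) {
    errors.push("Estimated loss amount is negative");
  }
  if (!allowedCoverage[jurisdiction]?.has(extraction.coverageType)) {
    errors.push(`Coverage type not allowed in ${jurisdiction}`);
  }

  return errors; // an empty array means the extraction can proceed
}
```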
Production Considerations
- Data residency
  - Route EU or state-restricted documents to approved model regions only (see the sketch after this list).
  - Keep raw documents and extracted payloads in-region if your regulatory posture requires it.
- Auditability
  - Pin model and prompt versions and store them alongside each extraction, so compliance can trace how every field was produced.
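For residency routing, one approach is to pick the model endpoint per document jurisdiction before any text leaves your ingestion layer. A minimal sketch, assuming region-pinned OpenAI-compatible endpoints (the URLs and the mapping are illustrative, and passing baseURL via additionalSessionOptions is an assumption about your llamaindex version):

```ts
import { OpenAI } from "llamaindex";

// Illustrative mapping: substitute your approved, region-pinned endpoints.
const regionEndpoints: Record<string, string> = {
  EU: "https://llm.eu.internal.example/v1",
  US: "https://llm.us.internal.example/v1",
};

function llmForJurisdiction(jurisdiction: string): OpenAI {
  const baseURL = regionEndpoints[jurisdiction];
  if (!baseURL) {
    throw new Error(`No approved model region for jurisdiction ${jurisdiction}`);
  }
  // Assumption: additionalSessionOptions forwards client options such as baseURL.
  return new OpenAI({
    model: "gpt-4o-mini",
    temperature: 0,
    additionalSessionOptions: { baseURL },
  });
}
```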
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit