How to Build a Document Extraction Agent Using AutoGen in TypeScript for Insurance
A document extraction agent for insurance takes messy inputs like FNOL forms, claim letters, policy PDFs, medical reports, and repair estimates, then turns them into structured fields your downstream systems can trust. That matters because insurance operations live and die on turnaround time, auditability, and consistency; if extraction is slow or inconsistent, claims queues back up and compliance risk goes up.
Architecture
- Document ingestion layer
  - Accept PDFs, scanned images, email attachments, and OCR text.
  - Normalize file metadata such as source system, policy number, claim ID, and jurisdiction.
- OCR / text preprocessing service
  - Convert images and scanned PDFs into text before the agent sees them.
  - Preserve page numbers and bounding boxes when available for traceability.
- AutoGen extraction agent
  - Use an AssistantAgent to extract structured insurance fields from the text.
  - Force output into a strict JSON schema for downstream validation.
- Validation and rules engine
  - Check required fields like claimant name, loss date, policy number, and coverage type.
  - Enforce insurance-specific rules such as date ordering, currency format, and jurisdiction constraints.
- Human review queue
  - Route low-confidence or incomplete extractions to an adjuster or operations analyst.
  - Store the original text plus model output for audit.
- Persistence and audit store
  - Save raw documents, extracted JSON, validation errors, model version, prompt version, and timestamps.
  - Keep data residency boundaries aligned with your insurance region requirements.
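The validation and rules engine can be sketched independently of the agent. Here is a minimal example of the date-ordering and currency checks described above; the field names anticipate the extraction schema defined in the implementation section, and the rule set is illustrative, not exhaustive:

```typescript
// Minimal sketch of insurance-specific validation rules.
// Dates are assumed to already be normalized to ISO 8601, so
// lexicographic string comparison gives chronological order.
interface ExtractedClaim {
  policyNumber?: string;
  lossDate?: string;   // ISO 8601 date
  reportDate?: string; // ISO 8601 date
  currency?: string;   // ISO 4217 code
}

function validateClaim(doc: ExtractedClaim): string[] {
  const errors: string[] = [];
  if (!doc.policyNumber) errors.push("policyNumber is required");
  // Date ordering: the loss must not post-date the report.
  if (doc.lossDate && doc.reportDate && doc.lossDate > doc.reportDate) {
    errors.push("lossDate is after reportDate");
  }
  // Currency format: three uppercase letters (ISO 4217).
  if (doc.currency && !/^[A-Z]{3}$/.test(doc.currency)) {
    errors.push(`unsupported currency format: ${doc.currency}`);
  }
  return errors;
}
```

Returning a list of errors rather than throwing lets the orchestration layer attach all failures to the review ticket at once.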
Implementation
1) Install dependencies and define the extraction schema
Use AutoGen’s TypeScript package with a schema-first approach. In insurance workflows, you want deterministic output fields rather than free-form prose.
```bash
npm install @autogenai/autogen zod
```

```typescript
import { z } from "zod";

export const InsuranceExtractionSchema = z.object({
  claimId: z.string().optional(),
  policyNumber: z.string().optional(),
  claimantName: z.string().optional(),
  insuredName: z.string().optional(),
  lossDate: z.string().optional(),
  reportDate: z.string().optional(),
  lossType: z.enum(["auto", "property", "health", "life", "workers_comp", "other"]).optional(),
  amountClaimed: z.number().optional(),
  currency: z.string().default("USD"),
  jurisdiction: z.string().optional(),
  confidenceNotes: z.array(z.string()).default([]),
});

export type InsuranceExtraction = z.infer<typeof InsuranceExtractionSchema>;
```
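The schema keeps lossDate and reportDate as strings, but OCR text rarely arrives in ISO 8601. A normalization pass before validation helps; this sketch assumes US-style MM/DD/YYYY inputs and is not part of AutoGen or Zod:

```typescript
// Normalize common US date formats (MM/DD/YYYY or M-D-YYYY) to ISO 8601.
// Returns undefined when the input does not match, so callers can route
// the document to manual review instead of guessing.
function normalizeUsDate(raw: string): string | undefined {
  const m = raw.trim().match(/^(\d{1,2})[\/\-](\d{1,2})[\/\-](\d{4})$/);
  if (!m) return undefined;
  const [, mm, dd, yyyy] = m;
  const month = Number(mm);
  const day = Number(dd);
  if (month < 1 || month > 12 || day < 1 || day > 31) return undefined;
  return `${yyyy}-${String(month).padStart(2, "0")}-${String(day).padStart(2, "0")}`;
}
```

Returning undefined on anything unrecognized is deliberate: a silently mis-parsed loss date is worse for a claims record than a field sent to human review.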
2) Create an AutoGen assistant that extracts only structured data
AutoGen’s AssistantAgent is the core worker here. The trick is to constrain behavior with a system message that tells the model exactly what to return.
```typescript
import { AssistantAgent } from "@autogenai/autogen";
import { InsuranceExtractionSchema } from "./schema";

const extractor = new AssistantAgent({
  name: "insurance_extractor",
  modelClient: {
    model: "gpt-4o-mini",
    apiKey: process.env.OPENAI_API_KEY!,
    temperature: 0,
  },
  systemMessage: `
You extract insurance document fields from raw text.
Rules:
- Return only valid JSON.
- Do not invent values.
- If a field is missing, omit it entirely. Do not set it to null; the schema treats missing fields as optional, and null values will fail validation.
- Preserve exact policy numbers and claim IDs when present.
- Add confidenceNotes for ambiguous or conflicting values.
- Never include commentary outside JSON.
`,
});
```
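Even with a JSON-only system message, models sometimes wrap the payload in markdown fences or add stray prose. A defensive helper (a sketch, not an AutoGen API) can recover the object before parsing:

```typescript
// Strip markdown fences and any surrounding prose, then parse the
// first top-level JSON object found in a model reply.
function extractJsonObject(reply: string): unknown {
  const fenced = reply.match(/```(?:json)?\s*([\s\S]*?)```/);
  const candidate = (fenced ? fenced[1] : reply).trim();
  const start = candidate.indexOf("{");
  const end = candidate.lastIndexOf("}");
  if (start === -1 || end === -1 || end < start) {
    throw new Error("No JSON object found in model reply");
  }
  return JSON.parse(candidate.slice(start, end + 1));
}
```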
3) Run the agent on OCR text and validate the result
The common pattern is: preprocess document text first, send it as a user message to AssistantAgent, then parse and validate the response before writing anything downstream.
```typescript
async function extractInsuranceFields(documentText: string) {
  const result = await extractor.run([
    {
      role: "user",
      content: `
Extract structured data from this insurance document.

Document:
${documentText}
`,
    },
  ]);

  const raw = result.messages[result.messages.length - 1]?.content;
  if (typeof raw !== "string") {
    throw new Error("Extractor returned non-text content");
  }

  const parsed = JSON.parse(raw);
  return InsuranceExtractionSchema.parse(parsed);
}
```
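A malformed model reply will make JSON.parse or the schema validation throw. A bounded retry around the extraction call keeps transient failures from dropping the document; the attempt count here is a tuning assumption, not a library default:

```typescript
// Retry an async operation a bounded number of times before giving up.
// Useful around extractInsuranceFields, where a malformed model reply
// is often transient at temperature 0.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

You would call it as `const fields = await withRetry(() => extractInsuranceFields(text));` so that the final failure, if any, still surfaces to the caller.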
4) Add an orchestration layer for review routing
In production you do not want every extraction to auto-post into claims systems. Route uncertain outputs to humans using simple confidence rules tied to missing critical fields.
```typescript
type ReviewDecision = "auto_accept" | "manual_review";

function needsReview(doc: Awaited<ReturnType<typeof extractInsuranceFields>>): ReviewDecision {
  const criticalMissing =
    !doc.policyNumber ||
    !doc.claimantName ||
    !doc.lossDate ||
    !doc.lossType;
  const hasAmbiguity = doc.confidenceNotes.length > 0;
  return criticalMissing || hasAmbiguity ? "manual_review" : "auto_accept";
}

async function processDocument(documentText: string) {
  const extracted = await extractInsuranceFields(documentText);
  const decision = needsReview(extracted);

  return {
    decision,
    extracted,
    audit: {
      model: "gpt-4o-mini",
      agent: "insurance_extractor",
      timestamp: new Date().toISOString(),
    },
  };
}
```
Production Considerations
- Data residency
  - Keep OCR text, prompts, outputs, and logs inside the required region for your line of business.
  - If you operate across countries, partition storage by jurisdiction instead of centralizing everything in one bucket.
- Audit trail
  - Persist raw input text alongside extracted JSON and validation results.
  - Store prompt version, model version, operator review action, and final approved values.
- Guardrails
  - Reject outputs that fail schema validation or contain unsupported fields.
- Monitoring
  - Track extraction accuracy by document type: FNOL forms, repair estimates, medical invoices, and police reports.
  - Alert on spikes in manual review rate or missing critical fields; that usually means OCR quality dropped or the upstream template changed.
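The review-rate alert above can be prototyped with simple counters. In production these would live in your metrics backend; the 30% threshold and minimum sample size here are assumptions to tune against your own baseline:

```typescript
// Track manual-review rate per document type and flag spikes.
type DocType = "fnol" | "repair_estimate" | "medical_invoice" | "police_report";

const counts = new Map<DocType, { total: number; manual: number }>();

function record(docType: DocType, manualReview: boolean): void {
  const c = counts.get(docType) ?? { total: 0, manual: 0 };
  c.total++;
  if (manualReview) c.manual++;
  counts.set(docType, c);
}

// Alert when the manual-review rate exceeds the threshold over a
// minimum sample, which usually signals an OCR or template regression.
function shouldAlert(docType: DocType, threshold = 0.3, minSample = 20): boolean {
  const c = counts.get(docType);
  if (!c || c.total < minSample) return false;
  return c.manual / c.total > threshold;
}
```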
Common Pitfalls
- Letting the model free-write instead of enforcing structure
  - Fix it by requiring JSON-only output and validating with Zod before any persistence step.
  - In insurance workflows, unstructured answers create bad claims records fast.
- Skipping document provenance
  - Fix it by storing the source file hash, page count, OCR engine version, and extraction timestamp.
  - When auditors ask why a field was set a certain way, you need traceability back to the original page.
- Treating all extractions as equally trustworthy
  - Fix it by routing incomplete or ambiguous documents to human review.
  - A missing policy number on a life claim is not a minor issue; it blocks downstream adjudication and compliance checks.
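The provenance fix can be sketched with Node's built-in crypto module; the record's field names mirror the audit list above and are assumptions about your store's schema:

```typescript
import { createHash } from "node:crypto";

// Build a provenance record so every extracted field can be traced
// back to the exact input bytes and pipeline versions.
interface ProvenanceRecord {
  sourceFileSha256: string;
  pageCount: number;
  ocrEngineVersion: string;
  extractedAt: string; // ISO 8601 timestamp
}

function buildProvenance(
  fileBytes: Buffer,
  pageCount: number,
  ocrEngineVersion: string
): ProvenanceRecord {
  return {
    sourceFileSha256: createHash("sha256").update(fileBytes).digest("hex"),
    pageCount,
    ocrEngineVersion,
    extractedAt: new Date().toISOString(),
  };
}
```

Hashing the raw bytes, rather than the OCR text, means a re-run with a newer OCR engine still points back to the same source document.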
Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist + starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.