How to Build a Document Extraction Agent Using LlamaIndex in Python for Insurance

By Cyprian Aarons · Updated 2026-04-21

Tags: document-extraction, llamaindex, python, insurance

A document extraction agent for insurance takes messy inputs like claims forms, loss runs, policy schedules, ACORD PDFs, and broker emails, then turns them into structured fields your downstream systems can trust. That matters because insurance ops still run on documents, and the difference between a clean extraction pipeline and a brittle one shows up in cycle time, leakage, compliance risk, and manual review load.

Architecture

  • Document ingestion layer

    • Pull PDFs, scans, and email attachments from S3, SharePoint, SFTP, or a claims intake queue.
    • Normalize file metadata early: policy number, claim ID, source system, received timestamp.
  • Text extraction layer

    • Use OCR for scanned forms and native PDF parsing for digital files.
    • Keep page-level provenance so every extracted field can be traced back to source pages.
  • LlamaIndex parsing and indexing layer

    • Convert documents into Document objects.
    • Use SentenceSplitter or similar chunking when you need retrieval over long policies or endorsements.
    • Store embeddings in a vector index only if retrieval is needed; pure extraction can skip retrieval entirely.
  • Extraction agent layer

    • Use OpenAIPydanticProgram or LLMTextCompletionProgram style structured output to map text into a typed schema.
    • For insurance, make the schema explicit: claimant name, date of loss, covered peril, deductible, limit, reserve estimate.
  • Validation and audit layer

    • Validate outputs with Pydantic.
    • Persist raw model output, parsed JSON, source document hash, model version, prompt version, and human override history.
  • Human review queue

    • Route low-confidence or incomplete extractions to an adjuster or operations analyst.
    • Never auto-release fields that affect payment without validation rules.

Implementation

  1. Install dependencies and define the extraction schema

Use Pydantic to force the model into a predictable shape. In insurance workflows, schema drift is where bad data enters your core systems.

from typing import Optional
from pydantic import BaseModel, Field

class ClaimExtraction(BaseModel):
    claim_number: Optional[str] = Field(default=None, description="Claim identifier")
    policy_number: Optional[str] = Field(default=None, description="Insurance policy number")
    insured_name: Optional[str] = Field(default=None, description="Named insured")
    date_of_loss: Optional[str] = Field(default=None, description="Date of loss in YYYY-MM-DD format")
    loss_location: Optional[str] = Field(default=None, description="Location where loss occurred")
    peril: Optional[str] = Field(default=None, description="Cause of loss")
    deductible: Optional[str] = Field(default=None, description="Policy deductible amount")
    coverage_limit: Optional[str] = Field(default=None, description="Coverage limit amount")
    adjuster_notes: Optional[str] = Field(default=None, description="Relevant notes from the document")
  2. Load the document and create an extraction program

This pattern uses LlamaIndex’s Document object plus OpenAIPydanticProgram for structured extraction. The same approach works whether the input came from OCR or native PDF text.

from llama_index.core import Document
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.llms.openai import OpenAI

raw_text = """
ACORD CLAIM NOTICE
Claim Number: CLM-104928
Policy Number: P-8891201
Named Insured: Northwind Logistics LLC
Date of Loss: 2025-01-14
Loss Location: Dallas, TX
Cause of Loss: Water damage from burst pipe
Deductible: $5,000
Coverage Limit: $250,000
"""

doc = Document(
    text=raw_text,
    metadata={
        "source": "claims-intake-email",
        "claim_id": "INTAKE-7781",
        "jurisdiction": "US-TX"
    }
)

llm = OpenAI(model="gpt-4o-mini", temperature=0)

program = OpenAIPydanticProgram.from_defaults(
    output_cls=ClaimExtraction,
    llm=llm,
    prompt_template_str=(
        "Extract insurance claim fields from the text below.\n"
        "Return only values supported by the document.\n\n"
        "{input}"
    ),
)
  3. Run extraction and persist traceable results

Keep both the parsed object and the raw text around. In regulated environments you need replayability for audits and dispute handling.

result = program(input=doc.text)

print(result.model_dump())

audit_record = {
    "source": doc.metadata["source"],
    "claim_id": doc.metadata["claim_id"],
    "jurisdiction": doc.metadata["jurisdiction"],
    "extracted": result.model_dump(),
}
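To make the audit record replayable, hash the exact input text and pin the model and prompt versions alongside the extracted output. A minimal sketch, where the version strings are placeholders you would supply from your own config:

```python
import hashlib

def build_audit_record(raw_text: str, extracted: dict,
                       model_version: str, prompt_version: str) -> dict:
    # SHA-256 of the exact input text lets you prove later which bytes
    # the model saw when a disputed field was extracted.
    doc_hash = hashlib.sha256(raw_text.encode("utf-8")).hexdigest()
    return {
        "document_sha256": doc_hash,
        "model_version": model_version,
        "prompt_version": prompt_version,
        "extracted": extracted,
    }

record = build_audit_record(
    raw_text="Claim Number: CLM-104928",
    extracted={"claim_number": "CLM-104928"},
    model_version="gpt-4o-mini-2024-07-18",   # placeholder
    prompt_version="claims-extraction-v1",    # placeholder
)
print(record["document_sha256"][:12])
```

Because the hash is deterministic, re-running the same document through the same model and prompt versions should be fully reproducible during a dispute review.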

If you need retrieval over long documents like policy booklets or endorsements before extraction, use LlamaIndex indexing primitives first:

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents([doc])
query_engine = index.as_query_engine(similarity_top_k=3)

context_response = query_engine.query("Find deductible and coverage limit.")
print(str(context_response))
  4. Add deterministic validation before downstream writes

Insurance systems should not trust raw LLM output. Validate dates, currency formats, jurisdiction-specific rules, and required fields before posting to claims admin or policy admin systems.

from datetime import datetime

def validate_claim(extraction: ClaimExtraction) -> list[str]:
    errors = []

    if extraction.date_of_loss:
        try:
            datetime.strptime(extraction.date_of_loss, "%Y-%m-%d")
        except ValueError:
            errors.append("date_of_loss must be YYYY-MM-DD")

    if not extraction.policy_number:
        errors.append("policy_number is required")

    if not extraction.claim_number:
        errors.append("claim_number is required")

    return errors

errors = validate_claim(result)
if errors:
    print({"status": "review_required", "errors": errors})
else:
    print({"status": "approved_for_posting", "data": result.model_dump()})

Production Considerations

  • Data residency

    • Keep PHI/PII-bearing documents in-region if your insurer operates under local residency rules.
    • Pin model endpoints and storage buckets to approved regions.
  • Auditability

    • Store prompt version, model name, document hash, extracted JSON, validation results, and reviewer overrides.
    • This is non-negotiable when a claimant disputes how a field was derived.
  • Guardrails

    • Block auto-posting when key fields are missing or contradictory.
    • Use allowlists for output fields; do not let the model invent new ones that your downstream systems do not expect.
  • Monitoring

    • Track field-level accuracy by document type: FNOLs behave differently from repair estimates or medical bills.
    • Watch for OCR degradation on scans because bad text input looks like model failure unless you measure it separately.
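The allowlist guardrail mentioned above can be enforced with a few lines before any write-back. A minimal sketch, assuming the field names from the `ClaimExtraction` schema defined earlier:

```python
# Fields the downstream claims-admin schema expects; anything else is dropped.
ALLOWED_FIELDS = {
    "claim_number", "policy_number", "insured_name", "date_of_loss",
    "loss_location", "peril", "deductible", "coverage_limit", "adjuster_notes",
}

def enforce_allowlist(extracted: dict) -> dict:
    """Drop any field the downstream systems do not expect."""
    unexpected = set(extracted) - ALLOWED_FIELDS
    if unexpected:
        # Surface rather than silently drop: unexpected keys usually mean
        # prompt drift or a model/schema version mismatch worth alerting on.
        print({"warning": "dropped_unexpected_fields",
               "fields": sorted(unexpected)})
    return {k: v for k, v in extracted.items() if k in ALLOWED_FIELDS}

safe = enforce_allowlist({"claim_number": "CLM-104928",
                          "llm_reasoning": "the model added this"})
print(sorted(safe))
```

With a typed Pydantic schema this is partly redundant, but it is cheap insurance at the boundary where extracted dicts meet systems you do not control.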

Common Pitfalls

  1. Treating OCR text as clean input

    • Scanned insurance docs often contain broken line breaks and merged fields.
    • Fix this by normalizing OCR output before passing it into LlamaIndex programs.
  2. Using free-form generation instead of typed schemas

    • If you ask the model to “extract key details,” you will get inconsistent shapes across claims.
    • Use Pydantic models with OpenAIPydanticProgram so every response matches a contract.
  3. Skipping provenance

    • If you cannot trace an extracted deductible back to page 3 of an endorsement packet during an audit review, the pipeline is not production-ready.
    • Persist source metadata at document and page level from day one.
  4. Auto-trusting low-confidence fields

    • A wrong policy number can route a claim to the wrong account; a wrong date of loss can break coverage analysis.
    • Add rule-based checks plus human review thresholds before any write-back to core insurance systems.
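Pitfall 1 above recommends normalizing OCR output before extraction. A light-touch sketch using heuristics only; real pipelines usually need document-type-specific rules on top of this:

```python
import re

def normalize_ocr_text(raw: str) -> str:
    """Basic cleanup for OCR output before extraction.

    Heuristics: rejoin words hyphenated across line breaks, collapse
    runs of spaces within lines, and drop blank lines left by scans.
    """
    # Rejoin hyphenated line breaks: "Deduct-\nible" -> "Deductible"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse runs of spaces/tabs within lines.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop blank lines left by scanner artifacts.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return "\n".join(lines)

sample = "Deduct-\nible:   $5,000\n\n\nCause of  Loss: Water damage"
print(normalize_ocr_text(sample))
```

Feeding the normalized text, rather than the raw OCR dump, into the LlamaIndex program keeps model failures distinguishable from input failures when you measure field-level accuracy.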

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

