How to Build a Document Extraction Agent for Insurance Using LangGraph in Python

By Cyprian Aarons · Updated 2026-04-21
document-extraction · langgraph · python · insurance

A document extraction agent for insurance takes inbound PDFs, scans, emails, and claim forms, then turns them into structured data you can trust: policy numbers, claimant details, loss dates, coverage types, reserve signals, and missing-field flags. It matters because most insurance workflows still choke on unstructured documents, and every manual handoff adds delay, cost, and compliance risk.

Architecture

  • Document ingestion layer
    • Accepts PDFs, images, or OCR text from claims intake, underwriting submissions, or broker packets.
  • Extraction node
    • Uses an LLM to extract a strict JSON payload from the document content.
  • Validation node
    • Checks schema completeness, field formats, and business rules like date logic or policy number patterns.
  • Exception routing
    • Sends low-confidence or incomplete extractions to human review.
  • Audit trail store
    • Persists raw input, extracted output, validation results, and model version for regulatory review.
  • Orchestration graph
    • Coordinates the flow using LangGraph so each step is explicit and observable.

Implementation

1) Define the extraction schema

For insurance, do not extract free-form text unless you have to. Use a typed schema so downstream systems can rely on the output.

from typing import Optional
from pydantic import BaseModel, Field

class InsuranceDocument(BaseModel):
    policy_number: str = Field(..., description="Insurance policy number")
    claimant_name: str = Field(..., description="Name of claimant or insured")
    loss_date: str = Field(..., description="Date of loss in ISO format YYYY-MM-DD")
    claim_type: Optional[str] = Field(None, description="Type of claim such as auto, property, liability")
    amount_claimed: Optional[float] = Field(None, description="Claimed amount if present")
    confidence: float = Field(..., ge=0.0, le=1.0)
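A quick sanity check shows the schema doing its job before any LLM is involved. The sample values below are illustrative:

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class InsuranceDocument(BaseModel):
    policy_number: str = Field(..., description="Insurance policy number")
    claimant_name: str = Field(..., description="Name of claimant or insured")
    loss_date: str = Field(..., description="Date of loss in ISO format YYYY-MM-DD")
    claim_type: Optional[str] = Field(None, description="Type of claim such as auto, property, liability")
    amount_claimed: Optional[float] = Field(None, description="Claimed amount if present")
    confidence: float = Field(..., ge=0.0, le=1.0)

# A well-formed payload parses cleanly; optional fields default to None.
doc = InsuranceDocument(
    policy_number="POL-882193",
    claimant_name="Jordan Smith",
    loss_date="2025-01-14",
    confidence=0.92,
)
print(doc.claim_type)  # None

# An out-of-range confidence is rejected before it can reach downstream systems.
try:
    InsuranceDocument(
        policy_number="POL-882193",
        claimant_name="Jordan Smith",
        loss_date="2025-01-14",
        confidence=1.7,
    )
except ValidationError as exc:
    print("rejected:", exc.error_count(), "error(s)")
```

The `ge`/`le` constraints on confidence mean bad scores fail loudly at parse time instead of silently flowing into reserve calculations.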

2) Build the LangGraph workflow

This pattern uses StateGraph, add_node, add_edge, and add_conditional_edges. The extraction node calls an LLM through LangChain’s ChatOpenAI, while the validator decides whether to continue or route to review.

from typing import TypedDict
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
import json

class ExtractionState(TypedDict):
    document_text: str
    extracted: dict
    validated: bool
    needs_review: bool

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def extract_node(state: ExtractionState):
    prompt = f"""
Extract insurance fields from this document and return only valid JSON:
{state['document_text']}
"""
    response = llm.invoke([HumanMessage(content=prompt)])
    # Models sometimes wrap JSON in Markdown fences; strip them before parsing.
    raw = response.content.strip()
    if raw.startswith("```"):
        raw = raw.strip("`").removeprefix("json").strip()
    data = json.loads(raw)
    return {"extracted": data}

def validate_node(state: ExtractionState):
    extracted = state["extracted"]
    required_fields = ["policy_number", "claimant_name", "loss_date", "confidence"]
    missing = [f for f in required_fields if f not in extracted or extracted[f] in ("", None)]
    needs_review = len(missing) > 0 or extracted.get("confidence", 0) < 0.85
    return {"validated": not needs_review, "needs_review": needs_review}

def route_after_validation(state: ExtractionState):
    return "review" if state["needs_review"] else END

def review_node(state: ExtractionState):
    # Human-in-the-loop placeholder; persist to queue in production
    print("Send to adjuster review:", state["extracted"])
    return {}

graph = StateGraph(ExtractionState)
graph.add_node("extract", extract_node)
graph.add_node("validate", validate_node)
graph.add_node("review", review_node)

graph.set_entry_point("extract")
graph.add_edge("extract", "validate")
graph.add_conditional_edges("validate", route_after_validation, {
    "review": "review",
    END: END,
})
graph.add_edge("review", END)

app = graph.compile()

3) Run the agent on a document

This is where your ingestion pipeline hands off OCR text from a scanned claim form or emailed attachment.

sample_doc = {
    "document_text": """
ACME Insurance Claim Form
Policy Number: POL-882193
Claimant Name: Jordan Smith
Date of Loss: 2025-01-14
Claim Type: Auto
Amount Claimed: $4200
Confidence indicators suggest high accuracy.
"""
}

result = app.invoke(sample_doc)
print(result)

4) Add persistence and audit logging

Insurance teams need traceability. At minimum, log the source document hash, model name, extracted fields, validation outcome, and reviewer action. If you enable LangGraph’s checkpointing in a real deployment, with MemorySaver or a durable checkpointer backend, you can replay state transitions during audits and incident investigations.
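A minimal sketch of such an audit record, assuming a hypothetical `build_audit_record` helper; the field names are illustrative, not a fixed standard:

```python
import hashlib
import json
from datetime import datetime, timezone

def build_audit_record(raw_text: str, extracted: dict, validated: bool,
                       model_name: str, reviewer_action=None) -> dict:
    """Assemble one immutable audit record for a single extraction run."""
    return {
        # Hash of the source document proves which input produced this output.
        "source_sha256": hashlib.sha256(raw_text.encode("utf-8")).hexdigest(),
        "model_name": model_name,
        "extracted": extracted,
        "validated": validated,
        "reviewer_action": reviewer_action,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }

record = build_audit_record(
    raw_text="Policy Number: POL-882193 ...",
    extracted={"policy_number": "POL-882193"},
    validated=True,
    model_name="gpt-4o-mini",
)
print(json.dumps(record, indent=2))
```

In production you would persist this record to an append-only store, and compile the graph with a checkpointer, e.g. `graph.compile(checkpointer=MemorySaver())`, so state transitions are replayable.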

Production Considerations

  • Data residency
    • Keep OCR text and extracted payloads inside approved regions. For regulated carriers and TPAs, do not send PII across jurisdictions without explicit controls.
  • Compliance logging
    • Store immutable audit records with timestamps, model version, prompt template version, and reviewer decisions. This is what legal and compliance teams will ask for after a disputed claim.
  • Guardrails on extraction
    • Reject outputs that violate schema constraints or contain unsupported values like malformed dates or negative claim amounts. Route those cases to human review instead of auto-posting into core systems.
  • Monitoring
    • Track extraction accuracy by document type: FNOL forms behave differently from medical bills or proof-of-loss packets. Watch confidence drift by carrier line of business and vendor OCR quality.
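The guardrails point above can be sketched as a pre-posting check. The specific rules and reason strings here are illustrative, not a complete rule set:

```python
from datetime import date

def guardrail_check(extracted: dict) -> list:
    """Return a list of violations; an empty list means safe to auto-post."""
    violations = []

    # Loss date must parse as ISO YYYY-MM-DD and must not be in the future.
    try:
        loss = date.fromisoformat(extracted.get("loss_date", ""))
        if loss > date.today():
            violations.append("loss_date is in the future")
    except ValueError:
        violations.append("loss_date is not a valid ISO date")

    # Claimed amounts must be non-negative when present.
    amount = extracted.get("amount_claimed")
    if amount is not None and amount < 0:
        violations.append("amount_claimed is negative")

    return violations

print(guardrail_check({"loss_date": "2025-01-14", "amount_claimed": -100.0}))
```

Any non-empty result should route the document to the review queue rather than into core systems.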

Common Pitfalls

  • Using raw LLM text without schema validation
    • This creates brittle integrations. Always parse into a Pydantic model or strict dict contract before writing to downstream systems.
  • Skipping human review for low-confidence cases
    • In insurance, one bad field can trigger wrong reserves or incorrect coverage decisions. Route uncertain documents to an adjuster queue.
  • Ignoring document provenance
    • If you cannot prove where the data came from and which model produced it, audits get messy fast. Persist source metadata alongside every extraction.

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

