How to Build a Document Extraction Agent Using LangChain in Python for Insurance

By Cyprian Aarons · Updated 2026-04-21
document-extraction · langchain · python · insurance

A document extraction agent for insurance reads incoming PDFs, scans, policy forms, claim letters, medical bills, and loss reports, then turns them into structured JSON your downstream systems can use. It matters because insurance ops still run on documents, and the difference between a manual queue and an automated extraction flow is usually measured in turnaround time, error rate, and compliance risk.

Architecture

  • Document ingestion layer

    • Pulls files from S3, SharePoint, email attachments, or a claims intake API.
    • Normalizes file types before extraction.
  • Text extraction layer

    • Uses PyPDFLoader, UnstructuredFileLoader, or OCR-backed preprocessing for scanned documents.
    • Preserves page boundaries for auditability.
  • Schema layer

    • Defines the exact fields you need: claimant name, policy number, date of loss, coverage type, invoice totals, adjuster notes.
    • Keeps the model output aligned with insurance workflows.
  • LLM extraction chain

    • Uses LangChain’s ChatOpenAI with structured output.
    • Converts raw text into validated Python objects.
  • Validation and routing layer

    • Checks required fields, formats, and confidence thresholds.
    • Sends incomplete cases to human review.
  • Audit storage layer

    • Stores source document references, extracted JSON, model version, prompt version, and timestamps.
    • Supports compliance reviews and dispute resolution.
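
The six layers above can be sketched as one pipeline function. The stand-in implementations here are illustrative placeholders so the flow is runnable end to end; in production each would call your real ingestion, OCR, LLM, and storage code.

```python
from dataclasses import dataclass

@dataclass
class ExtractionRecord:
    source_uri: str    # where the document came from (S3 key, email ID, ...)
    raw_text: str      # output of the text extraction layer
    extracted: dict    # validated JSON from the LLM extraction chain
    needs_review: bool # set by the validation and routing layer

# Stand-in layer implementations (placeholders, not real integrations)
def ingest(source_uri: str) -> bytes:
    return b"POLICY NO: AB-123  CLAIMANT: Jane Doe"

def extract_text(raw: bytes) -> str:
    return raw.decode("utf-8")

def extract_fields(text: str) -> dict:
    return {"policy_number": "AB-123", "claimant_name": "Jane Doe"}

def validate(extracted: dict) -> bool:
    # Flag for review when a critical field is missing
    return not extracted.get("policy_number")

def store_audit(record: "ExtractionRecord") -> None:
    pass  # persist source reference, JSON, model/prompt versions, timestamps

def run_pipeline(source_uri: str) -> ExtractionRecord:
    raw_bytes = ingest(source_uri)        # ingestion layer
    raw_text = extract_text(raw_bytes)    # text extraction layer
    extracted = extract_fields(raw_text)  # LLM extraction chain
    flagged = validate(extracted)         # validation and routing layer
    record = ExtractionRecord(source_uri, raw_text, extracted, flagged)
    store_audit(record)                   # audit storage layer
    return record
```

Each placeholder maps one-to-one onto a layer in the list above, which keeps the real implementations independently testable and swappable.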

Implementation

1) Load the document and define the extraction schema

For insurance work, don’t ask the model for “a summary.” Ask for specific fields that match your downstream claims or underwriting system. Use Pydantic so LangChain can validate outputs before they hit your database.

from typing import Optional
from pydantic import BaseModel, Field

class InsuranceExtraction(BaseModel):
    policy_number: Optional[str] = Field(default=None, description="Policy number from the document")
    claimant_name: Optional[str] = Field(default=None, description="Name of insured or claimant")
    date_of_loss: Optional[str] = Field(default=None, description="Date of loss in ISO format if available")
    claim_number: Optional[str] = Field(default=None, description="Claim reference number")
    total_amount: Optional[float] = Field(default=None, description="Total billed or claimed amount")
    document_type: str = Field(description="Type of document such as claim letter, invoice, police report")
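
Because the schema is a Pydantic model, malformed payloads fail fast instead of landing in your database. A quick demonstration (using a trimmed copy of the schema so the snippet is self-contained):

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

# Abbreviated version of the InsuranceExtraction schema above
class InsuranceExtraction(BaseModel):
    policy_number: Optional[str] = Field(default=None)
    total_amount: Optional[float] = Field(default=None)
    document_type: str = Field(description="Type of document")

# A valid payload coerces cleanly: the numeric string becomes a float
ok = InsuranceExtraction.model_validate(
    {"policy_number": "POL-991", "total_amount": "1250.50", "document_type": "invoice"}
)
print(ok.total_amount)

# A payload missing the required document_type raises ValidationError
try:
    InsuranceExtraction.model_validate({"policy_number": "POL-991"})
except ValidationError as exc:
    print("rejected:", exc.error_count(), "error(s)")
```

This is exactly the behavior you want at the boundary: optional fields default to null, but a record with no `document_type` never reaches the claims system silently.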

2) Load text with LangChain loaders

For native PDFs use PyPDFLoader. For scanned files you’ll need OCR upstream; LangChain won’t magically read images inside a PDF without text extraction support.

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("sample_claim.pdf")
docs = loader.load()

full_text = "\n\n".join(
    f"Page {doc.metadata.get('page', 'unknown')}\n{doc.page_content}"
    for doc in docs
)

3) Build a structured extraction chain with ChatOpenAI

This is the core pattern. Use with_structured_output() so the model returns data matching your schema instead of free-form prose.

import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You extract structured data from insurance documents. "
     "Only return fields supported by the source text. "
     "If a field is missing, leave it null."),
    ("user", "Extract data from this document:\n\n{text}")
])

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    api_key=os.environ["OPENAI_API_KEY"]
)

structured_llm = llm.with_structured_output(InsuranceExtraction)

chain = prompt | structured_llm

result = chain.invoke({"text": full_text})
print(result.model_dump())

That pattern is production-friendly because it gives you typed output and reduces parsing failures. It also makes prompt changes easier to track during audits.

4) Add validation and human review routing

Insurance extraction should not trust every result equally. If critical fields are missing or ambiguous, route the case to an adjuster or operations queue.

def needs_review(extraction: InsuranceExtraction) -> bool:
    required_fields = ["policy_number", "claimant_name", "document_type"]
    missing_required = any(getattr(extraction, f) in (None, "") for f in required_fields)
    suspicious_amount = extraction.total_amount is not None and extraction.total_amount < 0
    return missing_required or suspicious_amount

if needs_review(result):
    print("Route to human review")
else:
    print("Send to downstream claims system")

Production Considerations

  • Data residency

    • Keep document processing in-region if your insurer has jurisdictional constraints.
    • Pin model endpoints and storage buckets to approved regions only.
  • Auditability

    • Store the original file hash, extracted text hash, prompt version, model name, and response payload.
    • You need this when a claimant disputes an extracted value.
  • Compliance controls

    • Redact sensitive data like SSNs or medical identifiers before sending text to an LLM where required.
    • Make sure access controls align with least privilege and internal retention rules.
  • Monitoring

    • Track field-level null rates, human-review rates, latency per document type, and schema validation failures.
    • Spikes usually mean OCR drift or a bad prompt change.
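
An audit record along these lines (field names are illustrative, not a fixed standard) covers the hashes, versions, and timestamps listed above using only the standard library:

```python
import hashlib
import json
from datetime import datetime, timezone

def build_audit_record(source_bytes: bytes, extracted_text: str,
                       extraction_json: dict, model_name: str,
                       prompt_version: str) -> dict:
    """Assemble the metadata needed to reproduce an extraction later."""
    return {
        "source_sha256": hashlib.sha256(source_bytes).hexdigest(),
        "text_sha256": hashlib.sha256(extracted_text.encode("utf-8")).hexdigest(),
        "model_name": model_name,
        "prompt_version": prompt_version,
        "response_payload": extraction_json,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }

record = build_audit_record(
    source_bytes=b"%PDF-1.7 ...",
    extracted_text="Policy POL-991 ...",
    extraction_json={"policy_number": "POL-991"},
    model_name="gpt-4o-mini",
    prompt_version="claims-extraction-v3",
)
print(json.dumps(record, indent=2))
```

Persist one such record per document; when a claimant disputes a value, the two hashes let you prove exactly which file and which extracted text the model saw.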

Common Pitfalls

  1. Using free-form LLM output

    • Don’t parse paragraphs with regex after the fact.
    • Use with_structured_output() plus Pydantic validation so malformed responses fail fast.
  2. Ignoring scanned-document quality

    • If half your intake is scanned PDFs or photos of faxed forms, plain PDF loaders will miss critical text.
    • Put OCR in front of LangChain and measure OCR confidence separately.
  3. Skipping insurance-specific field rules

    • A generic extractor will happily return “something” for every field.
    • Define strict rules for policy numbers, dates of loss, claim IDs, and monetary values so bad data doesn’t enter claims processing.
  4. No audit trail

    • If you cannot reproduce what the agent saw and what it returned at a given time, you will have problems during compliance reviews.
    • Persist source metadata alongside every extracted record.
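
Pitfall 3 is worth making concrete. A minimal set of field rules might look like this; the policy-number pattern and amount bound are hypothetical and should be replaced with your carrier's actual formats.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical carrier format: 2-4 uppercase letters, a dash, 6-10 digits
POLICY_RE = re.compile(r"^[A-Z]{2,4}-\d{6,10}$")

def valid_policy_number(value: Optional[str]) -> bool:
    return bool(value and POLICY_RE.fullmatch(value))

def valid_date_of_loss(value: Optional[str]) -> bool:
    if not value:
        return False
    try:
        parsed = date.fromisoformat(value)
    except ValueError:
        return False
    return parsed <= date.today()  # a loss date in the future is suspect

def valid_amount(value: Optional[float]) -> bool:
    # Sanity bound; tune per line of business
    return value is None or 0 <= value <= 10_000_000
```

Run these checks inside `needs_review()` so a record that "has something" in every field still gets routed to a human when that something does not match insurance reality.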

By Cyprian Aarons, AI Consultant at Topiax.