How to Build a Document Extraction Agent Using CrewAI in Python for Insurance

By Cyprian Aarons · Updated 2026-04-21
document-extraction · crewai · python · insurance

A document extraction agent for insurance takes messy policy PDFs, claims forms, loss runs, and supporting evidence, then turns them into structured data your downstream systems can trust. The point is not just OCR; it is extracting the right fields, validating them against business rules, and producing an auditable result that can flow into underwriting, claims triage, or policy administration.

Architecture

  • Input layer

    • Accepts PDFs, scans, emails, or uploaded images.
    • Normalizes files into text-friendly artifacts before extraction.
  • Extraction agent

    • Uses an LLM-backed Agent to read the document and extract target fields.
    • Focuses on insurance entities like policy number, named insured, dates of loss, limits, deductibles, and claim status.
  • Task definitions

    • Each Task maps to one extraction objective.
    • Keep tasks narrow so outputs are easier to validate and audit.
  • Crew orchestration

    • A Crew coordinates the agent and tasks.
    • This gives you a clean execution boundary for retries, logging, and handoff to downstream services.
  • Validation layer

    • Parses the agent output into structured JSON.
    • Enforces schema rules and rejects incomplete or inconsistent results.
  • Audit and storage

    • Persists raw input references, extracted output, model version, timestamps, and reviewer overrides.
    • Required for compliance and dispute handling in insurance workflows.
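
Before the step-by-step build, here is a minimal control-flow skeleton showing how these layers connect. The function names are placeholders of my choosing, and the bodies are intentionally stubbed; the numbered implementation steps below fill them in.

def ingest(path: str) -> str:
    """Input layer: normalize a PDF/scan/email into plain text."""
    raise NotImplementedError

def run_extraction(text: str) -> str:
    """Extraction agent: run the Crew and return raw model output."""
    raise NotImplementedError

def validate(raw: str) -> dict:
    """Validation layer: parse JSON, enforce the schema, fail closed."""
    raise NotImplementedError

def process_document(path: str) -> dict:
    record = validate(run_extraction(ingest(path)))
    # Audit/storage layer: persist the source reference, raw output,
    # model version, and timestamps alongside the validated record.
    return record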

Implementation

1. Install dependencies

Use CrewAI plus a document loader or PDF parser. For production insurance workflows, I prefer keeping extraction logic separate from file ingestion so you can swap OCR providers later.

pip install crewai crewai-tools pydantic pypdf

2. Define a strict schema for insurance extraction

Do not let the model invent fields. Use a Pydantic model to constrain the output you expect from the agent.

from pydantic import BaseModel, Field
from typing import Optional

class InsuranceDocumentExtraction(BaseModel):
    document_type: str = Field(..., description="Type of document such as claim form or policy declaration")
    policy_number: Optional[str] = Field(None, description="Insurance policy number")
    claim_number: Optional[str] = Field(None, description="Claim identifier")
    named_insured: Optional[str] = Field(None, description="Insured party name")
    date_of_loss: Optional[str] = Field(None, description="Date of loss in ISO format if available")
    effective_date: Optional[str] = Field(None, description="Policy effective date in ISO format if available")
    expiration_date: Optional[str] = Field(None, description="Policy expiration date in ISO format if available")
    carrier_name: Optional[str] = Field(None, description="Insurance carrier name")
    confidence_notes: str = Field(..., description="Short notes on ambiguity or missing data")
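
A quick way to confirm the schema fails closed: instantiating it without the required fields raises a ValidationError. This is a throwaway smoke test, not part of the pipeline.

from pydantic import ValidationError

try:
    InsuranceDocumentExtraction(policy_number="POL-123")  # missing required fields
except ValidationError as err:
    print(err)  # reports document_type and confidence_notes as missing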

3. Build the CrewAI agent and task

This is the core pattern. The agent reads the document text and returns structured extraction aligned to the schema.

from crewai import Agent, Task, Crew
from pypdf import PdfReader

def extract_text_from_pdf(path: str) -> str:
    # pypdf reads embedded text only; scanned images come back empty and
    # need OCR upstream (see the preprocessing note under Common Pitfalls).
    reader = PdfReader(path)
    pages = []
    for page in reader.pages:
        pages.append(page.extract_text() or "")
    return "\n".join(pages)

document_text = extract_text_from_pdf("sample_claim.pdf")

extraction_agent = Agent(
    role="Insurance Document Extraction Specialist",
    goal="Extract structured insurance fields from documents with high accuracy and low hallucination",
    backstory=(
        "You work on insurance operations systems. "
        "You only extract fields explicitly supported by the source text. "
        "When data is missing or unclear, you mark it as null and explain why."
    ),
    verbose=True,
)

extraction_task = Task(
    description=(
        "Read the following insurance document text and extract the required fields. "
        "Return only data supported by the text.\n\n"
        f"DOCUMENT TEXT:\n{document_text}"
    ),
    expected_output=(
        "A JSON object matching this schema: "
        "{document_type, policy_number, claim_number, named_insured, "
        "date_of_loss, effective_date, expiration_date, carrier_name, confidence_notes}"
    ),
    agent=extraction_agent,
)

crew = Crew(
    agents=[extraction_agent],
    tasks=[extraction_task],
    verbose=True,
)

result = crew.kickoff()
print(result)
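
One assumption worth stating explicitly: the agent needs a configured LLM behind it. Recent CrewAI releases resolve the model and provider credentials from the environment by default, so set your key before kickoff. A minimal sketch, assuming an OpenAI-backed default:

import os

# Assumption: OpenAI-backed default model. Inject the key from a secret
# manager in production instead of hardcoding it.
os.environ.setdefault("OPENAI_API_KEY", "sk-...")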

4. Parse and validate the output before saving

In insurance systems you should never write raw LLM output directly into your core system of record. Parse it first and fail closed if validation fails.

import json
from pydantic import ValidationError

raw_output = str(result)

# Models sometimes wrap JSON in markdown fences; strip them defensively
# so well-formed payloads are not rejected over formatting.
if raw_output.startswith("```"):
    raw_output = raw_output.strip("`").removeprefix("json").strip()

try:
    parsed = json.loads(raw_output)
    extracted = InsuranceDocumentExtraction(**parsed)
except (json.JSONDecodeError, ValidationError) as e:
    raise RuntimeError(f"Extraction validation failed: {e}") from e

print(extracted.model_dump())
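
Because validation fails closed, it pairs naturally with a retry boundary around the Crew run. A minimal sketch, reusing the objects defined above; the attempt count and the logging hook are placeholders to adapt to your stack.

def run_validated_extraction(max_attempts: int = 2) -> InsuranceDocumentExtraction:
    last_error: Exception | None = None
    for attempt in range(1, max_attempts + 1):
        raw = str(crew.kickoff())
        # Same defensive fence-stripping as above.
        if raw.startswith("```"):
            raw = raw.strip("`").removeprefix("json").strip()
        try:
            return InsuranceDocumentExtraction(**json.loads(raw))
        except (json.JSONDecodeError, ValidationError) as e:
            last_error = e  # log the attempt, raw output, and model version here
    raise RuntimeError(f"Extraction failed after {max_attempts} attempts: {last_error}")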

Production Considerations

  • Compliance

    • Store source document IDs, extracted output, model version, prompt version, and operator overrides.
    • That audit trail matters for claims disputes and regulatory review.
  • Data residency

    • Keep processing inside approved regions if documents contain PHI/PII or regulated claims data.
    • If your insurer operates across jurisdictions, route documents by tenant or region before calling any model endpoint.
  • Monitoring

    • Track extraction accuracy by field type: policy number accuracy is not the same as date-of-loss accuracy.
    • Add metrics for validation failures, null-field rates, retry counts, and human review overrides.
  • Guardrails

    • Reject outputs that contain unsupported values like invented claim numbers or dates not present in source text.
    • Use deterministic post-processing rules for formats such as ISO dates and policy number patterns (a sketch follows this list).
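
Here is a minimal guardrails sketch along those lines. The policy number pattern is an illustrative assumption; replace it with the formats your carriers actually use.

import re
from datetime import date

# Assumed format for illustration only; real policy numbers vary by carrier.
POLICY_NUMBER_RE = re.compile(r"^[A-Z0-9-]{6,20}$")

def enforce_guardrails(extracted: InsuranceDocumentExtraction) -> InsuranceDocumentExtraction:
    """Deterministic checks applied after schema validation. Fail closed."""
    if extracted.policy_number and not POLICY_NUMBER_RE.match(extracted.policy_number):
        raise ValueError(f"Policy number fails format check: {extracted.policy_number}")
    for field_name in ("date_of_loss", "effective_date", "expiration_date"):
        value = getattr(extracted, field_name)
        if value is not None:
            date.fromisoformat(value)  # raises ValueError on non-ISO dates
    return extracted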

Common Pitfalls

  • Using one giant prompt for every document type

    • Claim forms and declarations pages are different problems.
    • Split by document class first, then use a dedicated task per class (see the routing sketch after this list).
  • Trusting free-form model output

    • If you do not validate with a schema like Pydantic’s BaseModel, bad data will leak into downstream systems.
    • Always parse before persistence.
  • Ignoring scan quality

    • Bad OCR destroys extraction quality faster than prompt tuning can fix it.
    • Add preprocessing for rotation correction, DPI normalization, and fallback OCR before CrewAI sees the text.
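
A minimal routing sketch for the first pitfall. The keyword classifier is a stand-in for a cheap model call or your own rules, and the per-class prompts are abbreviated examples; adjust both to the fields each document type actually carries.

from crewai import Task

def classify_document(text: str) -> str:
    """Coarse keyword-based classifier; a deliberate stand-in for a real one."""
    lowered = text.lower()
    if "date of loss" in lowered or "claimant" in lowered:
        return "claim_form"
    if "declarations" in lowered or "premium" in lowered:
        return "declarations_page"
    return "unknown"

def build_task(doc_class: str, text: str) -> Task:
    prompts = {
        "claim_form": "Extract claim fields: claim_number, date_of_loss, named_insured.",
        "declarations_page": "Extract policy fields: policy_number, effective_date, expiration_date.",
    }
    if doc_class not in prompts:
        raise ValueError(f"Unsupported document class: {doc_class}")
    return Task(
        description=f"{prompts[doc_class]}\n\nDOCUMENT TEXT:\n{text}",
        expected_output="A JSON object matching the InsuranceDocumentExtraction schema",
        agent=extraction_agent,  # defined in step 3
    )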

By Cyprian Aarons, AI Consultant at Topiax.