How to Build a Document Extraction Agent Using AutoGen in Python for Insurance

By Cyprian Aarons · Updated 2026-04-21
Tags: document-extraction, autogen, python, insurance

A document extraction agent for insurance takes messy inputs like FNOL forms, claim letters, invoices, policy schedules, and medical reports, then turns them into structured JSON your downstream systems can trust. That matters because insurance workflows live or die on speed, accuracy, and auditability: if extraction is wrong, you misroute claims, delay settlement, or create compliance risk.

Architecture

Build this agent as a small pipeline, not a single prompt.

  • Document ingestion layer

    • Accept PDF, image, and text inputs from email, S3, SharePoint, or a claims portal.
    • Normalize files before they hit the LLM layer.
  • OCR / text extraction service

    • Use a deterministic OCR step for scanned documents.
    • Keep raw text alongside page coordinates when possible for audit trails.
  • AutoGen extraction agent

    • Use AssistantAgent to transform extracted text into structured insurance fields.
    • Constrain output to a schema the claims system understands.
  • Validation and policy rules engine

    • Check required fields like claimant name, loss date, policy number, and amounts.
    • Reject or flag outputs that violate business rules.
  • Human review queue

    • Route low-confidence or incomplete extractions to an adjuster or ops analyst.
    • Keep the original source document and model output side by side.
  • Audit store

    • Persist prompts, model responses, versioned schemas, and reviewer decisions.
    • This is non-negotiable for regulated insurance workflows.
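The layers above can be sketched as a thin pipeline that fails closed. This is a minimal sketch, not a real integration: `run_ocr` is a placeholder for an actual OCR engine, and the `extract` and `validate` callables stand in for the AutoGen agent and the rules engine described in the steps below.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class PipelineResult:
    raw_text: str
    payload: dict = field(default_factory=dict)
    needs_review: bool = False
    audit_log: list = field(default_factory=list)  # persisted to the audit store

def run_ocr(file_bytes: bytes) -> str:
    # Placeholder: swap in a real OCR engine for scanned documents.
    # Plain-text inputs pass straight through.
    return file_bytes.decode("utf-8", errors="replace")

def run_pipeline(file_bytes: bytes, extract: Callable, validate: Callable) -> PipelineResult:
    result = PipelineResult(raw_text=run_ocr(file_bytes))
    result.audit_log.append("ocr_complete")
    raw_fields = extract(result.raw_text)      # AutoGen extraction agent
    result.audit_log.append("extraction_complete")
    try:
        result.payload = validate(raw_fields)  # schema + business rules
    except Exception:
        result.needs_review = True             # route to human review queue
        result.audit_log.append("validation_failed")
    return result
```

The point of the shape is that every stage appends to the audit log and validation failures set a review flag instead of raising into the claims system.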

Implementation

1) Install dependencies and define the extraction schema

Use AutoGen’s agent API with a strict JSON contract. Install the dependencies first (at the time of writing, `pip install pyautogen pydantic`; the `autogen` import below comes from the `pyautogen` package). For insurance extraction, keep the schema narrow and explicit so the model does not invent fields.

from pydantic import BaseModel, Field
from typing import Optional
import json

class ClaimExtraction(BaseModel):
    policy_number: Optional[str] = Field(default=None)
    claimant_name: Optional[str] = Field(default=None)
    loss_date: Optional[str] = Field(default=None)
    claim_type: Optional[str] = Field(default=None)
    amount_claimed: Optional[float] = Field(default=None)
    currency: Optional[str] = Field(default="USD")
    confidence: float = Field(ge=0.0, le=1.0)

def validate_extraction(payload: dict) -> ClaimExtraction:
    return ClaimExtraction(**payload)

2) Create an AutoGen assistant for extraction

This uses AssistantAgent from AutoGen. The key pattern is to make the agent act like a parser: no commentary, no reasoning dump, only JSON.

import os
from autogen import AssistantAgent

llm_config = {
    # config_list is AutoGen's documented way to pass model credentials
    "config_list": [
        {
            "model": "gpt-4o-mini",
            "api_key": os.environ["OPENAI_API_KEY"],
        }
    ],
    "temperature": 0,
}

extractor = AssistantAgent(
    name="claim_extractor",
    llm_config=llm_config,
    system_message=(
        "You extract structured data from insurance documents. "
        "Return only valid JSON matching this schema: "
        "{policy_number, claimant_name, loss_date, claim_type, amount_claimed, currency, confidence}. "
        "If a field is missing, use null. Never invent values."
    ),
)

document_text = """
ACME Insurance
Policy No: P-883192
Claimant: Maria Santos
Loss Date: 2025-01-14
Claim Type: Water Damage
Amount Requested: $12,450.00
"""

message = f"""
Extract fields from this insurance document and return JSON only.

Document:
{document_text}
"""

response = extractor.generate_reply(messages=[{"role": "user", "content": message}])
print(response)

3) Parse and validate the model output

In production you should never trust raw LLM output. Parse it into your schema immediately and route failures to review.

def parse_json_response(text: str) -> dict:
    # Models sometimes wrap JSON in markdown fences; isolate the object first.
    start, end = text.find("{"), text.rfind("}")
    return json.loads(text[start : end + 1])

raw_output = response if isinstance(response, str) else response["content"]
payload = parse_json_response(raw_output)
extraction = validate_extraction(payload)

print(extraction.model_dump())
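The parse-then-validate step is best wrapped fail-closed, so malformed output is routed to review instead of raising into the claims system. The sketch below is standalone with a trimmed-down schema; `Claim` is illustrative, not the full ClaimExtraction model, and the caller is assumed to own the review-queue routing.

```python
import json
from typing import Optional
from pydantic import BaseModel, ValidationError

class Claim(BaseModel):
    policy_number: Optional[str] = None
    confidence: float

def safe_extract(raw: str) -> Optional[Claim]:
    # Fail closed: anything that does not both parse and validate
    # returns None, and the caller routes the document to human review.
    try:
        return Claim(**json.loads(raw))
    except (json.JSONDecodeError, ValidationError, TypeError):
        return None
```

Returning a sentinel rather than raising keeps the exception path explicit at the call site, which matters when the fallback is a human queue and not a retry.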

4) Add a reviewer agent for exceptions

For ambiguous documents or low-confidence outputs, use a second AutoGen agent to ask clarifying questions or recommend human review. UserProxyAgent is useful when you want controlled interaction with an operator workflow.

from autogen import UserProxyAgent

reviewer = UserProxyAgent(
    name="claims_reviewer",
    human_input_mode="NEVER",
    code_execution_config=False,  # this agent reviews text; it never runs code
)

if extraction.confidence < 0.8 or extraction.policy_number is None:
    reviewer.initiate_chat(
        extractor,
        message=(
            "The previous extraction is incomplete or low confidence. "
            "Re-check the document and only return corrected JSON."
        ),
    )

Production Considerations

  • Data residency

    • Keep OCR text and extracted payloads in-region if your carrier operates under local data residency rules.
    • If documents contain PHI or regulated personal data, confirm where the model endpoint processes requests.
  • Auditability

    • Store every prompt, response, schema version, and reviewer override.
    • Regulators and internal audit teams will ask how a field was derived.
  • Guardrails

    • Enforce JSON schema validation before writing to claims systems.
    • Block unsupported fields and require human review when confidence drops below threshold.
  • Monitoring

    • Track extraction accuracy by document type: FNOL forms behave differently from repair invoices or medical bills.
    • Monitor latency separately for OCR time and LLM time so you know where failures start.
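One way to enforce the guardrails above is to forbid unexpected fields at the schema level and gate writes on confidence. A hedged sketch: the 0.8 threshold mirrors the reviewer example and is an assumption to tune per document type, and `safe_to_post` is a hypothetical gate, not part of any claims platform API.

```python
from typing import Optional
from pydantic import BaseModel, ConfigDict, ValidationError

class StrictClaim(BaseModel):
    # extra="forbid" rejects any field the model invents
    model_config = ConfigDict(extra="forbid")
    policy_number: Optional[str] = None
    confidence: float

def safe_to_post(payload: dict, threshold: float = 0.8) -> bool:
    # Gate before writing to the claims system: unknown fields, bad types,
    # low confidence, or a missing policy number all fall through to review.
    try:
        claim = StrictClaim(**payload)
    except ValidationError:
        return False
    return claim.confidence >= threshold and claim.policy_number is not None
```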

Common Pitfalls

  • Letting the model free-form answer

    • Mistake: asking for “extracted details” without enforcing structure.
    • Fix: require strict JSON and validate with Pydantic before any downstream use.
  • Skipping OCR normalization

    • Mistake: sending raw scans directly into the agent.
    • Fix: run OCR first and preserve page order; bad text input produces bad extraction no matter how good the prompt is.
  • Ignoring exception handling

    • Mistake: auto-posting every result into the claims platform.
    • Fix: add thresholds for missing fields and low confidence so ambiguous cases go to human review.
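The OCR pitfall above is mostly hygiene before prompting. A minimal normalization pass, assumed here rather than tied to any particular OCR engine, collapses whitespace noise and tags page boundaries so the agent and the audit trail can point back to a source page:

```python
def normalize_pages(pages: list[str]) -> str:
    # Join OCR output in page order, tagging boundaries so extracted
    # fields can be traced back to the page they came from.
    cleaned = []
    for i, page in enumerate(pages, start=1):
        text = " ".join(page.split())  # collapse OCR whitespace noise
        cleaned.append(f"[PAGE {i}]\n{text}")
    return "\n\n".join(cleaned)
```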

If you want this to hold up in an insurance environment, treat AutoGen as orchestration glue around deterministic parsing rules. The agent should extract fast; your validation layer should decide what is safe enough to trust.


By Cyprian Aarons, AI Consultant at Topiax.