How to Build a Document Extraction Agent Using CrewAI in Python for Pension Funds

By Cyprian Aarons · Updated 2026-04-21
document-extraction · crewai · python · pension-funds

A document extraction agent for pension funds reads statements, contribution schedules, trustee reports, benefit letters, and regulatory filings, then turns them into structured data your downstream systems can trust. It matters because pension operations are document-heavy, accuracy-sensitive, and audit-driven: one bad extraction can break member records, compliance reporting, or actuarial inputs.

Architecture

  • Ingestion layer

    • Accept PDFs, scanned images, emails, and zip bundles from secure storage or a case-management system.
    • Normalize filenames, metadata, and source identifiers for audit trails.
  • OCR and text normalization

    • Run OCR on scanned documents; pull embedded text directly from born-digital PDFs.
    • Clean headers, footers, page numbers, and duplicated table fragments before extraction.
  • CrewAI extraction crew

    • A Crew with specialized Agents for:
      • document classification
      • field extraction
      • validation against pension-specific rules
      • exception handling for low-confidence records
  • Schema and validation layer

    • Enforce output against a strict JSON schema.
    • Validate fields like member ID format, contribution dates, fund codes, employer references, and currency values.
  • Audit logging

    • Persist source document hashes, extracted fields, model version, prompts, and confidence scores.
    • Keep a replayable trail for compliance review.
  • Human review queue

    • Route ambiguous or policy-sensitive documents to an operations analyst.
    • Never auto-post low-confidence pension transactions.
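The audit-logging layer above can be sketched as a small record builder. The `build_audit_record` helper and its field names are illustrative assumptions, not part of CrewAI or any logging library:

```python
import hashlib
import json
from datetime import datetime, timezone

def build_audit_record(source_bytes: bytes, extracted: dict,
                       model_name: str, prompt_version: str) -> dict:
    """Build one replayable audit record: source hash, payload, model metadata."""
    return {
        "source_sha256": hashlib.sha256(source_bytes).hexdigest(),
        "extracted_fields": extracted,
        "model": model_name,
        "prompt_version": prompt_version,
        "confidence": extracted.get("confidence"),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

record = build_audit_record(
    b"%PDF-1.7 ...",  # raw bytes of the ingested document
    {"member_id": "PF-102938", "confidence": 0.93},
    "gpt-4o",
    "extraction-prompt-v3",
)
print(json.dumps(record, indent=2))
```

Because the record hashes the original bytes rather than the OCR text, a reviewer can later prove which physical document produced a given posting.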

Implementation

1) Install CrewAI and define your extraction schema

For pension funds, don’t start with “extract everything.” Start with the exact fields you need for operations and compliance. Keep the schema narrow so the agent is easier to validate and easier to audit.

pip install crewai pydantic python-dotenv

from pydantic import BaseModel, Field
from typing import Optional

class PensionExtraction(BaseModel):
    document_type: str = Field(description="Type of pension document")
    member_id: Optional[str] = Field(default=None)
    employer_name: Optional[str] = Field(default=None)
    fund_name: Optional[str] = Field(default=None)
    contribution_period: Optional[str] = Field(default=None)
    contribution_amount: Optional[float] = Field(default=None)
    currency: Optional[str] = Field(default=None)
    confidence: float = Field(ge=0.0, le=1.0)
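To see the `ge`/`le` bounds doing real work, you can exercise the schema directly. This is a trimmed copy of the model above, assuming Pydantic v2:

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class PensionExtraction(BaseModel):
    document_type: str = Field(description="Type of pension document")
    member_id: Optional[str] = Field(default=None)
    contribution_amount: Optional[float] = Field(default=None)
    confidence: float = Field(ge=0.0, le=1.0)

# A valid payload parses cleanly; optional fields default to None.
ok = PensionExtraction.model_validate({"document_type": "statement", "confidence": 0.9})

# An out-of-range confidence is rejected before it reaches downstream systems.
try:
    PensionExtraction.model_validate({"document_type": "statement", "confidence": 1.7})
except ValidationError as exc:
    print("rejected:", exc.error_count(), "error(s)")
```

The point of the narrow schema is exactly this: malformed model output fails loudly at the boundary instead of silently corrupting member records.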

2) Create agents with clear responsibilities

Use one agent to extract and another to validate. That separation matters in production because extraction quality and business-rule validation are not the same problem.

from crewai import Agent

extractor = Agent(
    role="Pension Document Extractor",
    goal="Extract structured pension data from raw document text",
    backstory="You specialize in pension fund statements, contribution reports, and benefit letters.",
    verbose=True,
)

validator = Agent(
    role="Pension Data Validator",
    goal="Validate extracted pension data against business rules",
    backstory="You check member IDs, dates, currencies, and contribution logic for consistency.",
    verbose=True,
)

3) Build tasks and run the crew

This pattern uses Task, Crew, and Process.sequential. Sequential processing is a good default when validation depends on prior extraction output.

from crewai import Task, Crew, Process

document_text = """
Employer: Acme Manufacturing Ltd
Fund: Horizon Pension Fund
Member ID: PF-102938
Contribution Period: 2024-01
Contribution Amount: ZAR 12500.00
"""

extract_task = Task(
    description=(
        "Extract pension document fields from the provided text. "
        "Return a JSON object matching the PensionExtraction schema.\n\n"
        "Document text:\n{document_text}"
    ),
    expected_output="A structured JSON object with extracted pension fields.",
    agent=extractor,
)

validate_task = Task(
    description=(
        "Review the extracted JSON for pension-fund correctness. "
        "Flag missing member IDs, invalid currencies, or suspicious amounts."
    ),
    expected_output="A validation summary with any issues found.",
    agent=validator,
)

crew = Crew(
    agents=[extractor, validator],
    tasks=[extract_task, validate_task],
    process=Process.sequential,
    verbose=True,
)

result = crew.kickoff(inputs={"document_text": document_text})
print(result)

4) Add a production wrapper around parsing and review routing

In real deployments, parse the output into your schema and route low-confidence cases to humans. For pension funds, this is the layer that prevents bad postings and preserves auditability.

import json

def should_route_for_review(extraction_dict):
    if extraction_dict.get("confidence", 0) < 0.85:
        return True
    if not extraction_dict.get("member_id"):
        return True
    if extraction_dict.get("currency") not in {"ZAR", "USD", "GBP"}:
        return True
    return False

# CrewOutput exposes the final answer on `.raw` in recent CrewAI versions;
# fall back to str() for older releases.
raw_output = getattr(result, "raw", str(result))
# In practice parse structured output from your task response format.
# Keep this wrapper explicit so you can log failures cleanly.
print(raw_output)

# Example routing decision once parsed:
# parsed = PensionExtraction.model_validate_json(raw_json_string).model_dump()
# if should_route_for_review(parsed):
#     send_to_human_queue(parsed)
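A self-contained version of that parse-then-route wrapper might look like the sketch below. The `parse_or_review` name and the trimmed schema are illustrative; Pydantic v2's `model_validate_json` conveniently raises `ValidationError` for both invalid JSON and schema violations, so one `except` covers both failure modes:

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class PensionExtraction(BaseModel):
    document_type: str
    member_id: Optional[str] = None
    currency: Optional[str] = None
    confidence: float = Field(ge=0.0, le=1.0)

def should_route_for_review(d: dict) -> bool:
    """Same rules as above: low confidence, missing ID, or unknown currency."""
    return (
        d.get("confidence", 0) < 0.85
        or not d.get("member_id")
        or d.get("currency") not in {"ZAR", "USD", "GBP"}
    )

def parse_or_review(raw_json: str):
    """Parse crew output; unparseable or risky payloads go to the human queue."""
    try:
        parsed = PensionExtraction.model_validate_json(raw_json).model_dump()
    except ValidationError:
        return None, True  # unparseable output always needs review
    return parsed, should_route_for_review(parsed)

good = ('{"document_type": "statement", "member_id": "PF-102938", '
        '"currency": "ZAR", "confidence": 0.95}')
parsed, needs_review = parse_or_review(good)
print(needs_review)  # False: high confidence, valid ID, allowed currency
```

Returning a `(payload, needs_review)` pair keeps the routing decision in one place, so the same function serves both the auto-post path and the review queue.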

Production Considerations

  • Data residency

    • Keep documents in-region if your pension regulator or fund policy requires it.
    • Don’t send member data across jurisdictions without legal review.
  • Auditability

    • Log the original file hash, extracted JSON, prompt version, model name, timestamp, and reviewer outcome.
    • You need this when trustees ask how a number was produced.
  • Guardrails

    • Reject outputs that fail schema validation or violate domain rules.
    • Examples: negative contributions without reversal context, invalid member IDs, or currency mismatches.
  • Monitoring

    • Track extraction accuracy by document type: contribution schedules behave differently from benefit statements.
    • Watch human-review rate; if it spikes after a template change from an employer or administrator vendor, retrain prompts or add examples.
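The guardrail examples above (negative contributions, invalid member IDs, currency mismatches) can be made concrete as deterministic checks. The `PF-` plus six digits pattern and the `reversal_reference` field are illustrative assumptions, not any real fund's format:

```python
import re

# Illustrative format; substitute your fund's real member-ID pattern.
MEMBER_ID_RE = re.compile(r"^PF-\d{6}$")
ALLOWED_CURRENCIES = {"ZAR", "USD", "GBP"}

def guardrail_errors(record: dict) -> list:
    """Deterministic rejection rules applied after LLM extraction."""
    errors = []
    if not MEMBER_ID_RE.match(record.get("member_id") or ""):
        errors.append("invalid member ID")
    if record.get("currency") not in ALLOWED_CURRENCIES:
        errors.append("currency mismatch")
    amount = record.get("contribution_amount")
    if amount is not None and amount < 0 and not record.get("reversal_reference"):
        errors.append("negative contribution without reversal context")
    return errors

errs = guardrail_errors(
    {"member_id": "PF-102938", "currency": "ZAR", "contribution_amount": -500.0}
)
print(errs)  # ['negative contribution without reversal context']
```

These rules run after schema validation and before anything touches the ledger: the LLM proposes, the regex and set membership dispose.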

Common Pitfalls

  1. Using one generic prompt for every document type

    • Pension documents vary a lot.
    • Fix it by classifying first: statement vs contribution schedule vs trustee report vs regulatory filing.
  2. Skipping deterministic validation

    • LLMs are not your control plane.
    • Fix it with Pydantic schemas plus hard business rules before anything reaches the ledger or admin system.
  3. Ignoring audit requirements

    • If you can’t explain an extraction later, it’s not production-ready for pensions.
    • Fix it by storing source hashes, model metadata, extracted payloads, and reviewer actions in immutable logs.
  4. Auto-accepting low-confidence results

    • A single wrong member ID can create downstream reconciliation work.
    • Fix it by routing uncertain outputs to operations staff instead of posting automatically.
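The classify-first fix from pitfall 1 doesn't have to start with an LLM. A cheap keyword router can pick the right prompt before any model call; the keywords and category names here are illustrative:

```python
def classify_document(text: str) -> str:
    """Keyword-based pre-classification so each type gets its own prompt."""
    t = text.lower()
    if "contribution schedule" in t or "contribution period" in t:
        return "contribution_schedule"
    if "trustee" in t:
        return "trustee_report"
    if "benefit" in t:
        return "benefit_letter"
    if "regulator" in t or "filing" in t:
        return "regulatory_filing"
    return "statement"  # default bucket for member statements

print(classify_document("Contribution Period: 2024-01\nEmployer: Acme"))
# contribution_schedule
```

Documents the router can't place confidently can fall through to an LLM classifier, but most pension paperwork is templated enough that keywords catch the bulk of the volume.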

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
