How to Build a Document Extraction Agent Using AutoGen in Python for Pension Funds

By Cyprian Aarons · Updated 2026-04-21
document-extraction · autogen · python · pension-funds

A document extraction agent for pension funds reads inbound PDFs and scans, from member statements and contribution schedules to beneficiary forms and trustee packs, then turns them into structured records your downstream systems can validate and store. It matters because pension operations are document-heavy, error-sensitive, and audit-driven: a bad extraction can mean a wrong contribution posting, a compliance issue, or a delayed member update.

Architecture

  • Document intake layer

    • Pulls files from S3, SharePoint, email attachments, or an internal DMS.
    • Normalizes file metadata: source, received time, member ID, scheme ID, retention class.
  • OCR / text extraction layer

    • Uses PDF text extraction first.
    • Falls back to OCR for scanned forms and image-based statements.
    • Preserves page numbers for audit trails (see the sketch after this list).
  • AutoGen agent layer

    • An AssistantAgent performs field extraction into a strict schema.
    • A UserProxyAgent executes Python tools for parsing, validation, and persistence.
    • Optional second agent reviews edge cases like missing NI numbers or inconsistent dates.
  • Validation and policy layer

    • Checks extracted values against pension-specific rules:
      • contribution totals
      • date ranges
      • member identifiers
      • mandatory disclosures
    • Flags low-confidence fields for human review.
  • Audit and storage layer

    • Stores raw text, extracted JSON, confidence scores, model version, and prompt version.
    • Writes immutable logs for compliance and dispute handling.
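
To make the OCR / text extraction layer concrete, here is a minimal sketch that reads the embedded text layer first and falls back to per-page OCR, keeping page numbers for the audit trail. It assumes a recent PyMuPDF (fitz), pytesseract, and Pillow; swap in AWS Textract or another OCR engine as your stack requires.

import io
from typing import Dict, List

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def extract_pages(pdf_path: str) -> List[Dict]:
    """Return one record per page: page number, text, and how the text was obtained."""
    pages = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            text = page.get_text().strip()
            source = "pdf_text"
            if not text:
                # No embedded text layer, so render the page and OCR it.
                pixmap = page.get_pixmap(dpi=300)
                image = Image.open(io.BytesIO(pixmap.tobytes("png")))
                text = pytesseract.image_to_string(image)
                source = "ocr"
            pages.append({"page": page_number, "text": text, "source": source})
    return pages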

Implementation

1) Install AutoGen and define the schema

For production use, keep the output contract strict. Pension teams do not want free-form prose; they want JSON that can be validated.

from pydantic import BaseModel, Field
from typing import Optional, List

class PensionDocumentExtraction(BaseModel):
    document_type: str = Field(..., description="e.g. contribution_schedule, beneficiary_form")
    scheme_name: Optional[str] = None
    member_name: Optional[str] = None
    member_id: Optional[str] = None
    national_insurance_number: Optional[str] = None
    employer_name: Optional[str] = None
    period_start: Optional[str] = None
    period_end: Optional[str] = None
    total_contribution_gbp: Optional[float] = None
    pages_referenced: List[int] = Field(default_factory=list)
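
When the assistant replies with JSON as a string, which is the usual case, Pydantic v2 can parse and validate it in one step. A quick illustration (the JSON below is made up for the example):

raw_json = '{"document_type": "member_statement", "member_id": "M-1043", "pages_referenced": [1, 2]}'

doc = PensionDocumentExtraction.model_validate_json(raw_json)
print(doc.member_id, doc.pages_referenced)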

2) Create the AutoGen agents

Use AssistantAgent for extraction logic and UserProxyAgent for tool execution. The proxy is where you keep deterministic Python checks.

import os
from autogen import AssistantAgent, UserProxyAgent

llm_config = {
    "config_list": [
        {
            "model": "gpt-4o-mini",
            "api_key": os.environ["OPENAI_API_KEY"],
        }
    ],
    "temperature": 0,
}

assistant = AssistantAgent(
    name="pension_extractor",
    llm_config=llm_config,
    system_message=(
        "Extract pension document fields into valid JSON only. "
        "Do not guess missing values. "
        "Return page references when possible."
    ),
)

user_proxy = UserProxyAgent(
    name="validator",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "./workdir", "use_docker": False},
)
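
One note on the configuration above: use_docker=False runs tool code directly in the local Python environment, which is convenient while prototyping, but the AutoGen documentation recommends Docker-based execution for anything that may run model-generated code, so plan to enable it before production.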

3) Add a parsing tool and run the conversation

This pattern uses UserProxyAgent.register_function() to map a tool name to deterministic Python code that the proxy executes. The assistant also needs the function's signature declared to it (via a functions/tools entry in llm_config or register_for_llm, depending on your AutoGen version) before the model can call it; see the sketch after the registration below. In practice you would register PDF text extraction here too, using PyMuPDF or OCR output from Tesseract/AWS Textract.

import json
from typing import Dict

def validate_pension_payload(payload: Dict) -> Dict:
    errors = []

    if payload.get("document_type") not in {
        "contribution_schedule",
        "beneficiary_form",
        "member_statement",
        "trustee_pack",
    }:
        errors.append("invalid_document_type")

    total = payload.get("total_contribution_gbp")
    if total is not None and total < 0:
        errors.append("negative_contribution")

    if payload.get("member_id") is None and payload.get("national_insurance_number") is None:
        errors.append("missing_member_identifier")

    return {"valid": len(errors) == 0, "errors": errors}

user_proxy.register_function(
    function_map={
        "validate_pension_payload": validate_pension_payload,
    }
)
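
# Note: register_function() above only makes the tool executable by the proxy.
# For the assistant to actually call it, the function's signature must also be
# advertised to the model. One option (the exact API varies by AutoGen version,
# so treat this as a sketch) is the register_for_llm hook:
assistant.register_for_llm(
    name="validate_pension_payload",
    description="Validate extracted pension fields against policy rules.",
)(validate_pension_payload)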

message = """
Extract the key fields from this pension document text:

Page 1:
Scheme Name: Northshore Pension Plan
Member Name: Jane Smith
NI Number: AB123456C
Employer: Northshore Manufacturing Ltd
Period: 2024-01-01 to 2024-01-31
Total Contribution GBP: 425.50

Return JSON matching the schema.
"""

result = user_proxy.initiate_chat(
    assistant,
    message=message,
)
print(result.summary)

4) Post-process into validated records

The agent should not write directly to your pension admin system. Validate first, then persist only clean records with traceability.

from pydantic import ValidationError

raw_output = {
    "document_type": "contribution_schedule",
    "scheme_name": "Northshore Pension Plan",
    "member_name": "Jane Smith",
    "member_id": None,
    "national_insurance_number": "AB123456C",
    "employer_name": "Northshore Manufacturing Ltd",
    "period_start": "2024-01-01",
    "period_end": "2024-01-31",
    "total_contribution_gbp": 425.50,
    "pages_referenced": [1],
}

try:
    doc = PensionDocumentExtraction(**raw_output)
except ValidationError as e:
    raise RuntimeError(f"Schema validation failed: {e}")

check = validate_pension_payload(doc.model_dump())
if not check["valid"]:
    raise RuntimeError(f"Policy validation failed: {check['errors']}")

print(doc.model_dump_json(indent=2))

Production Considerations

  • Data residency

    • Keep extraction inside approved regions if your pension book requires UK/EU residency.
    • If you use hosted LLMs, confirm where prompts, outputs, and logs are stored.
  • Auditability

    • Persist raw document hash, extracted text hash, prompt version, model name, and response timestamp (see the sketch after this list).
    • You need this when trustees ask why a field was interpreted a certain way.
  • Compliance guardrails

    • Block the agent from inventing missing member data.
    • Force “unknown” instead of guessing on NI numbers, dates of birth, or contribution values.
    • Route exceptions to human review before downstream posting.
  • Monitoring

    • Track field-level accuracy by document type.
    • Alert on spikes in low-confidence extractions for specific employers or schemes.
    • Log every tool call from UserProxyAgent for operational review.
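
As a sketch of what such an audit record can look like, the snippet below hashes the raw document and the extracted text and appends a JSON-lines entry. Field names and the file-based sink are illustrative; production systems typically write to WORM storage or an append-only ledger table.

import hashlib
import json
from datetime import datetime, timezone

def build_audit_record(raw_bytes: bytes, extracted_text: str, extraction: dict,
                       prompt_version: str, model_name: str) -> dict:
    """Assemble one immutable audit record per extraction run."""
    return {
        "document_sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "extracted_text_sha256": hashlib.sha256(extracted_text.encode("utf-8")).hexdigest(),
        "extraction": extraction,
        "prompt_version": prompt_version,
        "model_name": model_name,
        "response_timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Reusing `doc` from step 4; the raw bytes and text here are placeholders.
record = build_audit_record(
    raw_bytes=b"%PDF-1.7 ...",
    extracted_text="Scheme Name: Northshore Pension Plan ...",
    extraction=doc.model_dump(),
    prompt_version="v3",
    model_name="gpt-4o-mini",
)
with open("audit_log.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")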

Common Pitfalls

  1. Letting the model free-write records

    • Fix it by forcing structured JSON output and validating with Pydantic before storage.
    • Pension ops needs machine-checkable fields, not narrative summaries.
  2. Skipping OCR/page mapping

    • If you lose page references, audit becomes painful.
    • Store page numbers per extracted field so reviewers can jump straight to source evidence.
  3. Writing directly into core pension systems

    • Do not let the agent post contributions or update member records without validation.
    • Put a policy gate between extraction and mutation so compliance teams can review exceptions.
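
To make the last point concrete, the gate can start as a simple routing function between extraction and any write to the admin system. A minimal sketch; the threshold and destination names are illustrative:

def route_extraction(check: dict, confidence: float, min_confidence: float = 0.85) -> str:
    """Route a record to posting or to human review, never straight to mutation."""
    if not check["valid"] or confidence < min_confidence:
        return "human_review_queue"
    return "post_to_admin_system"

# Reusing `check` from step 4; the confidence score would come from your
# extraction pipeline (e.g. field-level OCR or model confidence).
print(route_extraction(check, confidence=0.91))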

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
