How to Build a Document Extraction Agent Using LlamaIndex in Python for Pension Funds

By Cyprian Aarons · Updated 2026-04-21
Tags: document-extraction, llamaindex, python, pension-funds

A document extraction agent for pension funds takes messy PDFs, scans, statements, trustee packs, and policy docs, then turns them into structured fields your downstream systems can use. That matters because pension operations run on accuracy, traceability, and compliance: contribution schedules, member records, benefit calculations, and regulatory filings all need clean extraction with an audit trail.

Architecture

  • Document ingestion layer

    • Pulls PDFs, scans, and Office docs from S3, SharePoint, SFTP, or an internal file store.
    • Normalizes files into text or image-backed pages before extraction.
  • LlamaIndex parsing layer

    • Uses SimpleDirectoryReader for local batches or custom loaders for enterprise sources.
    • Converts documents into Document objects with metadata like source system, upload time, and retention class.
  • Extraction engine

    • Uses PydanticProgramExtractor or LLMTextCompletionProgram patterns to map unstructured text into a strict schema.
    • Extracts fields like employer name, scheme ID, contribution period, member count, and totals.
  • Validation and rules layer

    • Checks extracted values against pension-specific rules.
    • Flags missing trustee signatures, invalid dates, negative contributions, or mismatched totals.
  • Audit storage

    • Stores raw document references, extracted JSON, model version, prompt version, and timestamps (see the audit-record sketch after this list).
    • Supports review workflows for compliance teams.
  • Human review queue

    • Sends low-confidence or rule-failed extractions to an operator before final posting.
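
The audit storage layer is easiest to get right if the record shape is fixed up front. Below is a minimal sketch of such a record; the AuditRecord model and its field names are illustrative assumptions, not a LlamaIndex API, so adapt them to your fund's retention requirements.

from datetime import datetime, timezone
from pydantic import BaseModel, Field

class AuditRecord(BaseModel):
    document_uri: str       # reference to the raw file, never the file contents
    document_sha256: str    # hash ties the extraction to an exact document version
    model_name: str         # e.g. "gpt-4o-mini"
    prompt_version: str     # bump whenever the prompt template changes
    extracted_json: dict    # the final structured output
    extracted_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    review_status: str = "pending"  # pending | approved | review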

Implementation

1) Define the extraction schema

Use Pydantic to lock the output shape. In pension operations, free-form JSON is a liability; strict models make validation and audit easier.

from pydantic import BaseModel, Field
from typing import Optional

class PensionContributionRecord(BaseModel):
    scheme_name: str = Field(..., description="Name of the pension scheme")
    employer_name: str = Field(..., description="Employer submitting the contribution file")
    scheme_id: Optional[str] = Field(None, description="Internal or regulatory scheme identifier")
    contribution_period: str = Field(..., description="Payroll or contribution period")
    total_employee_contributions: float = Field(..., ge=0)
    total_employer_contributions: float = Field(..., ge=0)
    currency: str = Field(..., description="ISO currency code")
    submission_date: Optional[str] = None
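
One benefit of the ge=0 constraints is immediate: a negative contribution total fails at parse time instead of flowing silently into downstream systems. A quick demonstration with made-up values, assuming the model above:

from pydantic import ValidationError

try:
    PensionContributionRecord(
        scheme_name="Acme Staff Pension",
        employer_name="Acme Ltd",
        contribution_period="2026-03",
        total_employee_contributions=-100.0,  # violates ge=0
        total_employer_contributions=500.0,
        currency="GBP",
    )
except ValidationError as exc:
    print(exc)  # names the failing field instead of accepting bad data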

2) Load documents with LlamaIndex

For local batches of pension documents during development or back-office processing:

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_dir="./pension_docs",
    recursive=True
).load_data()

If you need metadata for auditability, attach it before extraction:

for doc in documents:
    doc.metadata["source_system"] = "sharepoint"
    doc.metadata["retention_class"] = "pension_ops_7y"

3) Build the extractor with PydanticProgramExtractor

This is the core pattern: wrap a Pydantic program (here, LLMTextCompletionProgram) in PydanticProgramExtractor so each document is mapped to structured output instead of brittle prompt parsing.

from llama_index.core.extractors import PydanticProgramExtractor
from llama_index.core.program import LLMTextCompletionProgram
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini", temperature=0)

# The program maps raw text into the strict Pydantic schema.
program = LLMTextCompletionProgram.from_defaults(
    output_cls=PensionContributionRecord,
    llm=llm,
    prompt_template_str=(
        "Extract the pension contribution record from the document.\n"
        "Return only fields that match the schema.\n"
        "If a field is missing and optional, return null.\n"
        "Do not invent values.\n"
        "{input}"
    ),
)

# The extractor runs the program over every document and returns
# one metadata dict per input.
extractor = PydanticProgramExtractor(program=program, input_key="input")

results = [
    PensionContributionRecord(**metadata)
    for metadata in extractor.extract(documents)
]

for record in results:
    print(record.model_dump())

That pattern works because PydanticProgramExtractor enforces structure at the model boundary. For pension funds, that’s where you want strictness: before data hits finance systems or member records.

4) Add validation and routing

After extraction, apply deterministic checks before anything is posted downstream.

def validate_record(record: PensionContributionRecord) -> list[str]:
    errors = []

    if record.currency not in {"GBP", "EUR", "USD"}:
        errors.append("Unsupported currency")

    if record.total_employee_contributions == 0 and record.total_employer_contributions == 0:
        errors.append("Both contribution totals are zero")

    if not record.scheme_id:
        errors.append("Missing scheme_id")

    return errors

for record in results:
    issues = validate_record(record)
    if issues:
        print({"status": "review", "issues": issues, "record": record.model_dump()})
    else:
        print({"status": "approved", "record": record.model_dump()})

This split is important. Pension operations should never rely on LLM output alone when a simple business rule can catch bad data.
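
To connect this to the human review queue from the architecture above, flagged records need to land somewhere an operator can work through them. A minimal sketch using a local JSONL file as the queue; a real deployment would use a database or case-management tool, and the file path here is a placeholder:

import json

REVIEW_QUEUE_PATH = "review_queue.jsonl"  # placeholder; use a durable store in production

def route_record(record: PensionContributionRecord) -> str:
    issues = validate_record(record)
    if issues:
        # Queue flagged extractions for operator review before any posting
        with open(REVIEW_QUEUE_PATH, "a") as f:
            f.write(json.dumps({"issues": issues, "record": record.model_dump()}) + "\n")
        return "review"
    return "approved"

Returning the status lets the caller decide whether to post the record downstream immediately or hold it for sign-off.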

Production Considerations

  • Data residency

    • Keep document processing in-region if your fund has UK/EU residency requirements.
    • Use approved model endpoints and avoid sending raw member data to unvetted third-party services.
  • Auditability

    • Store document hashes, extraction timestamps, model name/version, prompt version, and final JSON output.
    • If compliance asks why a field was extracted a certain way, you need reproducibility.
  • Access control

    • Restrict who can upload documents, view extracted data, and trigger reprocessing.
    • Pension files often contain sensitive personal and financial information; treat them as regulated records.
  • Monitoring

    • Track extraction failure rate, missing-field rate, manual-review rate, and drift by document type (a simple rate calculation is sketched after this list).
    • A sudden jump in review volume usually means a template changed upstream.
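
None of these metrics needs a monitoring platform on day one. As a starting point, here is a small sketch that derives the manual-review rate from routing outcomes; the list of statuses is assumed to come from a step like route_record above:

from collections import Counter

def review_rate(statuses: list[str]) -> float:
    """Share of extractions routed to manual review."""
    counts = Counter(statuses)
    total = sum(counts.values())
    return counts["review"] / total if total else 0.0

# Example: statuses collected over one day's batch
print(review_rate(["approved", "review", "approved", "approved"]))  # 0.25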

Common Pitfalls

  • Using free-form text output instead of structured schemas

    • This leads to brittle parsing and silent failures.
    • Use PydanticProgramExtractor with explicit types so invalid outputs fail fast.
  • Skipping metadata

    • Without source system IDs, document hashes, and retention tags, you lose audit traceability.
    • Attach metadata at ingestion time and persist it alongside extracted records.
  • Trusting every extracted field equally

    • The model may get names right but miss totals or dates on scanned PDFs.
    • Validate critical numeric fields against business rules and route exceptions to humans.

If you build this as a structured pipeline instead of a chat demo, you get something pension teams can actually use: repeatable extraction, clear controls, and an audit trail that survives compliance review.

