How to Build a Document Extraction Agent for Pension Funds Using LlamaIndex in Python
A document extraction agent for pension funds takes messy PDFs, scans, statements, trustee packs, and policy docs, then turns them into structured fields your downstream systems can use. That matters because pension operations run on accuracy, traceability, and compliance: contribution schedules, member records, benefit calculations, and regulatory filings all need clean extraction with an audit trail.
Architecture
- **Document ingestion layer**
  - Pulls PDFs, scans, and Office docs from S3, SharePoint, SFTP, or an internal file store.
  - Normalizes files into text or image-backed pages before extraction.
- **LlamaIndex parsing layer**
  - Uses `SimpleDirectoryReader` for local batches or custom loaders for enterprise sources.
  - Converts documents into `Document` objects with metadata like source system, upload time, and retention class.
- **Extraction engine**
  - Uses `PydanticProgramExtractor` or `LLMTextCompletionProgram` patterns to map unstructured text into a strict schema.
  - Extracts fields like employer name, scheme ID, contribution period, member count, and totals.
- **Validation and rules layer**
  - Checks extracted values against pension-specific rules.
  - Flags missing trustee signatures, invalid dates, negative contributions, or mismatched totals.
- **Audit storage**
  - Stores raw document references, extracted JSON, model version, prompt version, and timestamps.
  - Supports review workflows for compliance teams.
- **Human review queue**
  - Sends low-confidence or rule-failed extractions to an operator before final posting.
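The flow between those layers can be sketched as a small routing function (illustrative only; the `ExtractionResult` shape and stage names are assumptions, not LlamaIndex APIs):

```python
from dataclasses import dataclass, field

@dataclass
class ExtractionResult:
    # What the extraction engine hands to the validation layer.
    record: dict
    issues: list[str] = field(default_factory=list)

def route(result: ExtractionResult) -> str:
    # Rule failures go to the human review queue; clean records are
    # posted downstream. Audit storage records both paths either way.
    return "review_queue" if result.issues else "post_downstream"

print(route(ExtractionResult(record={"scheme_id": "S-100"})))            # post_downstream
print(route(ExtractionResult(record={}, issues=["Missing scheme_id"])))  # review_queue
```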
Implementation
1) Define the extraction schema
Use Pydantic to lock the output shape. In pension operations, free-form JSON is a liability; strict models make validation and audit easier.
```python
from pydantic import BaseModel, Field
from typing import Optional

class PensionContributionRecord(BaseModel):
    scheme_name: str = Field(..., description="Name of the pension scheme")
    employer_name: str = Field(..., description="Employer submitting the contribution file")
    scheme_id: Optional[str] = Field(None, description="Internal or regulatory scheme identifier")
    contribution_period: str = Field(..., description="Payroll or contribution period")
    total_employee_contributions: float = Field(..., ge=0)
    total_employer_contributions: float = Field(..., ge=0)
    currency: str = Field(..., description="ISO currency code")
    submission_date: Optional[str] = None
```
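It is worth confirming that the schema really enforces its constraints. In the quick check below (field values are made up, and the model is repeated in trimmed form so the snippet runs standalone), a negative contribution total is rejected at parse time:

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

# Trimmed copy of the model above, repeated so this snippet runs on its own.
class PensionContributionRecord(BaseModel):
    scheme_name: str
    employer_name: str
    scheme_id: Optional[str] = None
    contribution_period: str
    total_employee_contributions: float = Field(..., ge=0)
    total_employer_contributions: float = Field(..., ge=0)
    currency: str
    submission_date: Optional[str] = None

# A well-formed record parses cleanly.
ok = PensionContributionRecord(
    scheme_name="Acme Retirement Plan",
    employer_name="Acme Ltd",
    contribution_period="2024-03",
    total_employee_contributions=1200.50,
    total_employer_contributions=1800.75,
    currency="GBP",
)

# A negative total violates ge=0 and is rejected before it could
# ever reach a downstream finance system.
try:
    PensionContributionRecord(**{**ok.model_dump(), "total_employee_contributions": -50.0})
    raised = False
except ValidationError:
    raised = True
print(raised)  # True
```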
2) Load documents with LlamaIndex
For local batches of pension documents during development or back-office processing:
```python
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_dir="./pension_docs",
    recursive=True,
).load_data()
```
If you need metadata for auditability, attach it before extraction:
```python
for doc in documents:
    doc.metadata["source_system"] = "sharepoint"
    doc.metadata["retention_class"] = "pension_ops_7y"
```
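Since the audit layer later needs document hashes, one option is to build the audit fields at ingestion time in a single helper (a sketch; hashing the extracted text stands in for hashing the original file bytes, which is the stronger choice in production):

```python
import hashlib
from datetime import datetime, timezone

def audit_metadata(text: str, source_system: str) -> dict:
    # Hash the document content so compliance can later prove which
    # exact input produced a given extraction.
    return {
        "source_system": source_system,
        "retention_class": "pension_ops_7y",
        "doc_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

meta = audit_metadata("Scheme: Acme Retirement Plan ...", "sharepoint")
print(meta["doc_sha256"][:12])
```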
3) Build the extractor with LLMTextCompletionProgram

This is the core pattern. A Pydantic-backed program gives you structured output instead of brittle prompt parsing. Note the `{document_text}` template variable, which the program fills in per document.

```python
from llama_index.core.program import LLMTextCompletionProgram
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini", temperature=0)

program = LLMTextCompletionProgram.from_defaults(
    output_cls=PensionContributionRecord,
    llm=llm,
    prompt_template_str=(
        "Extract the pension contribution record from the document.\n"
        "Return only fields that match the schema.\n"
        "If a field is missing and optional, return null.\n"
        "Do not invent values.\n\n"
        "Document:\n{document_text}"
    ),
)

results = [program(document_text=doc.text) for doc in documents]
for record in results:
    print(record.model_dump())
```

That pattern works because the program validates the model's response against the Pydantic schema at the output boundary. For pension funds, that’s where you want strictness: before data hits finance systems or member records.
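Even with structured output, a badly scanned page can still produce text the model cannot map to the schema, in which case the program call raises a validation error. A small wrapper keeps one bad document from failing the whole batch (a sketch; `DummyProgram` and `MiniRecord` are stand-ins for the LLM-backed program and full schema above so the snippet runs offline):

```python
from pydantic import BaseModel, ValidationError

class MiniRecord(BaseModel):  # trimmed stand-in schema
    scheme_name: str
    total_employer_contributions: float

class DummyProgram:
    # Stand-in for the LLM-backed program: parses "name|total" strings.
    def __call__(self, *, document_text: str) -> MiniRecord:
        name, total = document_text.split("|")
        return MiniRecord(scheme_name=name, total_employer_contributions=float(total))

def extract_batch(program, texts):
    records, failures = [], []
    for text in texts:
        try:
            records.append(program(document_text=text))
        except (ValidationError, ValueError) as exc:
            # Route parse failures to review instead of crashing the run.
            failures.append({"input": text[:80], "error": type(exc).__name__})
    return records, failures

records, failures = extract_batch(DummyProgram(), ["Acme Plan|1800.75", "garbled scan"])
print(len(records), len(failures))  # 1 1
```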
4) Add validation and routing
After extraction, apply deterministic checks before anything is posted downstream.
```python
def validate_record(record: PensionContributionRecord) -> list[str]:
    errors = []
    if record.currency not in {"GBP", "EUR", "USD"}:
        errors.append("Unsupported currency")
    if record.total_employee_contributions == 0 and record.total_employer_contributions == 0:
        errors.append("Both contribution totals are zero")
    if not record.scheme_id:
        errors.append("Missing scheme_id")
    return errors

for record in results:
    issues = validate_record(record)
    if issues:
        print({"status": "review", "issues": issues, "record": record.model_dump()})
    else:
        print({"status": "approved", "record": record.model_dump()})
```
This split is important. Pension operations should never rely on LLM output alone when a simple business rule can catch bad data.
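One lightweight way to implement the human review queue from the architecture above is an append-only JSONL file (a sketch; a real deployment would use a case-management tool or database, and the file path here is illustrative):

```python
import json
from pathlib import Path

def enqueue_for_review(record_dict: dict, issues: list[str],
                       path: str = "review_queue.jsonl") -> None:
    # Append-only: nothing is overwritten, so the queue file doubles as
    # a simple trail of every extraction that needed human eyes.
    entry = {"status": "review", "issues": issues, "record": record_dict}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

enqueue_for_review({"scheme_id": None, "currency": "GBP"}, ["Missing scheme_id"])
last = json.loads(Path("review_queue.jsonl").read_text(encoding="utf-8").splitlines()[-1])
print(last["issues"])  # ['Missing scheme_id']
```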
Production Considerations
- **Data residency**
  - Keep document processing in-region if your fund has UK/EU residency requirements.
  - Use approved model endpoints and avoid sending raw member data to unvetted third-party services.
- **Auditability**
  - Store document hashes, extraction timestamps, model name/version, prompt version, and final JSON output.
  - If compliance asks why a field was extracted a certain way, you need reproducibility.
- **Access control**
  - Restrict who can upload documents, view extracted data, and trigger reprocessing.
  - Pension files often contain sensitive personal and financial information; treat them as regulated records.
- **Monitoring**
  - Track extraction failure rate, missing-field rate, manual-review rate, and drift by document type.
  - A sudden jump in review volume usually means a template changed upstream.
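The monitoring rates above are cheap to derive from routing decisions the pipeline already makes (a sketch; `outcomes` is an assumed per-document log of pipeline results):

```python
from collections import Counter

def review_rate(outcomes: list[str]) -> float:
    # One outcome per processed document: "approved", "review", or "failed".
    # A rising review share usually means an upstream template changed.
    counts = Counter(outcomes)
    total = sum(counts.values())
    return counts["review"] / total if total else 0.0

print(review_rate(["approved", "approved", "review", "failed"]))  # 0.25
```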
Common Pitfalls
- **Using free-form text output instead of structured schemas**
  - This leads to brittle parsing and silent failures.
  - Use `PydanticProgramExtractor` or `LLMTextCompletionProgram` with explicit types so invalid outputs fail fast.
- **Skipping metadata**
  - Without source system IDs, document hashes, and retention tags, you lose audit traceability.
  - Attach metadata at ingestion time and persist it alongside extracted records.
- **Trusting every extracted field equally**
  - The model may get names right but miss totals or dates on scanned PDFs.
  - Validate critical numeric fields against business rules and route exceptions to humans.
If you build this as a structured pipeline instead of a chat demo, you get something pension teams can actually use: repeatable extraction, clear controls, and an audit trail that survives compliance review.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.