How to Build a Document Extraction Agent Using LangChain in Python for Pension Funds
A document extraction agent for pension funds reads PDFs, scans, and forms, then turns them into structured records you can route into downstream systems. It matters because pension operations are full of high-volume, high-stakes documents: member applications, beneficiary updates, contribution schedules, transfer forms, and trustee packs. If extraction is wrong, you get bad member data, compliance issues, and painful manual rework.
Architecture
A production-grade extraction agent for pension funds needs these components:
- Document ingestion layer
  - Accepts PDFs, images, and scanned forms from S3, SharePoint, email drops, or internal portals.
  - Normalizes file metadata like source system, upload time, jurisdiction, and retention class.
- Text extraction layer
  - Uses OCR for scanned documents and PDF parsing for digital files.
  - Preserves page numbers and bounding context so extracted fields can be audited later.
- Schema-driven extractor
  - Converts unstructured text into a strict Python model.
  - Enforces pension-specific fields such as member ID, scheme name, contribution period, employer name, and beneficiary details.
- Validation and guardrail layer
  - Checks required fields, date formats, policy rules, and cross-field consistency.
  - Flags low-confidence outputs for human review instead of auto-posting them.
- Audit and persistence layer
  - Stores raw document references, extracted JSON, confidence scores, model version, and prompt version.
  - Keeps an immutable trail for compliance reviews and internal audits.
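Before the step-by-step build, here is a minimal sketch of how these layers compose. Every function in it is a stub with a hypothetical name, not a prescribed API; the implementation steps below fill in the extraction and validation pieces for real.

```python
from typing import Any


# Stub layer functions with hypothetical names; each would wrap real
# infrastructure (object storage, OCR, the LangChain chain) in production.
def ingest(path: str) -> dict[str, Any]:
    """Ingestion layer: fetch the file and normalize its metadata."""
    return {"path": path, "source": "portal", "jurisdiction": "UK"}


def extract_text(raw: dict[str, Any]) -> str:
    """Text extraction layer: OCR or PDF parsing, page markers kept."""
    return "[Page 0] Pension Transfer Form ..."


def extract_record(text: str) -> dict[str, Any]:
    """Schema-driven extractor: the LangChain chain built in step 3."""
    return {"document_type": "transfer_form", "member_id": None}


def validate(record: dict[str, Any]) -> list[str]:
    """Validation layer: required fields, formats, cross-field rules."""
    return [] if record.get("member_id") else ["member_id is required"]


def process_document(path: str) -> dict[str, Any]:
    raw = ingest(path)
    record = extract_record(extract_text(raw))
    errors = validate(record)
    # Audit layer: persist raw references, extracted JSON, and versions here.
    status = "review_required" if errors else "approved"
    return {"status": status, "record": record, "errors": errors}


print(process_document("data/pension_transfer_form.pdf"))
```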
Implementation
1) Define the pension document schema
Use Pydantic so the agent returns structured data instead of free-form text. For pension funds, keep the schema tight: you want deterministic extraction with explicit nulls when a field is missing.
```python
from typing import Optional

from pydantic import BaseModel, Field


class PensionDocument(BaseModel):
    document_type: str = Field(description="Type of pension document")
    scheme_name: Optional[str] = Field(default=None)
    member_id: Optional[str] = Field(default=None)
    member_name: Optional[str] = Field(default=None)
    employer_name: Optional[str] = Field(default=None)
    contribution_period: Optional[str] = Field(default=None)
    effective_date: Optional[str] = Field(default=None)
    beneficiary_name: Optional[str] = Field(default=None)
    notes: Optional[str] = Field(default=None)
```
2) Load the document with LangChain loaders
For PDFs stored locally or in object storage mounts, `PyPDFLoader` is a straightforward starting point. If you have scans, pair this with OCR upstream before LangChain sees the text.
```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/pension_transfer_form.pdf")
pages = loader.load()

full_text = "\n\n".join(
    f"[Page {doc.metadata.get('page', '?')}] {doc.page_content}"
    for doc in pages
)
print(full_text[:2000])
```
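For scanned forms, that upstream OCR step might look like the sketch below, using pdf2image and pytesseract. It assumes the Tesseract and Poppler binaries are installed locally, and the 60-point confidence threshold is an arbitrary placeholder you should tune against your own scans.

```python
import pytesseract
from pdf2image import convert_from_path

MIN_OCR_CONFIDENCE = 60.0  # placeholder threshold; tune on real scans


def ocr_pdf(path: str) -> tuple[str, float]:
    """OCR each page, returning (text with page markers, mean confidence)."""
    pages_text: list[str] = []
    confidences: list[float] = []
    for page_num, image in enumerate(convert_from_path(path, dpi=300)):
        data = pytesseract.image_to_data(
            image, output_type=pytesseract.Output.DICT
        )
        words = []
        for word, conf in zip(data["text"], data["conf"]):
            if word.strip() and float(conf) >= 0:  # conf -1 means non-text
                words.append(word)
                confidences.append(float(conf))
        pages_text.append(f"[Page {page_num}] " + " ".join(words))
    mean_conf = sum(confidences) / len(confidences) if confidences else 0.0
    return "\n\n".join(pages_text), mean_conf


full_text, ocr_confidence = ocr_pdf("data/scanned_transfer_form.pdf")
if ocr_confidence < MIN_OCR_CONFIDENCE:
    # Low-confidence scans go straight to manual review, not the LLM.
    print({"status": "manual_review", "ocr_confidence": ocr_confidence})
```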
3) Build a structured extraction chain with `ChatPromptTemplate` and `PydanticOutputParser`
This is the core pattern. The model gets instructions to extract only what is present in the document and return JSON that matches your schema.
```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from langchain_openai import ChatOpenAI

parser = PydanticOutputParser(pydantic_object=PensionDocument)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You extract structured data from pension fund documents. "
     "Return only fields supported by the schema. "
     "If a field is missing or unclear, use null."),
    ("human",
     "Extract data from this document:\n\n{document}\n\n{format_instructions}"),
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
chain = prompt | llm | parser

result = chain.invoke({
    "document": full_text,
    "format_instructions": parser.get_format_instructions(),
})
print(result.model_dump())
```
That pattern uses actual LangChain classes:
- `ChatPromptTemplate`
- `PydanticOutputParser`
- `ChatOpenAI`
- LCEL composition with `|`
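On recent LangChain and langchain-openai releases, an alternative to the parser-based path is `with_structured_output`, which pushes schema enforcement down to the provider's tool-calling or JSON mode. A minimal sketch of the same extraction:

```python
# Bind the Pydantic schema directly to the model; the return value is
# a PensionDocument instance rather than raw text to parse.
structured_llm = llm.with_structured_output(PensionDocument)

result = structured_llm.invoke(
    "Extract structured data from this pension document. "
    "Use null for any field that is missing or unclear.\n\n" + full_text
)
print(result.model_dump())
```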
4) Add validation before writing to your system of record
Pension workflows should not trust model output blindly. Validate required business rules before persisting anything into your admin platform or case management queue.
```python
from datetime import datetime


def validate_pension_doc(doc: PensionDocument) -> list[str]:
    errors = []
    if not doc.member_id:
        errors.append("member_id is required")
    if not doc.document_type:
        errors.append("document_type is required")
    if doc.effective_date:
        try:
            datetime.fromisoformat(doc.effective_date)
        except ValueError:
            errors.append("effective_date must be ISO format YYYY-MM-DD")
    return errors


errors = validate_pension_doc(result)
if errors:
    print({"status": "review_required", "errors": errors})
else:
    print({"status": "approved", "data": result.model_dump()})
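```

The cross-field consistency rules mentioned earlier depend entirely on your scheme's business requirements. As one hypothetical illustration that composes with `validate_pension_doc`:

```python
def validate_cross_fields(doc: PensionDocument) -> list[str]:
    """Illustrative cross-field rules; real rules come from scheme policy."""
    errors = []
    # Hypothetical rule: beneficiary updates must name a beneficiary.
    if doc.document_type == "beneficiary_update" and not doc.beneficiary_name:
        errors.append("beneficiary_name required for beneficiary updates")
    # Hypothetical rule: contribution schedules must name an employer.
    if doc.document_type == "contribution_schedule" and not doc.employer_name:
        errors.append("employer_name required for contribution schedules")
    return errors


errors = validate_pension_doc(result) + validate_cross_fields(result)
```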
Production Considerations
- Keep data residency explicit
  - Pension data often falls under regional storage rules.
  - Pin model endpoints and storage to approved jurisdictions.
  - Do not send documents to external services without legal approval and vendor review.
- Log everything needed for audit
  - Store source file hash, page count, extraction timestamp, prompt version, model name, and validation outcome (a sketch follows this list).
  - Keep raw text references separate from extracted records.
  - Auditors will ask how a specific field was derived months later.
- Use human-in-the-loop thresholds
  - Auto-approve only when required fields are present and validation passes.
  - Route low-confidence or conflicting outputs to operations staff.
  - This is especially important for beneficiary changes and transfer requests.
- Add role-based access controls
  - Member records include sensitive personal and financial data.
  - Restrict who can view raw documents versus extracted fields.
  - Separate extraction service permissions from downstream caseworker permissions.
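As one illustration of the audit-logging item, here is a hedged sketch of an audit record. The `ExtractionAuditRecord` class and its field names are hypothetical, and it reuses `pages`, `result`, and `errors` from the implementation steps; in production you would write this to append-only storage rather than print it.

```python
import hashlib
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ExtractionAuditRecord:
    source_file_hash: str
    page_count: int
    extracted_at: str
    prompt_version: str
    model_name: str
    validation_status: str
    extracted_json: str


def sha256_of_file(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


record = ExtractionAuditRecord(
    source_file_hash=sha256_of_file("data/pension_transfer_form.pdf"),
    page_count=len(pages),
    extracted_at=datetime.now(timezone.utc).isoformat(),
    prompt_version="pension-extract-v1",  # version your prompts explicitly
    model_name="gpt-4o-mini",
    validation_status="review_required" if errors else "approved",
    extracted_json=result.model_dump_json(),
)
print(asdict(record))
```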
Common Pitfalls
- Using free-form prompts without a schema
  - Result: inconsistent output that breaks downstream systems.
  - Fix: use `PydanticOutputParser` or another strict structured output path every time.
- Skipping OCR quality checks on scanned forms
  - Result: garbage in means garbage out.
  - Fix: detect low OCR confidence upstream and send those files directly to manual review before LLM extraction.
- Auto-posting extracted values into core pension systems
  - Result: bad member data becomes operational debt fast.
  - Fix: add validation gates for dates, IDs, mandatory fields, and cross-field consistency before write-back.
If you build this as a strict extraction pipeline instead of a chatty assistant, it will hold up in pension operations. The winning pattern is simple: parse the document, extract into a schema, validate hard rules, then persist with an audit trail.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit