How to Build a Document Extraction Agent Using LangGraph in Python for Pension Funds

By Cyprian Aarons · Updated 2026-04-21
Tags: document-extraction, langgraph, python, pension-funds

A document extraction agent for pension funds takes incoming PDFs, scans, statements, benefit forms, contribution schedules, and regulatory filings, and extracts structured fields that downstream teams and systems can trust. This matters because pension operations are compliance-heavy, high-volume, and audit-sensitive: bad extraction means bad member records, delayed benefits, and avoidable regulatory risk.

Architecture

  • Document ingestion layer

    • Accepts PDFs, scans, email attachments, and batch uploads.
    • Normalizes file metadata: source system, jurisdiction, fund ID, retention class (a small metadata sketch follows this list).
  • OCR and text normalization

    • Converts scanned pages into text.
    • Preserves page numbers, bounding-box hints, and confidence scores for auditability.
  • Field extraction node

    • Uses an LLM or rules-backed parser to extract pension-specific fields:
      • member name
      • policy/member number
      • contribution amounts
      • employer name
      • effective dates
      • benefit type
    • Returns structured JSON only.
  • Validation and policy checks

    • Verifies required fields, date formats, numeric ranges, and cross-field consistency.
    • Flags missing consent markers or jurisdiction-specific issues.
  • Human review branch

    • Routes low-confidence or high-risk documents to an operations queue.
    • Keeps a full trace of model output and validation failures.
  • Persistence and audit trail

    • Stores extracted data, source document hash, prompts, model version, and decision path.
    • Supports retention and data residency requirements.
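
Below is a minimal sketch of the normalized metadata the ingestion layer can attach to each file, assuming a simple dataclass and SHA-256 hashing of the raw bytes; the field names and example values are illustrative, not a prescribed schema.

import hashlib
from dataclasses import dataclass

@dataclass
class DocumentMetadata:
    source_system: str    # e.g. "payroll_sftp" or "member_portal" (illustrative)
    jurisdiction: str     # drives data residency and validation rules
    fund_id: str
    retention_class: str  # maps to the fund's retention schedule
    sha256: str           # hash of the raw file, reused later in the audit trail

def normalize_metadata(raw_bytes: bytes, source_system: str, jurisdiction: str,
                       fund_id: str, retention_class: str) -> DocumentMetadata:
    # Hash the original bytes before any OCR or conversion touches the file.
    return DocumentMetadata(
        source_system=source_system,
        jurisdiction=jurisdiction,
        fund_id=fund_id,
        retention_class=retention_class,
        sha256=hashlib.sha256(raw_bytes).hexdigest(),
    )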

Implementation

1) Define the state and graph nodes

LangGraph works best when the state is explicit. For pension documents, keep the raw text, extracted fields, validation errors, and a routing flag in the state so every decision is traceable.

from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages

class DocState(TypedDict):
    document_text: str
    extracted: dict
    validation_errors: list[str]
    needs_review: bool
    messages: Annotated[list, add_messages]

def extract_fields(state: DocState) -> DocState:
    text = state["document_text"]

    # Replace this with your LLM call or rules engine.
    extracted = {
        "member_name": "Jane Doe",
        "member_number": "PF-10293",
        "contribution_amount": 1250.50,
        "effective_date": "2025-01-31",
    }

    return {
        **state,
        "extracted": extracted,
        "validation_errors": [],
        "needs_review": False,
    }

def validate_fields(state: DocState) -> DocState:
    errors = []
    extracted = state.get("extracted", {})

    if not extracted.get("member_name"):
        errors.append("missing_member_name")
    if not extracted.get("member_number"):
        errors.append("missing_member_number")

    needs_review = len(errors) > 0

    return {
        **state,
        "validation_errors": errors,
        "needs_review": needs_review,
    }
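
validate_fields above only checks that key fields are present. The architecture also calls for date-format and numeric-range checks; here is a sketch of such additions, assuming ISO-8601 dates and an illustrative contribution ceiling.

from datetime import date

def check_formats_and_ranges(extracted: dict) -> list[str]:
    errors: list[str] = []

    # Effective dates are expected as ISO-8601 strings (YYYY-MM-DD).
    effective_date = extracted.get("effective_date")
    if effective_date:
        try:
            date.fromisoformat(effective_date)
        except ValueError:
            errors.append("invalid_effective_date")

    # The upper bound is illustrative; use the fund's real contribution limits.
    amount = extracted.get("contribution_amount")
    if amount is not None and not (0 < float(amount) < 1_000_000):
        errors.append("contribution_amount_out_of_range")

    return errors

Extend the errors list inside validate_fields with these results before needs_review is computed.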

2) Add a review branch with add_conditional_edges

This is where LangGraph becomes useful for production workflows. High-risk pension docs should not auto-post into core systems; they should route to review when validation fails or confidence is low.

def route_after_validation(state: DocState) -> str:
    return "review" if state["needs_review"] else "complete"

def human_review(state: DocState) -> DocState:
    # In production this would create a task in your case management system.
    reviewed = dict(state["extracted"])
    reviewed["review_status"] = "queued"
    return {**state, "extracted": reviewed}

graph = StateGraph(DocState)
graph.add_node("extract_fields", extract_fields)
graph.add_node("validate_fields", validate_fields)
graph.add_node("human_review", human_review)

graph.add_edge(START, "extract_fields")
graph.add_edge("extract_fields", "validate_fields")
graph.add_conditional_edges(
    "validate_fields",
    route_after_validation,
    {
        "review": "human_review",
        "complete": END,
    },
)
graph.add_edge("human_review", END)

app = graph.compile()

3) Invoke the graph with document input

You can run the compiled graph synchronously for batch jobs or wrap it in an API for real-time intake. For pension funds, batch processing is common because many source systems still produce end-of-day files.

result = app.invoke(
    {
        "document_text": """
            Pension contribution schedule
            Member: Jane Doe
            Member No: PF-10293
            Contribution: 1250.50
            Effective Date: 2025-01-31
        """,
        "extracted": {},
        "validation_errors": [],
        "needs_review": False,
        "messages": [],
    }
)

print(result["extracted"])
print(result["validation_errors"])

4) Extend with an actual LLM extraction step

For real extraction quality you’ll usually call an LLM with a strict schema prompt. Keep the output constrained to JSON and validate it before anything reaches downstream pension admin systems.

from pydantic import BaseModel

class PensionExtraction(BaseModel):
    member_name: str | None = None
    member_number: str | None = None
    contribution_amount: float | None = None
    effective_date: str | None = None

def llm_extract(state: DocState) -> DocState:
    text = state["document_text"]

    # Replace with your model call.
    parsed = PensionExtraction(
        member_name="Jane Doe",
        member_number="PF-10293",
        contribution_amount=1250.50,
        effective_date="2025-01-31",
    )

    return {
        **state,
        "extracted": parsed.model_dump(),
        "validation_errors": [],
        "needs_review": False,
    }
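
One common way to wire in the real model call is a chat model with structured output bound to the Pydantic schema. This sketch assumes the langchain-openai package is installed and an API key is configured; the model name is only an example.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm = llm.with_structured_output(PensionExtraction)

def llm_extract_structured(state: DocState) -> DocState:
    prompt = (
        "Extract the pension fields defined by the schema from the document below. "
        "Only use values that appear in the text; leave missing fields null.\n\n"
        + state["document_text"]
    )
    # Returns a PensionExtraction instance, so the schema is enforced before persistence.
    parsed = structured_llm.invoke(prompt)

    return {
        **state,
        "extracted": parsed.model_dump(),
        "validation_errors": [],
        "needs_review": False,
    }

Swap this node in for extract_fields when building the graph; the validation and review branches stay unchanged.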

Production Considerations

  • Data residency

    • Keep document storage and model inference inside the required region.
    • For pension funds operating across jurisdictions, separate EU/UK/AU workloads from US workloads.
  • Audit trail

    • Persist the original document hash, extraction result, validation errors, model version, and graph path taken (a minimal record sketch follows this list).
    • Auditors will ask why a record was auto-approved or routed to review.
  • Guardrails

    • Reject outputs that are not valid JSON or fail schema checks.
    • Never let the agent infer missing legal identifiers like tax IDs or membership numbers without source evidence.
  • Monitoring

    • Track field-level accuracy by document type.
    • Alert on spikes in review rate, OCR failures, or changes in extraction confidence after model updates.
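
A minimal audit-record sketch for the points above, assuming you keep the raw file bytes, the final graph state, the model version string, and the list of nodes actually visited; the field names are illustrative.

import hashlib
import json
from datetime import datetime, timezone

def build_audit_record(raw_bytes: bytes, state: DocState,
                       model_version: str, node_path: list[str]) -> str:
    # DocState is the graph state defined in the implementation section.
    record = {
        "document_sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "extracted": state["extracted"],
        "validation_errors": state["validation_errors"],
        "needs_review": state["needs_review"],
        "model_version": model_version,
        "node_path": node_path,  # e.g. ["extract_fields", "validate_fields", "human_review"]
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, sort_keys=True)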

Common Pitfalls

  1. Treating extraction as a single prompt-response call

    • That breaks quickly once you need validation and exception handling.
    • Use a LangGraph workflow so extraction, validation, routing, and review are separate nodes.
  2. Skipping schema enforcement

    • Free-form text output will leak into production systems sooner than you expect.
    • Parse into Pydantic models or JSON Schema before writing anything to the pension admin database.
  3. Ignoring operational context

    • A contribution schedule from one fund is not the same as a benefit claim form from another jurisdiction.
    • Build document-type routing early so each template gets its own extraction logic and controls, as in the sketch below.
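
A minimal sketch of that routing, reusing DocState, extract_fields, and human_review from earlier; the keyword classifier and the two document types are placeholders for a real classifier and per-type extractors.

def classify_document(state: DocState) -> DocState:
    text = state["document_text"].lower()
    if "contribution" in text:
        doc_type = "contribution_schedule"
    elif "benefit" in text and "claim" in text:
        doc_type = "benefit_claim"
    else:
        doc_type = "unknown"
    # Store the type inside the extracted dict so it stays within the state schema.
    return {**state, "extracted": {**state.get("extracted", {}), "document_type": doc_type}}

def route_by_type(state: DocState) -> str:
    return state["extracted"].get("document_type", "unknown")

typed_graph = StateGraph(DocState)
typed_graph.add_node("classify_document", classify_document)
typed_graph.add_node("extract_contribution", extract_fields)      # swap in per-type extractors
typed_graph.add_node("extract_benefit_claim", extract_fields)
typed_graph.add_node("human_review", human_review)

typed_graph.add_edge(START, "classify_document")
typed_graph.add_conditional_edges(
    "classify_document",
    route_by_type,
    {
        "contribution_schedule": "extract_contribution",
        "benefit_claim": "extract_benefit_claim",
        "unknown": "human_review",  # unclassified documents go straight to review
    },
)
typed_graph.add_edge("extract_contribution", END)
typed_graph.add_edge("extract_benefit_claim", END)
typed_graph.add_edge("human_review", END)

typed_app = typed_graph.compile()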

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
