How to Build a Document Extraction Agent Using LangGraph in Python for Pension Funds
A document extraction agent for pension funds takes incoming PDFs, scans, statements, benefit forms, contribution schedules, and regulatory filings, then extracts structured fields into systems that downstream teams can trust. It matters because pension operations are compliance-heavy, high-volume, and audit-sensitive: bad extraction means bad member records, delayed benefits, and avoidable regulatory risk.
Architecture
- Document ingestion layer
  - Accepts PDFs, scans, email attachments, and batch uploads.
  - Normalizes file metadata: source system, jurisdiction, fund ID, retention class.
- OCR and text normalization
  - Converts scanned pages into text.
  - Preserves page numbers, bounding hints, and confidence scores for auditability.
- Field extraction node
  - Uses an LLM or rules-backed parser to extract pension-specific fields:
    - member name
    - policy/member number
    - contribution amounts
    - employer name
    - effective dates
    - benefit type
  - Returns structured JSON only.
- Validation and policy checks
  - Verifies required fields, date formats, numeric ranges, and cross-field consistency.
  - Flags missing consent markers or jurisdiction-specific issues.
- Human review branch
  - Routes low-confidence or high-risk documents to an operations queue.
  - Keeps a full trace of model output and validation failures.
- Persistence and audit trail
  - Stores extracted data, source document hash, prompts, model version, and decision path.
  - Supports retention and data residency requirements.
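To make the validation-and-policy layer concrete, here is a minimal sketch of the kinds of checks it runs: required fields, numeric ranges, date formats, and a simple cross-field rule. The field names and thresholds are illustrative assumptions, not any specific fund's schema.

```python
from datetime import date


def check_contribution_record(record: dict) -> list[str]:
    """Return a list of policy violations for one extracted record."""
    errors = []
    # Required-field check.
    for field in ("member_name", "member_number", "contribution_amount"):
        if not record.get(field):
            errors.append(f"missing_{field}")
    # Numeric-range check: contributions must be positive and plausible.
    amount = record.get("contribution_amount")
    if amount is not None and not (0 < amount < 1_000_000):
        errors.append("contribution_out_of_range")
    # Date-format and cross-field check: effective date must be ISO and not in the future.
    raw = record.get("effective_date")
    if raw:
        try:
            if date.fromisoformat(raw) > date.today():
                errors.append("effective_date_in_future")
        except ValueError:
            errors.append("invalid_effective_date_format")
    return errors


print(check_contribution_record({"member_name": "Jane Doe"}))
# → ['missing_member_number', 'missing_contribution_amount']
```

Returning error codes rather than raising keeps the node composable: the router can count errors and decide between auto-posting and human review.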
Implementation
1) Define the state and graph nodes
LangGraph works best when the state is explicit. For pension documents, keep the raw text, extracted fields, validation errors, and a routing flag in the state so every decision is traceable.
```python
from typing import TypedDict, Annotated

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages


class DocState(TypedDict):
    document_text: str
    extracted: dict
    validation_errors: list[str]
    needs_review: bool
    messages: Annotated[list, add_messages]


def extract_fields(state: DocState) -> DocState:
    text = state["document_text"]
    # Replace this with your LLM call or rules engine.
    extracted = {
        "member_name": "Jane Doe",
        "member_number": "PF-10293",
        "contribution_amount": 1250.50,
        "effective_date": "2025-01-31",
    }
    return {
        **state,
        "extracted": extracted,
        "validation_errors": [],
        "needs_review": False,
    }


def validate_fields(state: DocState) -> DocState:
    errors = []
    extracted = state.get("extracted", {})
    if not extracted.get("member_name"):
        errors.append("missing_member_name")
    if not extracted.get("member_number"):
        errors.append("missing_member_number")
    return {
        **state,
        "validation_errors": errors,
        "needs_review": len(errors) > 0,
    }
```
2) Add a review branch with add_conditional_edges
This is where LangGraph becomes useful for production workflows. High-risk pension docs should not auto-post into core systems; they should route to review when validation fails or confidence is low.
```python
def route_after_validation(state: DocState) -> str:
    return "review" if state["needs_review"] else "complete"


def human_review(state: DocState) -> DocState:
    # In production this would create a task in your case management system.
    reviewed = dict(state["extracted"])
    reviewed["review_status"] = "queued"
    return {**state, "extracted": reviewed}


graph = StateGraph(DocState)
graph.add_node("extract_fields", extract_fields)
graph.add_node("validate_fields", validate_fields)
graph.add_node("human_review", human_review)

graph.add_edge(START, "extract_fields")
graph.add_edge("extract_fields", "validate_fields")
graph.add_conditional_edges(
    "validate_fields",
    route_after_validation,
    {
        "review": "human_review",
        "complete": END,
    },
)
graph.add_edge("human_review", END)

app = graph.compile()
```
3) Invoke the graph with document input
You can run the compiled graph synchronously for batch jobs or wrap it in an API for real-time intake. For pension funds, batch processing is common because many source systems still produce end-of-day files.
```python
result = app.invoke(
    {
        "document_text": """
Pension contribution schedule
Member: Jane Doe
Member No: PF-10293
Contribution: 1250.50
Effective Date: 2025-01-31
""",
        "extracted": {},
        "validation_errors": [],
        "needs_review": False,
        "messages": [],
    }
)

print(result["extracted"])
print(result["validation_errors"])
```
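For the batch case, the same compiled graph can be driven by a loop over an end-of-day drop directory. This is a sketch under assumptions: documents arrive as plain-text files (i.e. after OCR), the directory layout and the `*.txt` glob are placeholders for your intake convention, and `app` is the compiled graph from the previous step.

```python
from pathlib import Path


def process_batch(app, input_dir: str) -> list[dict]:
    """Run the compiled graph over every document file in a drop directory."""
    results = []
    for path in sorted(Path(input_dir).glob("*.txt")):
        result = app.invoke(
            {
                "document_text": path.read_text(encoding="utf-8"),
                "extracted": {},
                "validation_errors": [],
                "needs_review": False,
                "messages": [],
            }
        )
        # Keep the source filename alongside the graph output for traceability.
        results.append({"source_file": path.name, **result})
    return results
```

Sorting the paths makes batch runs deterministic, which matters when you later need to replay a day's file for an audit.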
4) Extend with an actual LLM extraction step
For real extraction quality you’ll usually call an LLM with a strict schema prompt. Keep the output constrained to JSON and validate it before anything reaches downstream pension admin systems.
```python
from pydantic import BaseModel


class PensionExtraction(BaseModel):
    member_name: str | None = None
    member_number: str | None = None
    contribution_amount: float | None = None
    effective_date: str | None = None


def llm_extract(state: DocState) -> DocState:
    text = state["document_text"]
    # Replace with your model call.
    parsed = PensionExtraction(
        member_name="Jane Doe",
        member_number="PF-10293",
        contribution_amount=1250.50,
        effective_date="2025-01-31",
    )
    return {
        **state,
        "extracted": parsed.model_dump(),
        "validation_errors": [],
        "needs_review": False,
    }
```
Production Considerations
- Data residency
  - Keep document storage and model inference inside the required region.
  - For pension funds operating across jurisdictions, separate EU/UK/AU workloads from US workloads.
- Audit trail
  - Persist the original document hash, extraction result, validation errors, model version, and graph path taken.
  - Auditors will ask why a record was auto-approved or routed to review.
- Guardrails
  - Reject outputs that are not valid JSON or fail schema checks.
  - Never let the agent infer missing legal identifiers like tax IDs or membership numbers without source evidence.
- Monitoring
  - Track field-level accuracy by document type.
  - Alert on spikes in review rate, OCR failures, or changes in extraction confidence after model updates.
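The audit-trail point can be made concrete with a small record builder: hash the original bytes, capture the model version and the graph path taken, and timestamp it in UTC. The field names here are an illustrative sketch, not a mandated schema.

```python
import hashlib
from datetime import datetime, timezone


def build_audit_record(
    document_bytes: bytes,
    extracted: dict,
    validation_errors: list[str],
    model_version: str,
    decision_path: list[str],
) -> dict:
    """Assemble one immutable audit record for a processed document."""
    return {
        # Hash of the source bytes proves which document produced this result.
        "document_sha256": hashlib.sha256(document_bytes).hexdigest(),
        "extracted": extracted,
        "validation_errors": validation_errors,
        "model_version": model_version,
        # e.g. ["extract_fields", "validate_fields", "human_review"]
        "decision_path": decision_path,
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }
```

Writing this record in the same transaction as the extracted data is what lets you answer "why was this auto-approved?" months later.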
Common Pitfalls
- Treating extraction as a single prompt-response call
  - That breaks quickly once you need validation and exception handling.
  - Use a LangGraph workflow so extraction, validation, routing, and review are separate nodes.
- Skipping schema enforcement
  - Free-form text output will leak into production systems sooner than you expect.
  - Parse into Pydantic models or JSON Schema before writing anything to the pension admin database.
- Ignoring operational context
  - A contribution schedule from one fund is not the same as a benefit claim form from another jurisdiction.
  - Build document-type routing early so each template gets its own extraction logic and controls.
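Document-type routing can start as a classifier node whose result keys a conditional edge, so each template gets its own extraction node. The keyword rules below are a crude placeholder for a real classifier or LLM call, and the node and type names are assumptions.

```python
def classify_document_type(text: str) -> str:
    """Crude keyword routing; replace with a trained classifier or LLM call."""
    lowered = text.lower()
    if "contribution schedule" in lowered:
        return "contribution_schedule"
    if "benefit claim" in lowered:
        return "benefit_claim"
    return "unknown"


# The result would key a conditional edge so each type gets its own extractor,
# with unknown types defaulting to human review, e.g.:
# graph.add_conditional_edges(
#     "classify",
#     lambda s: classify_document_type(s["document_text"]),
#     {
#         "contribution_schedule": "extract_contribution",
#         "benefit_claim": "extract_claim",
#         "unknown": "human_review",
#     },
# )
```

Routing unknown types to review by default keeps new templates from being silently mis-extracted.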
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit