How to Build a Document Extraction Agent Using LangGraph in Python for Retail Banking
A document extraction agent for retail banking takes inbound files like payslips, bank statements, IDs, proof of address, and loan forms, then routes them through classification, extraction, and policy checks. The goal is simple: reduce manual ops work while keeping compliance, auditability, and data handling tight enough for regulated workflows.
Architecture
- Ingress layer
  - Accepts PDFs, images, or scanned documents from a case management system or upload API.
  - Normalizes file metadata: customer ID, product type, jurisdiction, and retention policy (see the metadata sketch after this list).
- Document classifier
  - Identifies the document type before extraction.
  - Routes statements, IDs, utility bills, and payslips to different extraction prompts or parsers.
- Extraction node
  - Pulls structured fields into a typed schema.
  - For retail banking this usually means name, address, account number mask, employer name, income figures, dates, and document validity markers.
- Validation and policy node
  - Checks extracted values against business rules.
  - Enforces KYC/AML constraints, residency rules, minimum completeness thresholds, and confidence cutoffs.
- Human review escalation
  - Sends low-confidence or policy-failing cases to an ops queue.
  - Keeps the agent from auto-approving anything that should be manually reviewed.
- Audit logger
  - Persists inputs, outputs, confidence scores, decision path, and model version.
  - This matters for regulator queries and internal dispute handling.
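A minimal sketch of the normalized ingress metadata as a dataclass. The field names here are illustrative assumptions, not a fixed contract; align them with whatever your case management system actually emits:

```python
from dataclasses import dataclass


@dataclass
class FileMetadata:
    # Illustrative fields only -- match these to your case management system.
    customer_id: str       # internal customer reference, never raw PII
    product_type: str      # e.g. "mortgage" or "current_account"
    jurisdiction: str      # drives residency and retention handling, e.g. "EU"
    retention_policy: str  # retention rule identifier, e.g. "kyc_7y"
    source: str            # "upload_api" or "case_management"
```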
Implementation
1) Define the state and structured outputs
Use TypedDict for graph state and Pydantic for extraction output. In banking workflows you want typed fields because downstream systems should not consume free-form text.
```python
from typing import TypedDict, Optional

from pydantic import BaseModel, Field
from langchain_core.runnables import RunnableLambda
from langgraph.graph import StateGraph, START, END


class DocumentState(TypedDict):
    document_text: str
    doc_type: Optional[str]
    extracted: Optional[dict]
    approved: bool
    review_reason: Optional[str]


class BankDocument(BaseModel):
    full_name: str = Field(..., description="Customer full legal name")
    document_type: str = Field(..., description="ID card, bank statement, payslip")
    issue_date: Optional[str] = None
    expiry_date: Optional[str] = None
    address: Optional[str] = None
    account_last4: Optional[str] = None
```
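If you extract with an LLM, you can bind this schema directly to the model so the output is parsed and validated in one step. A minimal sketch, assuming `langchain-openai` is installed and `OPENAI_API_KEY` is set; the model name is just an example:

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Returns a runnable whose output is parsed into a BankDocument instance.
structured_llm = llm.with_structured_output(BankDocument)

doc = structured_llm.invoke(
    "Extract the fields from this document:\n\nACME Bank statement for Jane Doe..."
)
print(doc.full_name, doc.document_type)
```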
2) Build the graph nodes with real LangGraph patterns
This example uses StateGraph, add_node, add_edge, add_conditional_edges, compile, and invoke, and routes failed validations to a dedicated review node. The extraction node is a placeholder function; in production you would call your OCR + LLM pipeline there (a production-flavored sketch follows the block).
```python
from typing import Literal


def classify_document(state: DocumentState) -> dict:
    # Cheap keyword routing; swap in a real classifier for production.
    text = state["document_text"].lower()
    if "statement" in text:
        doc_type = "bank_statement"
    elif "payslip" in text or "salary" in text:
        doc_type = "payslip"
    elif "passport" in text or "driver license" in text:
        doc_type = "id_document"
    else:
        doc_type = "unknown"
    return {"doc_type": doc_type}


def extract_fields(state: DocumentState) -> dict:
    # Replace with OCR + LLM structured extraction.
    # Keep the output schema stable for downstream controls.
    extracted = {
        "full_name": "Jane Doe",
        "document_type": state["doc_type"] or "unknown",
        "issue_date": "2025-01-12",
        "expiry_date": None,
        "address": "12 Market Street",
        "account_last4": "4821",
    }
    return {"extracted": extracted}


def validate_document(state: DocumentState) -> dict:
    extracted = state.get("extracted") or {}
    required_fields = ["full_name", "document_type"]
    missing = [f for f in required_fields if not extracted.get(f)]
    if missing:
        return {"approved": False, "review_reason": f"Missing fields: {', '.join(missing)}"}
    if state["doc_type"] == "unknown":
        return {"approved": False, "review_reason": "Unsupported document type"}
    return {"approved": True, "review_reason": None}


def escalate_to_review(state: DocumentState) -> dict:
    # Hand the case to the manual ops queue here (ticket, webhook, etc.).
    return {}


def route_after_validation(state: DocumentState) -> Literal["approved", "review"]:
    return "approved" if state.get("approved") else "review"


graph = StateGraph(DocumentState)
graph.add_node("classify", RunnableLambda(classify_document))
graph.add_node("extract", RunnableLambda(extract_fields))
graph.add_node("validate", RunnableLambda(validate_document))
graph.add_node("review", RunnableLambda(escalate_to_review))

graph.add_edge(START, "classify")
graph.add_edge("classify", "extract")
graph.add_edge("extract", "validate")

# Map the router's return values to real targets: approved cases finish,
# everything else goes to the manual review node.
graph.add_conditional_edges(
    "validate",
    route_after_validation,
    {"approved": END, "review": "review"},
)
graph.add_edge("review", END)

app = graph.compile()

result = app.invoke({
    "document_text": "This bank statement shows salary credit...",
    "doc_type": None,
    "extracted": None,
    "approved": False,
    "review_reason": None,
})
print(result)
```
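When you replace the placeholder, the node body is typically OCR plus the structured LLM from step 1. A sketch under assumptions: `ocr_to_text` is a hypothetical helper wrapping your OCR engine, and `structured_llm` is the `with_structured_output` runnable shown earlier:

```python
def extract_fields(state: DocumentState) -> dict:
    # ocr_to_text is hypothetical -- wrap Tesseract, Textract, or your OCR vendor.
    text = state["document_text"] or ocr_to_text(state)
    doc = structured_llm.invoke(
        f"Extract the fields from this {state['doc_type']} document:\n\n{text}"
    )
    # model_dump() keeps the dict shape stable for the validation node.
    return {"extracted": doc.model_dump()}
```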
3) Add human review routing for low-confidence cases
Retail banking needs a hard stop when extraction quality is weak. Don’t guess on identity documents or income evidence; route to manual review when confidence is below threshold or fields conflict with policy.
A practical pattern is to extend the validation node with scoring:
```python
def validate_document(state: DocumentState) -> dict:
    extracted = state.get("extracted") or {}
    # The extraction step must write a "confidence" score for this gate to work;
    # a missing score defaults to 0.0 and routes straight to review.
    confidence = extracted.get("confidence", 0.0)
    if confidence < 0.85:
        return {"approved": False,
                "review_reason": f"Low confidence score: {confidence}"}
    if not extracted.get("full_name"):
        return {"approved": False,
                "review_reason": "Missing customer name"}
    return {"approved": True, "review_reason": None}
```
In production you can store the review reason alongside the source file hash so an auditor can trace exactly why the case was escalated.
4) Persist audit data outside the graph
LangGraph handles workflow orchestration; your compliance trail belongs in your storage layer. Log the document hash, customer reference, model version, timestamps of each node execution, and the final decision (a minimal sketch follows the table).
| Item | Why it matters |
|---|---|
| Source file hash | Proves which exact file was processed |
| Extraction output | Supports dispute resolution |
| Model/prompt version | Explains behavior changes over time |
| Review reason | Shows why human intervention happened |
| Jurisdiction tag | Helps enforce data residency rules |
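A minimal sketch of that record using only the standard library. The field names and the JSONL sink are illustrative assumptions; in production this would write to your bank's approved audit store:

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(file_bytes: bytes, customer_ref: str,
                 state: DocumentState, model_version: str) -> dict:
    return {
        "file_sha256": hashlib.sha256(file_bytes).hexdigest(),
        "customer_ref": customer_ref,
        "model_version": model_version,  # version prompts and schemas together
        "decision": "approved" if state["approved"] else "review",
        "review_reason": state["review_reason"],
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }


# Append-only JSONL as a stand-in for your real audit store.
with open("audit.jsonl", "a") as f:
    f.write(json.dumps(audit_record(b"...", "CUST-001", result, "extract-v3")) + "\n")
```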
Production Considerations
- Data residency
  - Keep OCR payloads and extracted PII inside the approved region.
  - If your bank operates across jurisdictions, route EU customer files to EU-hosted infrastructure only.
- Monitoring
  - Track extraction accuracy by document type.
  - Watch fallback-to-review rates; a sudden spike usually means OCR drift or prompt regression.
- Guardrails
  - Reject unsupported documents instead of trying to infer missing facts.
  - Mask sensitive values like account numbers before logs leave the secure boundary (see the masking sketch after this list).
- Deployment
  - Version prompts and schemas together.
  - Roll out changes behind feature flags so compliance teams can compare old vs new behavior on sampled traffic.
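A minimal masking sketch using the standard library. The pattern below only covers plain digit runs and is an assumption, not a complete PII scrubber:

```python
import re


def mask_account_numbers(text: str) -> str:
    # Replace runs of 6+ digits with a masked form, keeping the last 4
    # so ops can still match the record ("****4821").
    return re.sub(r"\d{2,}(\d{4})", r"****\1", text)


print(mask_account_numbers("Account 12345678 credited"))  # Account ****5678 credited
```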
Common Pitfalls
- Using untyped outputs
  - Free-form JSON from an LLM will break downstream systems sooner or later.
  - Use Pydantic models or strict post-processing so field names and formats stay stable (see the sketch after this list).
- Auto-approving low-confidence extractions
  - This is how bad KYC decisions get into production.
  - Set explicit thresholds and send uncertain cases to human review every time.
- Skipping audit context
  - If you don’t store input hashes, node decisions, and model versions, you cannot explain outcomes later.
  - In retail banking that becomes a compliance problem fast.
- Ignoring jurisdiction-specific handling
  - A single global bucket for all documents is a bad idea.
  - Separate storage paths by region and apply retention rules per market.
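A minimal sketch of the strict post-processing gate, reusing the BankDocument model from step 1; the failure path shown is an assumption about how you would wire it into the review queue:

```python
from pydantic import ValidationError


def parse_or_escalate(raw: dict) -> dict:
    # Gate raw LLM output before it enters the graph state.
    try:
        doc = BankDocument.model_validate(raw)  # rejects wrong types, missing fields
        return {"extracted": doc.model_dump()}
    except ValidationError as exc:
        return {"extracted": None,
                "review_reason": f"Schema validation failed: {exc.error_count()} errors"}
```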
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit