How to Build a Document Extraction Agent Using LangGraph in Python for Banking
A document extraction agent for banking takes inbound files like PDFs, scans, and statements, extracts structured fields, validates them, and routes the result into downstream systems. It matters because banks live and die on throughput, accuracy, auditability, and compliance — if your extraction flow is brittle, every manual review becomes a cost center and every missed field becomes an operational risk.
Architecture
- Input normalization
  - Accept PDF, image, or text documents.
  - Convert raw input into a consistent payload before extraction.
- Extraction node
  - Call an OCR or document parsing model.
  - Pull out fields like customer name, account number, IBAN, statement date, transaction totals, and signatures.
- Validation node
  - Enforce schema checks with `pydantic`.
  - Reject malformed outputs early instead of pushing bad data downstream.
- Routing node
  - Decide whether the document is auto-approved, sent to manual review, or escalated for compliance.
  - This is where confidence thresholds and policy rules live.
- Audit logging
  - Persist inputs, extracted outputs, validation results, and route decisions.
  - Banks need traceability for internal audit and regulator review.
- State management
  - Keep the extraction result in a typed state object across nodes.
  - LangGraph's `StateGraph` is a good fit because each step updates shared state explicitly.
Implementation
1. Define the state and schema
Use a typed state object for the graph and a strict schema for extracted fields. For banking workflows, don’t accept free-form JSON from the model without validation.
```python
from typing import TypedDict, Optional

from pydantic import BaseModel, Field


class ExtractedDocument(BaseModel):
    customer_name: str = Field(..., min_length=1)
    account_number: str = Field(..., min_length=6)
    document_type: str
    statement_date: Optional[str] = None
    total_amount: Optional[float] = None
    confidence: float = Field(..., ge=0.0, le=1.0)


class DocState(TypedDict):
    raw_text: str
    extracted: dict
    validated: bool
    route: str
```
2. Build extraction and validation nodes
In production you would call OCR or a document model here. The important pattern is that each node returns only the fields it owns. That keeps the graph debuggable and easy to audit.
```python
from pydantic import ValidationError


def extract_fields(state: DocState):
    text = state["raw_text"]
    # Replace this with OCR/model output in real use; `text` would feed the parser.
    result = ExtractedDocument(
        customer_name="Jane Doe",
        account_number="1234567890",
        document_type="bank_statement",
        statement_date="2024-12-31",
        total_amount=1520.75,
        confidence=0.96,
    )
    return {"extracted": result.model_dump()}


def validate_fields(state: DocState):
    try:
        ExtractedDocument(**state["extracted"])
        return {"validated": True}
    except ValidationError:
        return {"validated": False}
```
3. Add routing logic for auto-process vs manual review
Banks need deterministic routing. If confidence is low or validation fails, send the item to review rather than trying to be clever.
```python
def route_document(state: DocState):
    extracted = state["extracted"]
    confidence = extracted.get("confidence", 0.0)
    if not state["validated"]:
        return {"route": "manual_review"}
    if confidence < 0.90:
        return {"route": "manual_review"}
    return {"route": "auto_approved"}
```
4. Compile the LangGraph workflow
This is the actual LangGraph pattern: create a StateGraph, add nodes, set edges, compile it, then invoke it with state.
```python
from langgraph.graph import StateGraph, END

workflow = StateGraph(DocState)
workflow.add_node("extract", extract_fields)
workflow.add_node("validate", validate_fields)
workflow.add_node("route", route_document)

workflow.set_entry_point("extract")
workflow.add_edge("extract", "validate")
workflow.add_edge("validate", "route")
workflow.add_edge("route", END)

app = workflow.compile()

result = app.invoke({
    "raw_text": "Sample bank statement text...",
    "extracted": {},
    "validated": False,
    "route": "",
})
print(result)
```
If you want branching instead of linear flow, use `add_conditional_edges`. That's useful when you want manual review to trigger a separate enrichment path or case creation step.
Production Considerations
- Audit trails
  - Store the raw input hash, extracted payload, validation outcome, graph route decision, and model version.
  - In banking audits, you need to explain why a record was auto-approved or flagged.
- Data residency
  - Keep OCR/model processing inside approved regions.
  - If documents contain PII or account data, do not ship them to external endpoints without legal and security approval.
- Monitoring
  - Track extraction accuracy by document type.
  - Watch rejection rates, manual review rates, latency per node, and drift in field-level confidence.
- Guardrails
  - Enforce schema validation with `pydantic`.
  - Redact account numbers in logs.
  - Add policy checks for sanctioned entities, suspicious transaction references, and unsupported document types.
Common Pitfalls
- Using untyped state everywhere
  - If every node passes around arbitrary dictionaries with no contract, debugging becomes painful fast.
  - Use a typed state plus strict output schemas so failures surface at the node boundary.
- Trusting model output without validation
  - LLMs will occasionally invent fields or format dates incorrectly.
  - Always validate extracted values before routing them into core banking systems.
- Skipping manual-review paths
  - A lot of teams build only the happy path.
  - In banking you need explicit fallback routes for low-confidence documents, unreadable scans, missing signatures, and compliance exceptions.
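To make the "trust nothing without validation" pitfall concrete, here is a small sketch using a trimmed copy of the step 1 schema. The `bad_payload` is an invented example of the kind of plausible-looking output a model can produce: a renamed field, so a required one goes missing, and a confidence outside the allowed range.

```python
from pydantic import BaseModel, Field, ValidationError


class ExtractedDocument(BaseModel):
    customer_name: str = Field(..., min_length=1)
    account_number: str = Field(..., min_length=6)
    document_type: str
    confidence: float = Field(..., ge=0.0, le=1.0)


# Hypothetical broken model output: the invented "acct_no" key means
# account_number is missing, and confidence is outside [0, 1].
bad_payload = {
    "customer_name": "Jane Doe",
    "acct_no": "123",
    "document_type": "bank_statement",
    "confidence": 1.7,
}

failed_fields = []
try:
    ExtractedDocument(**bad_payload)
    route = "auto_approved"
except ValidationError as err:
    # Each failed field shows up as a separate error entry.
    route = "manual_review"
    failed_fields = [e["loc"][0] for e in err.errors()]

print(route, failed_fields)
```

Both failures surface at the node boundary, and the document is forced onto the manual-review route instead of flowing into a core banking system.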
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.