How to Build a Document Extraction Agent Using LangGraph in Python for Banking

By Cyprian Aarons | Updated 2026-04-21
Tags: document-extraction, langgraph, python, banking

A document extraction agent for banking takes inbound files such as PDFs, scans, and statements, extracts structured fields, validates them, and routes the result into downstream systems. It matters because banks depend on throughput, accuracy, auditability, and compliance: if your extraction flow is brittle, every manual review becomes a cost center and every missed field becomes an operational risk.

Architecture

  • Input normalization

    • Accept PDF, image, or text documents.
    • Convert raw input into a consistent payload before extraction.
  • Extraction node

    • Call an OCR or document parsing model.
    • Pull out fields like customer name, account number, IBAN, statement date, transaction totals, and signatures.
  • Validation node

    • Enforce schema checks with pydantic.
    • Reject malformed outputs early instead of pushing bad data downstream.
  • Routing node

    • Decide whether the document is auto-approved, sent to manual review, or escalated for compliance.
    • This is where confidence thresholds and policy rules live.
  • Audit logging

    • Persist inputs, extracted outputs, validation results, and route decisions.
    • Banks need traceability for internal audit and regulator review.
  • State management

    • Keep the extraction result in a typed state object across nodes.
    • LangGraph’s StateGraph is a good fit because each step updates shared state explicitly.
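As a rough illustration of the input normalization step, the sketch below wraps raw bytes in a consistent payload before any extraction happens. The `NormalizedInput` shape, the extension mapping, and the `normalize_input` helper are assumptions made for this example, not part of LangGraph or of any banking standard.

```python
import hashlib
import os
from dataclasses import dataclass


# Hypothetical payload shape; the field names are assumptions for illustration.
@dataclass(frozen=True)
class NormalizedInput:
    source_type: str  # "pdf", "image", or "text"
    content: bytes    # raw bytes, left untouched
    sha256: str       # content hash, handy later for audit records


# Assumed mapping from file extension to a coarse source type.
_EXTENSION_TYPES = {
    ".pdf": "pdf",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".tiff": "image",
    ".txt": "text",
}


def normalize_input(filename: str, content: bytes) -> NormalizedInput:
    """Wrap raw bytes in a consistent payload, rejecting unsupported input."""
    extension = os.path.splitext(filename)[1].lower()
    source_type = _EXTENSION_TYPES.get(extension)
    if source_type is None:
        raise ValueError(f"Unsupported document type: {filename!r}")
    if not content:
        raise ValueError("Empty document")
    digest = hashlib.sha256(content).hexdigest()
    return NormalizedInput(source_type, content, digest)
```

Rejecting unsupported or empty input here means the extraction node only ever sees payloads it knows how to handle.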

Implementation

1. Define the state and schema

Use a typed state object for the graph and a strict schema for extracted fields. For banking workflows, don’t accept free-form JSON from the model without validation.

from typing import TypedDict, Optional
from pydantic import BaseModel, Field
from langgraph.graph import StateGraph, END

class ExtractedDocument(BaseModel):
    customer_name: str = Field(..., min_length=1)
    account_number: str = Field(..., min_length=6)
    document_type: str
    statement_date: Optional[str] = None
    total_amount: Optional[float] = None
    confidence: float = Field(..., ge=0.0, le=1.0)

class DocState(TypedDict):
    raw_text: str
    extracted: dict
    validated: bool
    route: str

2. Build extraction and validation nodes

In production you would call OCR or a document model here. The important pattern is that each node returns only the fields it owns. That keeps the graph debuggable and easy to audit.

def extract_fields(state: DocState):
    # Text that a real implementation would pass to the OCR/document model.
    text = state["raw_text"]

    # Hardcoded stand-in for OCR/model output; this stub does not read `text`.
    result = ExtractedDocument(
        customer_name="Jane Doe",
        account_number="1234567890",
        document_type="bank_statement",
        statement_date="2024-12-31",
        total_amount=1520.75,
        confidence=0.96,
    )

    return {"extracted": result.model_dump()}

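from pydantic import ValidationError

def validate_fields(state: DocState):
    try:
        # model_validate raises ValidationError for any payload that breaks the schema.
        ExtractedDocument.model_validate(state["extracted"])
        return {"validated": True}
    except ValidationError:
        # Catch only schema failures so unrelated bugs are not silently swallowed.
        return {"validated": False}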
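Because each node is a plain function that returns a partial update, nodes can be exercised by hand before any graph is compiled. The sketch below uses stand-in nodes and a hypothetical `run_node` helper. The merge it performs only approximates LangGraph's default behaviour for state keys that have no reducer, where a returned value overwrites the previous one.

```python
from typing import Callable

State = dict  # stand-in for the article's DocState TypedDict


def run_node(state: State, node: Callable[[State], dict]) -> State:
    """Apply one node by hand: merge its partial update over the current state."""
    return {**state, **node(state)}


# Hypothetical stand-in nodes; each returns only the key it owns.
def fake_extract(state: State) -> dict:
    return {"extracted": {"customer_name": "Jane Doe", "confidence": 0.96}}


def fake_validate(state: State) -> dict:
    return {"validated": bool(state["extracted"].get("customer_name"))}


state = {"raw_text": "Sample text", "extracted": {}, "validated": False, "route": ""}
state = run_node(state, fake_extract)
state = run_node(state, fake_validate)
```

Keys that a node does not return, such as `raw_text` here, are carried through unchanged, which is what makes per-node unit tests straightforward.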

3. Add routing logic for auto-process vs manual review

Banks need deterministic routing. If confidence is low or validation fails, send the item to review rather than trying to be clever.

def route_document(state: DocState):
    extracted = state["extracted"]
    confidence = extracted.get("confidence", 0.0)

    if not state["validated"]:
        return {"route": "manual_review"}

    if confidence < 0.90:
        return {"route": "manual_review"}

    return {"route": "auto_approved"}
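One way to keep policy rules readable is to hold thresholds in a table keyed by document type. The sketch below is an illustration only: the threshold values, the `kyc_id` document type, and the `decide_route` helper are assumptions, and real values would come from your risk policy.

```python
# Hypothetical thresholds; real values belong to policy, not to code defaults.
CONFIDENCE_THRESHOLDS = {
    "bank_statement": 0.90,
    "kyc_id": 0.97,
}
DEFAULT_THRESHOLD = 0.95  # assumed stricter fallback for unlisted document types


def decide_route(validated: bool, document_type: str, confidence: float) -> str:
    """Deterministic routing: the same inputs always give the same route."""
    if not validated:
        return "manual_review"
    threshold = CONFIDENCE_THRESHOLDS.get(document_type, DEFAULT_THRESHOLD)
    return "auto_approved" if confidence >= threshold else "manual_review"
```

A pure function like this is easy to cover with table-driven tests, and the threshold table can be reviewed by compliance without reading the control flow.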

4. Compile the LangGraph workflow

This is the actual LangGraph pattern: create a StateGraph, add nodes, set edges, compile it, then invoke it with state.

from langgraph.graph import StateGraph, END

workflow = StateGraph(DocState)

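# LangGraph rejects node names that match a state key, so the routing node
# is registered as "decide_route" rather than "route".
workflow.add_node("extract", extract_fields)
workflow.add_node("validate", validate_fields)
workflow.add_node("decide_route", route_document)

workflow.set_entry_point("extract")
workflow.add_edge("extract", "validate")
workflow.add_edge("validate", "decide_route")
workflow.add_edge("decide_route", END)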

app = workflow.compile()

result = app.invoke({
    "raw_text": "Sample bank statement text...",
    "extracted": {},
    "validated": False,
    "route": ""
})

print(result)

If you want branching instead of linear flow, use add_conditional_edges. That’s useful when you want manual review to trigger a separate enrichment path or case creation step.
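A trimmed sketch of that branching pattern follows. It keeps only the routing step and adds a hypothetical `create_case` node, so the extraction and validation nodes are assumed to have run already. The node names are chosen to differ from the state keys, and the LangGraph import sits inside the builder so the helper functions can be tested without the library installed.

```python
from typing import TypedDict


class DocState(TypedDict):
    raw_text: str
    extracted: dict
    validated: bool
    route: str


def decide_route(state: DocState) -> dict:
    confident = state["extracted"].get("confidence", 0.0) >= 0.90
    ok = state["validated"] and confident
    return {"route": "auto_approved" if ok else "manual_review"}


def create_case(state: DocState) -> dict:
    # Hypothetical node: a real one would open a ticket in a case-management system.
    return {"route": "manual_review_case_opened"}


def pick_next(state: DocState) -> str:
    """Selector for add_conditional_edges; returns a key of the path map below."""
    return state["route"]


def build_branching_graph():
    # Imported lazily so the functions above stay importable without LangGraph.
    from langgraph.graph import StateGraph, END

    workflow = StateGraph(DocState)
    workflow.add_node("decide_route", decide_route)
    workflow.add_node("create_case", create_case)
    workflow.set_entry_point("decide_route")
    # Map the selector's return value to the next node (or END).
    workflow.add_conditional_edges(
        "decide_route",
        pick_next,
        {"manual_review": "create_case", "auto_approved": END},
    )
    workflow.add_edge("create_case", END)
    return workflow.compile()
```

Calling `build_branching_graph().invoke(...)` with a state that already contains `extracted` and `validated` should send low-confidence documents through `create_case` and let the rest end immediately.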

Production Considerations

  • Audit trails

    • Store the raw input hash, extracted payload, validation outcome, graph route decision, and model version.
    • In banking audits, you need to explain why a record was auto-approved or flagged.
  • Data residency

    • Keep OCR/model processing inside approved regions.
    • If documents contain PII or account data, do not ship them to external endpoints without legal and security approval.
  • Monitoring

    • Track extraction accuracy by document type.
    • Watch rejection rates, manual review rates, latency per node, and drift in field-level confidence.
  • Guardrails

    • Enforce schema validation with pydantic.
    • Redact account numbers in logs.
    • Add policy checks for sanctioned entities, suspicious transaction references, and unsupported document types.
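The audit-trail and redaction points above can be combined into a log-safe record. This is a minimal sketch under stated assumptions: the record layout, the `build_audit_record` and `redact_account_number` helpers, and the choice to keep the last four characters visible are all illustrative, and a system of record may need to keep the unredacted payload in a separately secured store.

```python
import hashlib
import json
from datetime import datetime, timezone


def redact_account_number(value: str, visible: int = 4) -> str:
    """Mask all but the last `visible` characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def build_audit_record(
    raw_text: str,
    extracted: dict,
    validated: bool,
    route: str,
    model_version: str,
) -> dict:
    """Assemble a log-safe record: input hash instead of raw input, masked account number."""
    safe = dict(extracted)  # shallow copy so the caller's payload is not modified
    if "account_number" in safe:
        safe["account_number"] = redact_account_number(str(safe["account_number"]))
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_sha256": hashlib.sha256(raw_text.encode("utf-8")).hexdigest(),
        "extracted": safe,
        "validated": validated,
        "route": route,
        "model_version": model_version,
    }


record = build_audit_record(
    "Sample bank statement text...",
    {"account_number": "1234567890", "confidence": 0.96},
    True,
    "auto_approved",
    "ocr-model-v1",  # hypothetical version label
)
log_line = json.dumps(record, sort_keys=True)
```

Hashing the input rather than storing it in the log keeps the record traceable to a specific document without copying PII into log storage.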

Common Pitfalls

  • Using untyped state everywhere

    • If every node passes around arbitrary dictionaries with no contract, debugging becomes painful fast.
    • Use a typed state plus strict output schemas so failures surface at the node boundary.
  • Trusting model output without validation

    • LLMs will occasionally invent fields or format dates incorrectly.
    • Always validate extracted values before routing them into core banking systems.
  • Skipping manual-review paths

    • A lot of teams build only the happy path.
    • In banking you need explicit fallback routes for low-confidence documents, unreadable scans, missing signatures, and compliance exceptions.
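To make the validation pitfall concrete, the sketch below checks an extracted payload for invented fields and non-ISO dates using only the standard library. It is a plain-Python stand-in rather than the pydantic approach used earlier, and the `check_extracted` helper is hypothetical. In pydantic, typing the field as `datetime.date` and setting `extra="forbid"` in the model config should give comparable behaviour.

```python
from datetime import date

# Field names taken from the ExtractedDocument schema defined earlier.
ALLOWED_FIELDS = {
    "customer_name",
    "account_number",
    "document_type",
    "statement_date",
    "total_amount",
    "confidence",
}


def check_extracted(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []

    # Fields the schema does not know about are treated as model inventions.
    unknown = sorted(set(payload) - ALLOWED_FIELDS)
    if unknown:
        problems.append(f"unexpected fields: {unknown}")

    # Dates must parse as ISO 8601 calendar dates, for example 2024-12-31.
    raw_date = payload.get("statement_date")
    if raw_date is not None:
        try:
            date.fromisoformat(raw_date)
        except (TypeError, ValueError):
            problems.append(f"statement_date is not an ISO date: {raw_date!r}")

    return problems
```

Any non-empty result is a reason to route the document to manual review instead of passing it downstream.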

By Cyprian Aarons, AI Consultant at Topiax.
