How to Build a Document Extraction Agent Using LangGraph in Python for Lending

By Cyprian Aarons | Updated 2026-04-21
Tags: document-extraction, langgraph, python, lending

A document extraction agent for lending reads borrower documents, pulls out the fields your underwriting flow needs, and returns structured data with confidence and traceability. In practice, that means fewer manual reviews, faster application turnaround, and a cleaner audit trail when compliance asks how a decision was made.

Architecture

  • Document intake layer

    • Accepts PDFs, scans, and images from the loan application pipeline.
    • Stores the raw file in a controlled bucket with retention rules.
  • OCR and text normalization

    • Converts scanned pages into text.
    • Preserves page numbers and bounding context for later audit.
  • Extraction node

    • Uses an LLM or rules-based parser to extract lending fields like:
      • borrower name
      • employer
      • income
      • address
      • liabilities
      • document type
  • Validation and policy checks

    • Verifies required fields are present.
    • Flags mismatches between documents and application data.
    • Applies lending-specific rules such as minimum income evidence or stale-document rejection.
  • Human review handoff

    • Routes low-confidence or high-risk cases to an underwriter or ops reviewer.
    • Keeps the model out of final decisioning when confidence is weak.
  • Audit logger

    • Persists inputs, outputs, model version, timestamps, and validation results.
    • Supports compliance review and adverse action traceability.
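
The OCR layer's "page numbers and bounding context" requirement can be sketched as a small per-page record that travels with the text. This is an illustrative shape, not a fixed schema; the field names are assumptions:

```python
from typing import TypedDict

class PageProvenance(TypedDict):
    """Illustrative provenance record carried alongside OCR output."""
    source_file_id: str    # ID of the raw file in the intake bucket
    page_number: int       # 1-based page index from the OCR pass
    ocr_confidence: float  # mean OCR confidence for this page
    text: str              # normalized text for this page

page = PageProvenance(
    source_file_id="doc-123",
    page_number=1,
    ocr_confidence=0.97,
    text="Pay stub for Jane Doe",
)
```

Carrying one such record per page means any extracted field can later be traced back to the exact page and OCR confidence it came from.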

Implementation

1. Define the state for extraction

Use a typed state object so every node in the graph knows what it can read and write. Keep raw text, extracted fields, confidence, and review flags separate.

from typing import TypedDict, List, Dict, Any

class ExtractionState(TypedDict):
    document_path: str
    raw_text: str
    extracted: Dict[str, Any]
    confidence: float
    issues: List[str]
    needs_review: bool

2. Build the LangGraph workflow

This pattern uses StateGraph, add_node, add_edge, add_conditional_edges, set_entry_point, and compile. The graph extracts text, parses fields, validates them, then routes to review if needed.

from langgraph.graph import StateGraph, END

def load_document(state: ExtractionState) -> ExtractionState:
    # Replace with OCR / PDF extraction in production
    with open(state["document_path"], "r", encoding="utf-8") as f:
        text = f.read()
    return {**state, "raw_text": text}

def extract_fields(state: ExtractionState) -> ExtractionState:
    text = state["raw_text"].lower()
    extracted = {
        "borrower_name": "Jane Doe" if "jane" in text else None,
        "employer": "Acme Corp" if "acme" in text else None,
        "monthly_income": 8500 if "$8500" in text else None,
        "document_type": "pay_stub" if "pay stub" in text else None,
    }
    confidence = 0.92 if all(extracted.values()) else 0.61
    return {**state, "extracted": extracted, "confidence": confidence}

def validate_extraction(state: ExtractionState) -> ExtractionState:
    issues = []
    extracted = state["extracted"]

    required = ["borrower_name", "monthly_income", "document_type"]
    for field in required:
        if not extracted.get(field):
            issues.append(f"missing_{field}")

    if state["confidence"] < 0.8:
        issues.append("low_confidence")

    needs_review = len(issues) > 0
    return {**state, "issues": issues, "needs_review": needs_review}

def route_after_validation(state: ExtractionState) -> str:
    return "human_review" if state["needs_review"] else END

def human_review(state: ExtractionState) -> ExtractionState:
    # In production this would create a task in your case management system.
    return {**state, "issues": state["issues"] + ["routed_to_underwriter"]}

graph = StateGraph(ExtractionState)
graph.add_node("load_document", load_document)
graph.add_node("extract_fields", extract_fields)
graph.add_node("validate_extraction", validate_extraction)
graph.add_node("human_review", human_review)

graph.set_entry_point("load_document")
graph.add_edge("load_document", "extract_fields")
graph.add_edge("extract_fields", "validate_extraction")
graph.add_conditional_edges(
    "validate_extraction",
    route_after_validation,
    {"human_review": "human_review", END: END},
)
graph.add_edge("human_review", END)

app = graph.compile()

3. Run the agent on a lending document

The compiled graph exposes an invoke() method that takes the initial state dict. For production you would pass a real file path from your ingestion service and replace the mock loader with OCR plus document classification.

result = app.invoke(
    {
        "document_path": "./sample_pay_stub.txt",
        "raw_text": "",
        "extracted": {},
        "confidence": 0.0,
        "issues": [],
        "needs_review": False,
    }
)

print(result["extracted"])
print(result["issues"])
print(result["needs_review"])

4. Add lending-specific guardrails

You want deterministic checks around extraction quality before any underwriting workflow consumes the result. For example: reject stale pay stubs older than 30 days, require at least one income source for salaried borrowers, and log every extraction version for audit.

def validate_lending_rules(state: ExtractionState) -> ExtractionState:
    issues = list(state["issues"])
    extracted = state["extracted"]

    if extracted.get("document_type") == "pay_stub" and extracted.get("monthly_income") is None:
        issues.append("income_missing_for_pay_stub")

    # Example policy hook: compare against application snapshot elsewhere in your system.
    # Keep this node pure; do not make approval decisions here.
    return {**state, "issues": issues, "needs_review": len(issues) > 0}
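
The stale-document rule mentioned above can live alongside this node. A sketch assuming the extractor also emits a document_date field as an ISO date string (the mock extractor does not produce one); the 30-day cutoff is an example policy value, not a regulatory constant:

```python
from datetime import date, timedelta
from typing import Optional

MAX_PAY_STUB_AGE_DAYS = 30  # example policy value

def check_document_staleness(extracted: dict, today: Optional[date] = None) -> list:
    """Return policy issues for documents too old to rely on as evidence."""
    issues = []
    today = today or date.today()
    doc_date_raw = extracted.get("document_date")  # assumed ISO string, e.g. "2026-03-01"
    if extracted.get("document_type") == "pay_stub" and doc_date_raw:
        doc_date = date.fromisoformat(doc_date_raw)
        if today - doc_date > timedelta(days=MAX_PAY_STUB_AGE_DAYS):
            issues.append("stale_pay_stub")
    return issues
```

The injectable today parameter keeps the check deterministic and easy to unit test, which matters when validation results feed an audit trail.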

Production Considerations

  • Data residency

    • Keep borrower documents and OCR output in-region.
    • If you use external model APIs, confirm where prompts and files are processed and retained.
  • Auditability

    • Persist the full graph run ID, input hash, extracted JSON, validation issues, model version, and timestamp.
    • Underwriting teams need to reconstruct why a field was accepted or flagged.
  • Monitoring

    • Track field-level accuracy by document type: pay stubs, bank statements, tax returns.
    • Alert on spikes in needs_review, OCR failures, or missing critical fields like income or employer name.
  • Guardrails

    • Never let extraction output directly approve or decline an application.
    • Use hard validation thresholds plus human review for low-confidence cases or conflicting documents.
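
The auditability points above can be captured as one append-only record per graph run. A standard-library sketch; the field names are illustrative, not a fixed contract:

```python
import hashlib
import json
from datetime import datetime, timezone

def build_audit_record(run_id: str, raw_text: str, extracted: dict,
                       issues: list, model_version: str) -> dict:
    """Assemble one audit record per run; persist it append-only elsewhere."""
    return {
        "run_id": run_id,
        "input_sha256": hashlib.sha256(raw_text.encode("utf-8")).hexdigest(),
        "extracted": extracted,
        "issues": issues,
        "model_version": model_version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

record = build_audit_record("run-001", "pay stub text",
                            {"borrower_name": "Jane Doe"}, [], "extractor-v3")
line = json.dumps(record, sort_keys=True)  # one JSON line per run
```

Hashing the input instead of storing it in the log keeps borrower text out of the audit stream while still letting reviewers prove which document a record refers to.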

Common Pitfalls

  • Treating extraction as decisioning

    • Mistake: using the agent to approve loans based on inferred values.
    • Fix: keep it as a structured data capture layer; underwriting logic stays separate and explainable.
  • Ignoring document provenance

    • Mistake: losing page numbers, source file IDs, or OCR confidence scores.
    • Fix: carry provenance through the LangGraph state so every field can be traced back to source material.
  • Skipping schema validation

    • Mistake: passing free-form LLM output straight into downstream systems.
    • Fix: validate against a strict schema before writing to LOS/decision engines; reject partial or malformed payloads.
  • Deploying without review routing

    • Mistake: sending low-confidence extractions into automated pipelines.
    • Fix: add conditional edges in LangGraph so uncertain cases go to an ops queue instead of silently flowing through.
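
The schema check from the third pitfall can be a strict gate in front of the LOS. A stdlib-only sketch; in practice you might use a schema library such as pydantic instead, and the field-to-type mapping here is illustrative:

```python
EXPECTED_SCHEMA = {  # illustrative mapping, not a real LOS contract
    "borrower_name": str,
    "employer": str,
    "monthly_income": (int, float),
    "document_type": str,
}

def validate_payload(extracted: dict) -> list:
    """Return schema violations; an empty list means the payload may proceed."""
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in extracted or extracted[field] is None:
            errors.append("missing:" + field)
        elif not isinstance(extracted[field], expected_type):
            errors.append("wrong_type:" + field)
    # Reject unexpected fields so free-form LLM output cannot smuggle extras through.
    errors.extend("unexpected:" + f for f in extracted if f not in EXPECTED_SCHEMA)
    return errors
```

Rejecting unexpected fields, not just missing ones, is the part teams most often skip; it is what stops a hallucinated key from silently reaching a decision engine.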

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

