How to Build a Document Extraction Agent Using LangGraph in TypeScript for Lending
A document extraction agent for lending takes incoming PDFs and scans of bank statements, pay stubs, tax returns, and IDs, then turns them into structured fields your underwriting system can trust. It matters because loan decisions depend on fast, accurate data extraction with a clean audit trail; if you miss income, misread liabilities, or lose provenance, you create compliance risk and bad credit decisions.
Architecture

- Ingestion layer
  - Accepts PDFs and images from broker portals, LOS uploads, or email intake.
  - Normalizes file metadata like applicant ID, loan ID, jurisdiction, and document type.
- Document classification node
  - Detects whether the file is a pay stub, W-2, bank statement, tax form, or ID.
  - Routes the document to the right extraction prompt and schema.
- Extraction node
  - Uses an LLM or OCR-backed model to extract fields into a strict JSON shape.
  - Produces confidence scores and source references for each field.
- Validation node
  - Checks required lending fields: name match, employer name, income totals, statement dates, account balances.
  - Flags inconsistencies for manual review instead of silently passing bad data.
- Compliance and audit layer
  - Stores raw input hashes, extracted output, model version, prompt version, and decision trace.
  - Supports exam-ready auditability and retention policies.
- Human review fallback
  - Escalates low-confidence or policy-sensitive documents to an operations queue.
  - Keeps adverse action or underwriting decisions from relying on weak extraction.
Implementation
1) Define the state and schemas
For lending workflows, keep the state explicit. You want a typed contract for what enters the graph, what gets extracted, and what gets flagged.
```typescript
import { z } from "zod";
import { Annotation } from "@langchain/langgraph";

export const DocumentTypeSchema = z.enum([
  "paystub", "bank_statement", "w2", "tax_return", "id", "unknown",
]);

export const ExtractedFieldsSchema = z.object({
  borrowerName: z.string().optional(),
  employerName: z.string().optional(),
  grossIncomeYTD: z.number().optional(),
  netPay: z.number().optional(),
  statementEndingBalance: z.number().optional(),
  statementStartDate: z.string().optional(),
  statementEndDate: z.string().optional(),
});

export const AgentState = Annotation.Root({
  filePath: Annotation<string>(),
  loanId: Annotation<string>(),
  jurisdiction: Annotation<string>(),
  documentType: Annotation<z.infer<typeof DocumentTypeSchema>>(),
  extractedFields: Annotation<z.infer<typeof ExtractedFieldsSchema>>(),
  confidence: Annotation<number>(),
  needsReview: Annotation<boolean>(),
});
```
This is the part most teams skip. In lending, loose state becomes loose controls.
2) Build the LangGraph workflow
Use `StateGraph` to route through classification, extraction, validation, and review. The graph below uses real LangGraph methods: `addNode`, `addEdge`, `addConditionalEdges`, and `compile`.
```typescript
import { StateGraph, START, END } from "@langchain/langgraph";
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";

const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

async function classifyDocument(state: typeof AgentState.State) {
  const result = await llm.invoke([
    new HumanMessage(
      `Classify this lending document type. File path: ${state.filePath}. Return only one of paystub | bank_statement | w2 | tax_return | id | unknown.`
    ),
  ]);
  const text = result.content.toString().trim();
  // Reuse the Zod enum from step 1 so classification and state stay in sync.
  const classified = DocumentTypeSchema.safeParse(text);
  const documentType = classified.success ? classified.data : "unknown";
  return {
    documentType,
    confidence: documentType === "unknown" ? 0.3 : 0.9,
    needsReview: false,
    extractedFields: {},
  };
}

async function extractFields(state: typeof AgentState.State) {
  // In production, include the OCR text or an image attachment in the prompt.
  const prompt = `Extract structured fields from ${state.documentType} for loan ${state.loanId}.
Return only JSON with borrowerName, employerName, grossIncomeYTD, netPay,
statementEndingBalance, statementStartDate, statementEndDate, and a 0-1 confidence.`;
  const result = await llm.invoke([new HumanMessage(prompt)]);
  // Models sometimes wrap JSON in prose or fences; take the outermost object.
  const raw = result.content.toString();
  const parsed = JSON.parse(raw.slice(raw.indexOf("{"), raw.lastIndexOf("}") + 1));
  const check = ExtractedFieldsSchema.safeParse(parsed);
  return {
    extractedFields: check.success ? check.data : {},
    confidence: check.success
      ? Math.min(state.confidence ?? 1, parsed.confidence ?? 0.85)
      : 0,
    needsReview: !check.success,
  };
}

async function validateExtraction(state: typeof AgentState.State) {
  const f = state.extractedFields ?? {};
  const missingCritical =
    state.documentType === "paystub" &&
    (!f.borrowerName || !f.grossIncomeYTD || !f.netPay);
  const lowConfidence = (state.confidence ?? 0) < thresholdForDoc(state.documentType);
  // LangGraph merges partial updates, so only the changed key needs returning.
  return { needsReview: Boolean(missingCritical || lowConfidence) };
}

function thresholdForDoc(docType?: string) {
  switch (docType) {
    case "bank_statement":
      return 0.88;
    case "tax_return":
      return 0.92;
    default:
      return 0.85;
  }
}

function routeAfterValidation(state: typeof AgentState.State) {
  return state.needsReview ? "review" : END;
}

const graph = new StateGraph(AgentState)
  .addNode("classify", classifyDocument)
  .addNode("extract", extractFields)
  .addNode("validate", validateExtraction)
  // Placeholder: in production this node would enqueue to an ops review queue.
  .addNode("review", async (state) => ({ ...state }))
  .addEdge(START, "classify")
  .addConditionalEdges("classify", (state) =>
    state.documentType === "unknown" ? "review" : "extract"
  )
  .addEdge("extract", "validate")
  .addConditionalEdges("validate", routeAfterValidation)
  .addEdge("review", END);

export const app = graph.compile();
```
The key pattern here is conditional routing based on document type and validation outcome. That gives you deterministic control over when a file can move forward in underwriting.
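Because the router is a plain function of state, you can unit test the routing policy without invoking any model. A minimal standalone sketch; LangGraph's `END` sentinel is stubbed as a string here so the snippet runs on its own:

```typescript
// Stand-in for LangGraph's END sentinel so this snippet is self-contained.
const END = "__end__";

interface ValidationState {
  needsReview: boolean;
}

// Mirrors routeAfterValidation above: ops queue or finish, nothing in between.
function routeAfterValidation(state: ValidationState): string {
  return state.needsReview ? "review" : END;
}

console.log(routeAfterValidation({ needsReview: true }));  // "review"
console.log(routeAfterValidation({ needsReview: false })); // "__end__"
```

Keeping routing logic in pure functions like this is what makes the "deterministic control" claim testable: the policy can be exercised in CI with no LLM calls.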
3) Add an invocation wrapper with audit logging
In lending systems you need traceability per loan file. Store inputs and outputs with immutable metadata so compliance can reconstruct what happened later.
```typescript
// auditLog and hash are application-specific helpers: persist the record to
// an append-only store and compute a content digest (e.g. SHA-256).
async function runExtraction() {
  const result = await app.invoke({
    filePath: "/data/loans/loan-123/paystub.pdf",
    loanId: "loan-123",
    jurisdiction: "US-CA",
    documentType: "unknown",
    extractedFields: {},
    confidence: 0,
    needsReview: false,
  });
  await auditLog({
    loanId: result.loanId,
    jurisdiction: result.jurisdiction,
    documentType: result.documentType,
    confidence: result.confidence,
    needsReview: result.needsReview,
    extractedFieldsHash: hash(JSON.stringify(result.extractedFields)),
  });
  return result;
}
```
In production I’d also persist:

- model name
- prompt version
- OCR version
- operator overrides
- reviewer identity
That gives you a defensible chain of custody.
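As a sketch of the hashing half of that chain, here is what the `hash` helper could look like using Node's built-in crypto module; the `AuditRecord` shape and field values are illustrative:

```typescript
import { createHash } from "node:crypto";

// Hypothetical audit record; field names mirror the auditLog call above.
interface AuditRecord {
  loanId: string;
  extractedFieldsHash: string;
  modelName: string;
  promptVersion: string;
}

// SHA-256 over the serialized fields gives a tamper-evident digest.
// Note: JSON.stringify is key-order sensitive; a production system would
// canonicalize key order before hashing.
function hashFields(fields: Record<string, unknown>): string {
  return createHash("sha256").update(JSON.stringify(fields)).digest("hex");
}

const record: AuditRecord = {
  loanId: "loan-123",
  extractedFieldsHash: hashFields({ netPay: 4200.55 }),
  modelName: "gpt-4o-mini",
  promptVersion: "paystub-v3",
};
console.log(record.extractedFieldsHash.length); // 64 hex chars
```

Storing the digest rather than a mutable copy means a later dispute can be settled by re-hashing the archived extraction and comparing.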
Production Considerations

- Data residency
  Keep files in-region if your lending program operates under local storage requirements. If a borrower’s documents are subject to regional controls, don’t ship raw PDFs across borders just to run extraction.
- Compliance logging
  Log every graph transition with timestamps and correlation IDs. For regulated lending flows you need to show why a document was accepted or routed to review.
- Guardrails on extraction
  Constraint-check numeric fields before they hit underwriting rules. Example:

  ```typescript
  if (
    result.extractedFields.grossIncomeYTD &&
    result.extractedFields.grossIncomeYTD > 10_000_000
  ) {
    throw new Error("Outlier income value requires manual review");
  }
  ```

  That prevents garbage OCR from becoming a credit decision.
- Monitoring
  Track per-document-type accuracy by lender channel. A pay stub parser that works on one employer template may fail badly on another; monitor false positives by source bank or payroll provider.
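The monitoring point above can start as something very simple. A minimal in-memory review-rate tracker keyed by document type and channel; a real deployment would back this with a metrics store, and all names here are illustrative:

```typescript
// Tracks how often each (docType, channel) pair gets flagged for review.
type Key = `${string}:${string}`; // "docType:channel"

class ExtractionMetrics {
  private counts = new Map<Key, { total: number; flagged: number }>();

  record(docType: string, channel: string, needsReview: boolean): void {
    const key: Key = `${docType}:${channel}`;
    const entry = this.counts.get(key) ?? { total: 0, flagged: 0 };
    entry.total += 1;
    if (needsReview) entry.flagged += 1;
    this.counts.set(key, entry);
  }

  // Fraction of documents routed to review; 0 if nothing seen yet.
  reviewRate(docType: string, channel: string): number {
    const entry = this.counts.get(`${docType}:${channel}`);
    return entry ? entry.flagged / entry.total : 0;
  }
}

const m = new ExtractionMetrics();
m.record("paystub", "broker-portal", false);
m.record("paystub", "broker-portal", true);
console.log(m.reviewRate("paystub", "broker-portal")); // 0.5
```

A rising review rate for one channel while others stay flat is exactly the employer-template or source-bank drift the bullet above warns about.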
Common Pitfalls

- Treating all documents the same
  Pay stubs and tax returns have different field semantics. Avoid one generic extractor; route by document type first.
- Skipping human review thresholds
  Low-confidence outputs should not auto-fill underwriting systems. Set hard thresholds per doc type and jurisdiction so borderline cases go to ops.
- Ignoring provenance
  If you don’t store source references or hashes tied to each extraction run, you can’t defend the decision later. Keep immutable logs for every document processed.
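Per-doc-type-and-jurisdiction thresholds extend the `thresholdForDoc` idea from step 2 with an override table. A sketch; the table values are illustrative placeholders, not regulatory guidance:

```typescript
// Base thresholds by document type, mirroring thresholdForDoc above.
const BASE_THRESHOLDS: Record<string, number> = {
  bank_statement: 0.88,
  tax_return: 0.92,
};

// Hypothetical per-jurisdiction overrides for stricter regimes.
const JURISDICTION_OVERRIDES: Record<string, Record<string, number>> = {
  "US-CA": { tax_return: 0.95 },
};

function reviewThreshold(docType: string, jurisdiction: string): number {
  return (
    JURISDICTION_OVERRIDES[jurisdiction]?.[docType] ??
    BASE_THRESHOLDS[docType] ??
    0.85 // conservative default for unlisted document types
  );
}

console.log(reviewThreshold("tax_return", "US-CA")); // 0.95
console.log(reviewThreshold("paystub", "US-NY"));    // 0.85
```

Keeping the table as data rather than branching code means compliance can review and version the thresholds without touching graph logic.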
Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.