How to Build a Document Extraction Agent Using LangGraph in TypeScript for Retail Banking

By Cyprian Aarons · Updated 2026-04-21
document-extraction · langgraph · typescript · retail-banking

A document extraction agent for retail banking takes unstructured files like bank statements, payslips, utility bills, and ID scans, then turns them into structured fields your downstream systems can trust. It matters because onboarding, credit checks, and KYC workflows are still bottlenecked by manual review, and bad extraction creates compliance risk, operational cost, and customer drop-off.

Architecture

Build this agent with a small set of components that are easy to audit and reason about:

  • Document ingestion layer

    • Accepts PDFs, images, or text from your onboarding channel.
    • Normalizes file metadata like customer ID, source system, and region (a sketch of this envelope follows the list).
  • OCR / text extraction tool

    • Converts scanned documents into text.
    • For banking, keep the raw OCR output because it is part of the audit trail.
  • Extraction model node

    • Uses an LLM to map text into a strict schema.
    • Output should be typed fields like fullName, accountNumber, incomeAmount, and documentType.
  • Validation node

    • Checks required fields, format rules, confidence thresholds, and business rules.
    • Example: account numbers must match local banking formats; income cannot be negative.
  • Human review escalation

    • Routes low-confidence or policy-sensitive documents to an analyst.
    • This is mandatory for edge cases like mismatched names or suspicious alterations.
  • Audit logging and persistence

    • Stores input hashes, extracted outputs, validation results, model version, and reviewer actions.
    • Needed for compliance reviews and dispute handling.
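
To make the component boundaries concrete, here is a minimal sketch of the envelope the ingestion layer could attach to each file. Every field name here (customerId, sourceSystem, region, rawBytesHash) is an illustrative assumption, not a fixed contract:

// Hypothetical ingestion envelope; adapt field names to your source systems.
interface DocumentEnvelope {
  documentId: string;     // internal reference for this file
  customerId: string;     // who submitted the document
  sourceSystem: string;   // e.g. "mobile_onboarding" or "branch_upload"
  region: string;         // jurisdiction, used for data-residency routing
  mimeType: string;       // "application/pdf", "image/png", ...
  rawBytesHash: string;   // SHA-256 of the original file, part of the audit trail
  ocrText?: string;       // populated by the OCR tool and kept verbatim
}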

Implementation

1. Define the state and schema

Use a typed graph state so every node reads and writes predictable data. In retail banking, that means you can trace exactly how a field was produced.

import { Annotation, StateGraph } from "@langchain/langgraph";
import { z } from "zod";

const DocumentSchema = z.object({
  documentType: z.enum(["bank_statement", "payslip", "utility_bill", "id_document"]),
  fullName: z.string().optional(),
  accountNumber: z.string().optional(),
  incomeAmount: z.number().optional(),
  address: z.string().optional(),
});

const GraphState = Annotation.Root({
  rawText: Annotation<string>(),
  extracted: Annotation<z.infer<typeof DocumentSchema> | null>(),
  validationErrors: Annotation<string[]>(),
  needsHumanReview: Annotation<boolean>(),
});

2. Add extraction and validation nodes

Keep the extraction prompt strict. Don’t ask the model to “be smart”; ask it to return only schema-shaped data. Then validate against business rules before anything reaches core banking workflows.

const extractNode = async (state: typeof GraphState.State) => {
  const text = state.rawText;

  // Replace this with your actual LLM call.
  // The important part is that the node returns structured data.
  const extracted = DocumentSchema.safeParse({
    documentType: "bank_statement",
    fullName: "Jane Doe",
    accountNumber: "1234567890",
    incomeAmount: undefined,
    address: undefined,
  });

  if (!extracted.success) {
    return {
      extracted: null,
      validationErrors: ["Schema validation failed"],
      needsHumanReview: true,
    };
  }

  return {
    extracted: extracted.data,
    validationErrors: [],
    needsHumanReview: false,
    rawText: text,
  };
};
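
In a real deployment, the hardcoded object above becomes a structured-output call with the schema bound to the model. A minimal sketch, assuming @langchain/openai; the model name is illustrative, so swap in whichever approved model your bank uses:

import { ChatOpenAI } from "@langchain/openai";

// Binding the zod schema constrains the model to schema-shaped output.
const model = new ChatOpenAI({ model: "gpt-4o", temperature: 0 });
const extractor = model.withStructuredOutput(DocumentSchema);

const llmExtract = async (rawText: string) => {
  // Strict instruction: extract only what is present, never infer.
  return extractor.invoke(
    "Extract the schema fields from this banking document. " +
      "Omit any field not explicitly present in the text.\n\n" + rawText
  );
};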

const validateNode = async (state: typeof GraphState.State) => {
  const errors: string[] = [];
  const doc = state.extracted;

  if (!doc) {
    errors.push("No structured output returned");
    return { validationErrors: errors, needsHumanReview: true };
  }

  if (doc.documentType === "bank_statement" && !doc.accountNumber) {
    errors.push("Bank statement missing account number");
  }

  if (doc.documentType === "payslip" && typeof doc.incomeAmount !== "number") {
    errors.push("Payslip missing income amount");
  }

  return {
    validationErrors: errors,
    needsHumanReview: errors.length > 0 || state.needsHumanReview,
  };
};

3. Route low-confidence cases to human review

LangGraph’s conditional edges are the right fit here. If the document fails validation or confidence is too low, stop automation and hand off to an analyst queue.

const humanReviewNode = async (state: typeof GraphState.State) => {
  // In production, enqueue state.extracted to an analyst queue here;
  // unreturned state keys (like extracted) keep their current values.
  return {
    validationErrors: [...state.validationErrors, "Sent to human review"],
    needsHumanReview: true,
  };
};

const workflow = new StateGraph(GraphState)
  .addNode("extract", extractNode)
  .addNode("validate", validateNode)
  .addNode("human_review", humanReviewNode)
  .addEdge("__start__", "extract")
  .addEdge("extract", "validate")
  .addConditionalEdges(
    "validate",
    (state) => (state.needsHumanReview ? "human_review" : "__end__"),
    {
      human_review: "human_review",
      __end__: "__end__",
    }
  );
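
The humanReviewNode above only flags state and continues; in production you usually want the graph to pause until an analyst responds. A sketch assuming a recent @langchain/langgraph release that exports interrupt, Command, and MemorySaver:

import { interrupt, Command, MemorySaver } from "@langchain/langgraph";

const pausingReviewNode = async (state: typeof GraphState.State) => {
  // interrupt() suspends the run and surfaces this payload to your review queue.
  const corrected = interrupt({
    extracted: state.extracted,
    errors: state.validationErrors,
  });
  // On resume, the analyst's corrected fields replace the extraction.
  return {
    extracted: corrected as z.infer<typeof DocumentSchema>,
    needsHumanReview: false,
  };
};

// Interrupts require a checkpointer and a per-document thread_id:
// const app = workflow.compile({ checkpointer: new MemorySaver() });
// await app.invoke(input, { configurable: { thread_id: documentId } });
// await app.invoke(new Command({ resume: correctedDoc }), { configurable: { thread_id: documentId } });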

4. Compile and run the graph

Compile once at startup and reuse it across requests. In production banking systems, you want deterministic execution paths and stable model/version pinning.

const app = workflow.compile();

const result = await app.invoke({
  rawText:
    "Jane Doe\nAccount Number: 1234567890\nStatement Date: Jan 2026\nBalance...",
  extracted: null,
  validationErrors: [],
  needsHumanReview: false,
});

console.log(result);

Production Considerations

  • Data residency

    • Keep OCR text and extracted fields in-region if your bank operates under local residency requirements.
    • Pin storage buckets, queues, and model endpoints to approved jurisdictions.
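
One way to enforce this is to resolve every storage and model endpoint through a region map at startup, failing closed on unknown regions. The bucket names and endpoints below are placeholders:

// Hypothetical region map; bucket names and endpoints are placeholders.
const REGION_CONFIG: Record<string, { bucket: string; modelEndpoint: string }> = {
  "eu-west": { bucket: "docs-eu-west", modelEndpoint: "https://llm.eu.internal.example" },
  "uk":      { bucket: "docs-uk",      modelEndpoint: "https://llm.uk.internal.example" },
};

const resolveRegion = (region: string) => {
  const cfg = REGION_CONFIG[region];
  // Fail closed: an unknown region must never fall back to a default.
  if (!cfg) throw new Error(`No approved infrastructure for region: ${region}`);
  return cfg;
};
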
  • Auditability

    • Persist the full graph trace per document:
      • input hash
      • model name/version
      • extracted JSON
      • validation failures
      • reviewer decision
    • This is what internal audit will ask for when a loan or onboarding decision is challenged.
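
A minimal sketch of that per-document trace, using Node's built-in crypto for the input hash. The record fields mirror the list above; the names are assumptions to adapt:

import { createHash } from "node:crypto";

// One audit record per processed document, persisted before any downstream write.
interface AuditRecord {
  inputHash: string;          // SHA-256 of the original file bytes
  modelVersion: string;       // pinned model name/version used for extraction
  extracted: unknown;         // the schema-valid JSON, stored verbatim
  validationErrors: string[]; // every rule that failed
  reviewerDecision?: "approved" | "rejected" | "corrected";
}

const hashInput = (fileBytes: Buffer): string =>
  createHash("sha256").update(fileBytes).digest("hex");
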
  • Guardrails

    • Reject free-form outputs. Only accept schema-valid JSON.
    • Add deterministic checks for known banking rules before any downstream write (one such rule set is sketched below).
    • Block sensitive documents from being processed by non-approved models.
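
A sketch of one deterministic rule set; the 10-digit account pattern is an illustrative placeholder, not a real banking standard:

// Placeholder format rule; substitute your local account-number standard.
const ACCOUNT_NUMBER_PATTERN = /^\d{10}$/;

const guardrailViolations = (doc: z.infer<typeof DocumentSchema>): string[] => {
  const violations: string[] = [];
  if (doc.accountNumber && !ACCOUNT_NUMBER_PATTERN.test(doc.accountNumber)) {
    violations.push("Account number does not match approved format");
  }
  if (doc.incomeAmount !== undefined && doc.incomeAmount < 0) {
    violations.push("Income cannot be negative");
  }
  return violations;
};
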
  • Monitoring

| Metric | Why it matters | Action threshold |
| --- | --- | --- |
| Extraction success rate | Measures how often documents become usable data | Drop below baseline triggers investigation |
| Human review rate | Shows model drift or bad scan quality | Spike indicates prompt/model regression |
| Field-level error rate | Catches bad account numbers or names | Any rise on regulated fields needs rollback |
| Latency per document | Impacts onboarding SLA | P95 above target requires scaling |

Common Pitfalls

  • Using one generic prompt for every document type

Retail banking documents vary widely. A payslip has different fields than a bank statement or utility bill. Split by document type first, then apply targeted extraction logic, as sketched below.
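
One way to split first, shown here with hypothetical per-type prompt fragments; in practice each prompt is tuned and versioned separately:

// Hypothetical per-type prompts keyed by the schema's documentType values.
const TYPE_PROMPTS: Record<z.infer<typeof DocumentSchema>["documentType"], string> = {
  bank_statement: "Extract account number, statement period, and holder name.",
  payslip: "Extract employer, gross income, net income, and pay date.",
  utility_bill: "Extract provider, billing address, and billing period.",
  id_document: "Extract full name, date of birth, and document number.",
};

const promptFor = (documentType: keyof typeof TYPE_PROMPTS, rawText: string) =>
  `${TYPE_PROMPTS[documentType]}\nReturn only schema-shaped JSON.\n\n${rawText}`;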

  • Skipping strict validation

If you let the model decide what “looks right,” you will ship bad data into KYC or credit systems. Validate field formats, required values, and cross-field consistency before acceptance.
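
Cross-field rules can live directly on the schema. A sketch using zod's superRefine, layered on the DocumentSchema from step 1:

// Cross-field consistency: required fields depend on the document type.
const StrictDocumentSchema = DocumentSchema.superRefine((doc, ctx) => {
  if (doc.documentType === "payslip" && doc.incomeAmount === undefined) {
    ctx.addIssue({ code: z.ZodIssueCode.custom, message: "Payslip missing income amount" });
  }
  if (doc.documentType === "bank_statement" && !doc.accountNumber) {
    ctx.addIssue({ code: z.ZodIssueCode.custom, message: "Bank statement missing account number" });
  }
});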

  • Not preserving the raw source

You need the original OCR text and file hash for disputes and compliance reviews. Store both alongside the extracted payload so auditors can reproduce what happened.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
