How to Build a Document Extraction Agent Using LangGraph in TypeScript for Retail Banking
A document extraction agent for retail banking takes unstructured files like bank statements, payslips, utility bills, and ID scans, then turns them into structured fields your downstream systems can trust. It matters because onboarding, credit checks, and KYC workflows are still bottlenecked by manual review, and bad extraction creates compliance risk, operational cost, and customer drop-off.
Architecture
Build this agent with a small set of components that are easy to audit and reason about:
- **Document ingestion layer**
  - Accepts PDFs, images, or text from your onboarding channel.
  - Normalizes file metadata like customer ID, source system, and region.
- **OCR / text extraction tool**
  - Converts scanned documents into text.
  - For banking, keep the raw OCR output because it is part of the audit trail.
- **Extraction model node**
  - Uses an LLM to map text into a strict schema.
  - Output should be typed fields like `fullName`, `accountNumber`, `incomeAmount`, and `documentType`.
- **Validation node**
  - Checks required fields, format rules, confidence thresholds, and business rules.
  - Example: account numbers must match local banking formats; income cannot be negative.
- **Human review escalation**
  - Routes low-confidence or policy-sensitive documents to an analyst.
  - This is mandatory for edge cases like mismatched names or suspicious alterations.
- **Audit logging and persistence**
  - Stores input hashes, extracted outputs, validation results, model version, and reviewer actions.
  - Needed for compliance reviews and dispute handling.
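The audit logging component can be sketched as a small record builder. This is a minimal sketch, not a full persistence layer: the `AuditRecord` shape and `buildAuditRecord` helper are illustrative names, and you would replace the plain return value with a write to your bank's approved audit store.

```typescript
import { createHash } from "node:crypto";

// Hypothetical audit record shape -- adapt fields to your persistence layer.
interface AuditRecord {
  inputHash: string;     // SHA-256 of the raw document bytes
  modelVersion: string;  // pinned model identifier used for extraction
  extractedJson: string; // serialized structured output
  timestamp: string;     // when the record was created
}

// Hash the raw input so auditors can verify exactly which file
// produced a given extraction and downstream decision.
function buildAuditRecord(
  rawBytes: Buffer,
  modelVersion: string,
  extracted: unknown
): AuditRecord {
  return {
    inputHash: createHash("sha256").update(rawBytes).digest("hex"),
    modelVersion,
    extractedJson: JSON.stringify(extracted),
    timestamp: new Date().toISOString(),
  };
}
```

Storing the hash of the original bytes (rather than only the OCR text) lets you prove later that a disputed decision was made from a specific, unaltered document.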
Implementation
1. Define the state and schema
Use a typed graph state so every node reads and writes predictable data. In retail banking, that means you can trace exactly how a field was produced.
```typescript
import { Annotation, StateGraph } from "@langchain/langgraph";
import { z } from "zod";

const DocumentSchema = z.object({
  documentType: z.enum(["bank_statement", "payslip", "utility_bill", "id_document"]),
  fullName: z.string().optional(),
  accountNumber: z.string().optional(),
  incomeAmount: z.number().optional(),
  address: z.string().optional(),
});

const GraphState = Annotation.Root({
  rawText: Annotation<string>(),
  extracted: Annotation<z.infer<typeof DocumentSchema> | null>(),
  validationErrors: Annotation<string[]>(),
  needsHumanReview: Annotation<boolean>(),
});
```
2. Add extraction and validation nodes
Keep the extraction prompt strict. Don’t ask the model to “be smart”; ask it to return only schema-shaped data. Then validate against business rules before anything reaches core banking workflows.
```typescript
const extractNode = async (state: typeof GraphState.State) => {
  const text = state.rawText;

  // Replace this stub with your actual LLM call.
  // The important part is that the node returns structured data.
  const extracted = DocumentSchema.safeParse({
    documentType: "bank_statement",
    fullName: "Jane Doe",
    accountNumber: "1234567890",
    incomeAmount: undefined,
    address: undefined,
  });

  if (!extracted.success) {
    return {
      extracted: null,
      validationErrors: ["Schema validation failed"],
      needsHumanReview: true,
    };
  }

  return {
    extracted: extracted.data,
    validationErrors: [],
    needsHumanReview: false,
    rawText: text,
  };
};

const validateNode = async (state: typeof GraphState.State) => {
  const errors: string[] = [];
  const doc = state.extracted;

  if (!doc) {
    errors.push("No structured output returned");
    return { validationErrors: errors, needsHumanReview: true };
  }

  if (doc.documentType === "bank_statement" && !doc.accountNumber) {
    errors.push("Bank statement missing account number");
  }
  if (doc.documentType === "payslip" && typeof doc.incomeAmount !== "number") {
    errors.push("Payslip missing income amount");
  }

  return {
    validationErrors: errors,
    needsHumanReview: errors.length > 0 ? true : state.needsHumanReview,
  };
};
```
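The format rules mentioned in the architecture section (account numbers must match local formats, income cannot be negative) can live in a small deterministic helper that validation calls before anything is accepted. The 10-digit pattern below is an assumption for illustration; substitute your region's real account-number scheme (IBAN, BSB + account, etc.).

```typescript
// Illustrative local format -- replace with your region's actual scheme.
const ACCOUNT_NUMBER_PATTERN = /^\d{10}$/;

// Deterministic business-rule checks: no model involved, so the
// results are reproducible for auditors.
function checkBusinessRules(doc: {
  accountNumber?: string;
  incomeAmount?: number;
}): string[] {
  const errors: string[] = [];
  if (doc.accountNumber !== undefined && !ACCOUNT_NUMBER_PATTERN.test(doc.accountNumber)) {
    errors.push("Account number does not match expected format");
  }
  if (doc.incomeAmount !== undefined && doc.incomeAmount < 0) {
    errors.push("Income amount cannot be negative");
  }
  return errors;
}
```

Because these checks are pure functions over the extracted fields, they can be unit-tested independently of any model or graph run.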
3. Route low-confidence cases to human review
LangGraph’s conditional edges are the right fit here. If the document fails validation or confidence is too low, stop automation and hand off to an analyst queue.
```typescript
const humanReviewNode = async (state: typeof GraphState.State) => {
  const doc = state.extracted;
  return {
    validationErrors: [...state.validationErrors, "Sent to human review"],
    needsHumanReview: true,
    extracted: doc,
  };
};

const workflow = new StateGraph(GraphState)
  .addNode("extract", extractNode)
  .addNode("validate", validateNode)
  .addNode("human_review", humanReviewNode)
  .addEdge("__start__", "extract")
  .addEdge("extract", "validate")
  .addConditionalEdges(
    "validate",
    (state) => (state.needsHumanReview ? "human_review" : "__end__"),
    {
      human_review: "human_review",
      __end__: "__end__",
    }
  );
```
4. Compile and run the graph
Compile once at startup and reuse it across requests. In production banking systems, you want deterministic execution paths and stable model/version pinning.
```typescript
const app = workflow.compile();

const result = await app.invoke({
  rawText:
    "Jane Doe\nAccount Number: 1234567890\nStatement Date: Jan 2026\nBalance...",
  extracted: null,
  validationErrors: [],
  needsHumanReview: false,
});

console.log(result);
```
Production Considerations
- **Data residency**
  - Keep OCR text and extracted fields in-region if your bank operates under local residency requirements.
  - Pin storage buckets, queues, and model endpoints to approved jurisdictions.
- **Auditability**
  - Persist the full graph trace per document: input hash, model name/version, extracted JSON, validation failures, and reviewer decision.
  - This is what internal audit will ask for when a loan or onboarding decision is challenged.
- **Guardrails**
  - Reject free-form outputs. Only accept schema-valid JSON.
  - Add deterministic checks for known banking rules before any downstream write.
  - Block sensitive documents from being processed by non-approved models.
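The "schema-valid JSON only" guardrail can be enforced with a strict parse step before any schema validation runs. A minimal sketch; `parseStrictJson` is an illustrative helper name, and the rules here (single top-level object, nothing else in the string) are one reasonable policy, not the only one.

```typescript
// Accept only a single bare JSON object; reject anything else a model
// might emit (prose preambles, markdown fences, trailing commentary).
function parseStrictJson(raw: string): Record<string, unknown> | null {
  const trimmed = raw.trim();
  if (!trimmed.startsWith("{") || !trimmed.endsWith("}")) return null;
  try {
    const parsed = JSON.parse(trimmed);
    return typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)
      ? (parsed as Record<string, unknown>)
      : null;
  } catch {
    return null;
  }
}
```

Anything that fails this gate should be treated like a schema failure: no retry loops that "fix up" the text, just an error and a route to human review.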
- **Monitoring**
| Metric | Why it matters | Action threshold |
|---|---|---|
| Extraction success rate | Measures how often documents become usable data | Drop below baseline triggers investigation |
| Human review rate | Shows model drift or bad scan quality | Spike indicates prompt/model regression |
| Field-level error rate | Catches bad account numbers or names | Any rise on regulated fields needs rollback |
| Latency per document | Impacts onboarding SLA | P95 above target requires scaling |
Common Pitfalls
- **Using one generic prompt for every document type**
  Retail banking documents vary a lot. A payslip has different fields than a bank statement or utility bill. Split by document type first, then apply targeted extraction logic.
- **Skipping strict validation**
  If you let the model decide what “looks right,” you will ship bad data into KYC or credit systems. Validate field formats, required values, and cross-field consistency before acceptance.
- **Not preserving the raw source**
  You need the original OCR text and file hash for disputes and compliance reviews. Store both alongside the extracted payload so auditors can reproduce what happened.
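Splitting by document type first can be as simple as a classification step ahead of extraction. The keyword heuristic below is a stand-in for illustration; in production you would classify with a model or with source-channel metadata, and anything unclassified should go straight to human review.

```typescript
type DocType =
  | "bank_statement"
  | "payslip"
  | "utility_bill"
  | "id_document"
  | "unknown";

// Toy keyword classifier -- a placeholder for a real model- or
// metadata-based classification step.
function classifyDocument(rawText: string): DocType {
  const text = rawText.toLowerCase();
  if (text.includes("statement date") || text.includes("opening balance")) return "bank_statement";
  if (text.includes("gross pay") || text.includes("net pay")) return "payslip";
  if (text.includes("meter reading") || text.includes("kwh")) return "utility_bill";
  if (text.includes("passport") || text.includes("date of birth")) return "id_document";
  return "unknown"; // route unknowns to human review, not extraction
}
```

Once the type is known, each branch can use its own prompt and its own required-field rules instead of one generic extraction pass.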
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.