How to Build a Document Extraction Agent Using LangChain in TypeScript for Retail Banking

By Cyprian Aarons · Updated 2026-04-21
document-extraction · langchain · typescript · retail-banking

A document extraction agent for retail banking reads incoming PDFs, scans, statements, payslips, IDs, and application forms, then turns them into structured fields your downstream systems can trust. It matters because onboarding, lending, KYC, and dispute handling all depend on fast extraction with low error rates, traceability, and controls around sensitive customer data.

Architecture

  • Document ingestion layer

    • Accepts PDF, image, or text inputs from branch uploads, mobile apps, or back-office queues.
    • Normalizes files before extraction.
  • OCR / text extraction layer

    • Converts scanned documents into text.
    • For production banking flows, this is usually a dedicated OCR service before LangChain sees the content.
  • LangChain extraction chain

    • Uses ChatOpenAI, PromptTemplate, and StructuredOutputParser to map unstructured text into a strict schema.
    • Handles field-level extraction like name, account number, address, income, employer, and document type.
  • Validation and policy layer

    • Checks extracted values against business rules.
    • Rejects malformed account numbers, missing dates, or unsupported document types.
  • Audit and persistence layer

    • Stores raw input references, model output, validation results, prompt version, and trace IDs.
    • Required for compliance review and model governance.
  • Human review queue

    • Routes low-confidence or policy-flagged documents to an operations analyst.
    • Keeps the bank out of silent failure mode.
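The routing decision in that last layer can be sketched as a small pure function. The confidence score, threshold, and flag names below are assumptions about your own scoring step, not part of LangChain:

```typescript
type ReviewRoute = "auto_process" | "manual_review";

interface RoutingInput {
  confidence: number;     // 0..1, produced by your own scoring step (assumed)
  policyFlags: string[];  // e.g. ["unsupported_document_type"]
}

// Hypothetical threshold; tune it against observed human override rates.
const CONFIDENCE_THRESHOLD = 0.85;

export function routeDocument(input: RoutingInput): ReviewRoute {
  // Policy flags always win: a flagged document never auto-processes.
  if (input.policyFlags.length > 0) return "manual_review";
  if (input.confidence < CONFIDENCE_THRESHOLD) return "manual_review";
  return "auto_process";
}
```

Keeping this as a pure function makes the routing policy trivially testable, which matters when auditors ask why a document skipped review.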

Implementation

1) Define the extraction schema

For retail banking you want a strict output shape. Do not ask the model for free-form JSON and hope it behaves.

import { z } from "zod";

export const BankDocumentSchema = z.object({
  documentType: z.enum(["bank_statement", "utility_bill", "passport", "payslip", "application_form"]),
  customerName: z.string().optional(),
  accountNumber: z.string().optional(),
  address: z.string().optional(),
  issueDate: z.string().optional(),
  employerName: z.string().optional(),
  monthlyIncome: z.number().optional(),
  confidenceNotes: z.array(z.string()).default([]),
});

export type BankDocument = z.infer<typeof BankDocumentSchema>;

This schema is the contract between your agent and the rest of the bank. If a field is not present in the source document, keep it optional rather than inventing values.

2) Build the LangChain extraction chain

Use ChatOpenAI with StructuredOutputParser so the model returns machine-readable data. This pattern is stable enough for production when paired with validation.

import { ChatOpenAI } from "@langchain/openai";
import { PromptTemplate } from "@langchain/core/prompts";
import { StructuredOutputParser } from "langchain/output_parsers";
import { BankDocumentSchema } from "./schema";

const parser = StructuredOutputParser.fromZodSchema(BankDocumentSchema);

const prompt = PromptTemplate.fromTemplate(`
You are extracting structured fields from retail banking documents.

Return only valid JSON that matches these instructions:
{format_instructions}

Document text:
{documentText}
`);

const model = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

export async function extractBankDocument(documentText: string) {
  const formattedPrompt = await prompt.format({
    documentText,
    format_instructions: parser.getFormatInstructions(),
  });

  const response = await model.invoke(formattedPrompt);
  const parsed = await parser.parse(response.content as string);

  return parsed;
}

This is the core pattern. temperature: 0 reduces variance, and StructuredOutputParser gives you a predictable shape that can be validated before persistence.
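Even at temperature 0, a call can still fail transiently: network errors, or the occasional malformed completion that the parser rejects. A small generic retry wrapper keeps that contained. This is a sketch, not a LangChain API; swap in your own backoff and error classification:

```typescript
// Generic retry helper for transient model or parser failures.
export async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // retries on any error; production code should be pickier
    }
  }
  throw lastError;
}

// Usage (hypothetical): await withRetries(() => extractBankDocument(text));
```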

3) Add post-processing and banking rules

Extraction is not done when the model returns JSON. Retail banking needs deterministic checks for compliance and operational safety.

import { BankDocumentSchema } from "./schema";

export function validateBankExtraction(raw: unknown) {
  const result = BankDocumentSchema.safeParse(raw);

  if (!result.success) {
    return {
      ok: false as const,
      reason: "schema_validation_failed",
      issues: result.error.issues,
    };
  }

  const doc = result.data;

  if (doc.accountNumber && !/^\d{8,20}$/.test(doc.accountNumber)) {
    return {
      ok: false as const,
      reason: "invalid_account_number_format",
    };
  }

  if (doc.monthlyIncome !== undefined && doc.monthlyIncome < 0) {
    return {
      ok: false as const,
      reason: "invalid_income_value",
    };
  }

  return { ok: true as const, data: doc };
}

This is where you enforce business logic. If your bank only accepts certain document types for KYC or income verification, reject everything else early.
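For example, a per-workflow allow-list lets you reject out-of-policy documents before they reach scoring. The workflow names and mappings below are illustrative; align them with your bank's actual policy:

```typescript
// Hypothetical per-workflow allow-lists; adjust to your bank's policy.
const ALLOWED_DOCUMENT_TYPES: Record<string, ReadonlyArray<string>> = {
  kyc_address_proof: ["utility_bill", "bank_statement"],
  income_verification: ["payslip", "bank_statement"],
};

export function isDocumentTypeAllowed(workflow: string, documentType: string): boolean {
  const allowed = ALLOWED_DOCUMENT_TYPES[workflow];
  // Unknown workflows reject by default: fail closed, not open.
  return allowed !== undefined && allowed.includes(documentType);
}
```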

4) Wire it into an API endpoint with audit logging

In production you need traceability for every decision. Log input references, output payloads, validation status, and model version.

import express from "express";
import { extractBankDocument } from "./extract";
import { validateBankExtraction } from "./validate";

const app = express();
app.use(express.json({ limit: "10mb" }));

app.post("/extract", async (req, res) => {
  const { documentText, documentId } = req.body;

  if (!documentText || !documentId) {
    return res.status(400).json({ error: "documentText and documentId are required" });
  }

  try {
    const extracted = await extractBankDocument(documentText);
    const validation = validateBankExtraction(extracted);

    // Persist these to your audit store
    console.log(JSON.stringify({
      documentId,
      extracted,
      validation,
      timestamp: new Date().toISOString(),
    }));

    if (!validation.ok) {
      return res.status(422).json(validation);
    }

    return res.json(validation.data);
  } catch (err) {
    // Model or parser failures should surface as retryable server errors,
    // not unhandled rejections.
    return res.status(502).json({ error: "extraction_failed" });
  }
});

app.listen(3000);

In a real bank this log line becomes an append-only audit event in your SIEM or governed data store. Do not rely on application logs alone for regulatory evidence.
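One way to shape that audit event is a small builder function. The field names below are assumptions; match them to your SIEM or governed store's schema:

```typescript
interface AuditEvent {
  eventType: "document_extraction";
  documentId: string;
  promptVersion: string;  // e.g. a git SHA or semver for the prompt template
  modelName: string;
  validationOk: boolean;
  timestamp: string;      // ISO 8601
}

export function buildAuditEvent(params: {
  documentId: string;
  promptVersion: string;
  modelName: string;
  validationOk: boolean;
}): AuditEvent {
  return {
    eventType: "document_extraction",
    ...params,
    timestamp: new Date().toISOString(),
  };
}
```

Centralizing the event shape in one place keeps every call site emitting the same fields, which is what makes the audit trail queryable later.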

Production Considerations

  • Data residency

    • Keep OCR text and model calls inside approved regions.
    • If your bank operates under country-specific residency rules, pin both inference and storage to the same jurisdiction.
  • Auditability

    • Store prompt version, model name, parser version, validation outcome, and source document reference.
    • Regulators care about how a decision was made after the fact.
  • Guardrails

    • Reject unsupported documents before inference.
    • Route low-confidence outputs to manual review instead of auto-submitting them into onboarding or loan origination systems.
  • Monitoring

    • Track parse failure rate, field-level missingness, hallucinated field rate, and human override rate.
    • A spike in failed extractions often means OCR drift or a prompt regression.
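A minimal in-process counter shows the shape of those signals. This is a sketch; in production you would export these to your metrics backend rather than keep them in memory:

```typescript
export class ExtractionMetrics {
  private total = 0;
  private parseFailures = 0;
  private humanOverrides = 0;

  record(outcome: { parsed: boolean; overridden: boolean }): void {
    this.total += 1;
    if (!outcome.parsed) this.parseFailures += 1;
    if (outcome.overridden) this.humanOverrides += 1;
  }

  parseFailureRate(): number {
    return this.total === 0 ? 0 : this.parseFailures / this.total;
  }

  humanOverrideRate(): number {
    return this.total === 0 ? 0 : this.humanOverrides / this.total;
  }
}
```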

Common Pitfalls

  1. Using free-form generation instead of structured parsing

    • If you ask for “a JSON object” without a parser and schema enforcement, you will get malformed outputs.
    • Use StructuredOutputParser with Zod validation every time.
  2. Skipping OCR quality checks

    • LangChain cannot fix unreadable scans.
    • Run image quality checks upstream for blur, skewed pages, low contrast, and multi-document bundles before sending text to the model.
  3. Treating extracted values as truth

    • Model output is not source-of-record data.
    • For account numbers, names on ID documents, income figures on payslips, and addresses on utility bills, cross-check against core banking systems or route to human review when confidence is low.
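A cross-check can be as simple as comparing normalized values. The normalization rules below are illustrative, and the core-banking lookup itself is assumed to live elsewhere in your stack:

```typescript
// Strip whitespace and case before comparing; extend with your own rules
// (e.g. stripping punctuation from addresses).
function normalize(value: string): string {
  return value.replace(/\s+/g, "").toLowerCase();
}

export function matchesCoreRecord(extracted: string, coreValue: string): boolean {
  return normalize(extracted) === normalize(coreValue);
}
```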

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

