How to Build a Document Extraction Agent Using LangChain in TypeScript for Banking
A document extraction agent turns unstructured banking documents into structured data your downstream systems can use. In practice, it reads things like KYC forms, bank statements, loan applications, proof-of-income letters, and trade finance docs, then extracts fields with traceability so compliance teams can review what was pulled and why it matters.
For banking, this is not just a convenience layer. It reduces manual ops work, shortens onboarding time, and gives you an auditable pipeline for regulated document handling.
Architecture
- Document ingestion layer
  - Accepts PDFs, scans, or text files from secure storage or an internal upload service.
  - Normalizes file metadata like customer ID, case ID, jurisdiction, and retention policy.
- Text extraction layer
  - Uses OCR or PDF text parsing before the LLM sees anything.
  - Keeps page-level provenance so every extracted field can be traced back to source content.
- LangChain extraction chain
  - Uses `ChatOpenAI` plus structured output via `withStructuredOutput()` or tool-style extraction.
  - Produces typed JSON instead of free-form text.
- Validation layer
  - Checks extracted values against schema rules and banking constraints.
  - Rejects malformed dates, invalid account formats, or missing mandatory KYC fields.
- Audit and logging layer
  - Stores prompt version, model version, document hash, extracted JSON, and confidence signals.
  - Supports regulator review and internal model governance.
- Secure persistence layer
  - Writes only approved outputs to a database or case management system.
  - Enforces data residency and encryption requirements.
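The audit and logging layer above can be sketched as a typed record. This is a minimal sketch: `AuditRecord` and `buildAuditRecord` are illustrative names, not a LangChain API or a banking standard, and the field set is an assumption based on the layers listed here.

```typescript
import { createHash } from "node:crypto";

// Illustrative shape for one audit entry; the field names are
// assumptions, not a standard schema.
export interface AuditRecord {
  documentHash: string;  // SHA-256 of the raw document bytes
  promptVersion: string; // version tag of the prompt template
  modelVersion: string;  // model identifier used for extraction
  outputHash: string;    // SHA-256 of the extracted JSON payload
  createdAt: string;     // ISO timestamp for regulator review
}

export function sha256Hex(data: Buffer | string): string {
  return createHash("sha256").update(data).digest("hex");
}

export function buildAuditRecord(
  documentBytes: Buffer,
  extractedJson: unknown,
  promptVersion: string,
  modelVersion: string,
): AuditRecord {
  return {
    documentHash: sha256Hex(documentBytes),
    promptVersion,
    modelVersion,
    outputHash: sha256Hex(JSON.stringify(extractedJson)),
    createdAt: new Date().toISOString(),
  };
}
```

Hashing both the input bytes and the output JSON lets an auditor verify that a stored extraction matches the exact document it came from.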
Implementation
1) Define the extraction schema
Start with a strict schema. In banking, loose JSON is a liability because downstream teams need deterministic fields for onboarding, AML checks, and underwriting.
```typescript
import { z } from "zod";

export const BankDocumentSchema = z.object({
  documentType: z.enum([
    "bank_statement",
    "kyc_form",
    "loan_application",
    "proof_of_income",
  ]),
  customerName: z.string().min(1),
  accountNumber: z.string().optional(),
  iban: z.string().optional(),
  currency: z.string().length(3).optional(),
  statementPeriodStart: z.string().optional(),
  statementPeriodEnd: z.string().optional(),
  totalIncome: z.number().optional(),
  totalDebits: z.number().optional(),
  totalCredits: z.number().optional(),
});

export type BankDocumentExtraction = z.infer<typeof BankDocumentSchema>;
```
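The schema alone cannot express cross-field rules such as the statement period being ordered. A minimal dependency-free sketch of such a check, which the validation layer can run alongside schema parsing (the function name is illustrative):

```typescript
// Returns true when both dates parse and start <= end.
// Missing values pass, since both fields are optional in the schema;
// a required-field policy belongs in a separate check.
export function isValidStatementPeriod(start?: string, end?: string): boolean {
  if (start === undefined || end === undefined) return true;
  const startMs = Date.parse(start);
  const endMs = Date.parse(end);
  if (Number.isNaN(startMs) || Number.isNaN(endMs)) return false;
  return startMs <= endMs;
}
```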
2) Build the LangChain extraction chain
This pattern uses `ChatOpenAI` and `withStructuredOutput()` so the model returns schema-valid output. That gives you much better control than parsing raw markdown.
```typescript
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";
import { BankDocumentSchema } from "./schema";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const extractor = llm.withStructuredOutput(BankDocumentSchema);

export async function extractBankDocument(text: string) {
  const prompt = `
You are extracting data from a banking document.
Return only fields supported by the schema.
If a field is not present in the document, omit it.
Do not guess account numbers or monetary values.
`;

  const result = await extractor.invoke([
    { role: "system", content: prompt },
    new HumanMessage(text),
  ]);

  return result;
}
```
This is the core pattern. The important part is that the schema is enforced at the model boundary, not after you already stored bad data.
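Structured output can still fail intermittently, so a thin retry wrapper around the extractor is a common addition before routing a document to manual review. This is a generic sketch; `withRetries` is an illustrative helper, not a LangChain API:

```typescript
// Retries an async extraction up to maxAttempts times, rethrowing the
// last error so callers can route the document to human review.
export async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Usage would look like `withRetries(() => extractBankDocument(text))`; keep the attempt count low, since repeated failures usually mean the document itself needs review, not another model call.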
3) Add page-aware preprocessing and provenance
In banking workflows, you need to know where each field came from. If your source is a PDF, split it into pages first and keep page references alongside the extracted payload.
```typescript
import fs from "node:fs/promises";
import { createHash } from "node:crypto";
import pdfParse from "pdf-parse";
import { extractBankDocument } from "./extractor";

export async function extractFromPdf(filePath: string) {
  const buffer = await fs.readFile(filePath);
  const parsed = await pdfParse(buffer);

  // Page splitting here assumes the parser emits form feeds between
  // pages; adjust this for your parser's render options if it does not.
  const pagesText = parsed.text
    .split(/\f/g)
    .map((pageText, index) => ({
      pageNumber: index + 1,
      text: pageText.trim(),
    }))
    .filter((p) => p.text.length > 0);

  const combinedText = pagesText
    .map((p) => `PAGE ${p.pageNumber}\n${p.text}`)
    .join("\n\n");

  const extracted = await extractBankDocument(combinedText);

  return {
    sourceFile: filePath,
    pageCount: pagesText.length,
    extracted,
    // A real content hash, not a byte-length hint, so audit records
    // can tie this extraction back to the exact source bytes.
    documentHash: createHash("sha256").update(buffer).digest("hex"),
  };
}
```
If you need stronger provenance, store per-page chunks and run extraction per page or per section. That makes audit review easier when compliance asks where a specific field came from.
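One simple way to attach provenance after the fact is a naive substring search that records which pages contain a given extracted value. `findFieldPages` is an illustrative helper, not part of the pipeline above, and it will miss values the model reformatted (e.g. re-punctuated amounts), so treat an empty result as a review trigger rather than proof of a hallucination:

```typescript
interface PageText {
  pageNumber: number;
  text: string;
}

// Naive provenance: return the page numbers whose raw text contains
// the extracted value verbatim. An empty array means the value could
// not be located and the case should go to human review.
export function findFieldPages(pages: PageText[], value: string): number[] {
  if (value.length === 0) return [];
  return pages
    .filter((p) => p.text.includes(value))
    .map((p) => p.pageNumber);
}
```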
4) Validate before persistence
Do not write model output directly into your case system. Validate it again in application code and reject anything that violates your bank’s policy rules.
```typescript
import { BankDocumentSchema } from "./schema";
import { extractFromPdf } from "./pipeline";

async function main() {
  const result = await extractFromPdf("./documents/customer-statement.pdf");
  const validated = BankDocumentSchema.safeParse(result.extracted);

  if (!validated.success) {
    console.error("Extraction failed validation:", validated.error.flatten());
    process.exit(1);
  }

  console.log("Validated extraction:", validated.data);
}

main();
```
That second validation pass matters because it gives you a hard stop before bad data reaches onboarding or AML workflows.
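A concrete example of a banking constraint that belongs in this second pass is the IBAN mod-97 checksum from ISO 13616. This dependency-free sketch uses `BigInt` for the modulus; a production system would also check country-specific lengths:

```typescript
// ISO 13616 mod-97 check: move the first four characters to the end,
// map letters A-Z to 10-35, and the resulting number mod 97 must be 1.
export function isValidIban(iban: string): boolean {
  const normalized = iban.replace(/\s+/g, "").toUpperCase();
  if (!/^[A-Z]{2}[0-9]{2}[A-Z0-9]{1,30}$/.test(normalized)) return false;
  const rearranged = normalized.slice(4) + normalized.slice(0, 4);
  const numeric = rearranged.replace(/[A-Z]/g, (c) =>
    (c.charCodeAt(0) - 55).toString(),
  );
  return BigInt(numeric) % 97n === 1n;
}
```

Rejecting an extracted IBAN that fails this check catches both OCR errors and model hallucinations before they reach onboarding.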
Production Considerations
- Data residency
  - Keep document processing in-region if your banking policy requires it.
  - Choose model endpoints and storage buckets that match jurisdictional constraints.
- Auditability
  - Log prompt template version, model name, input hash, output hash, and validation result.
  - Store enough context for internal audit without dumping raw sensitive documents into general logs.
- Guardrails
  - Block unsupported fields like SSNs or card PANs unless the use case explicitly permits them.
- Monitoring
  - Monitor extraction accuracy by document type, not just global success rate.
  - Track rejection rates, schema failures, and manual review volume so you can spot drift early.
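Per-document-type monitoring can start as a simple aggregation over extraction log records. The record shape here is an assumption for illustration, not a prescribed log format:

```typescript
interface ExtractionLogEntry {
  documentType: string;
  schemaValid: boolean;
}

// Rejection rate per document type: the share of extractions that
// failed schema validation, so drift in one document class is visible
// even when the global success rate looks healthy.
export function rejectionRates(
  logs: ExtractionLogEntry[],
): Record<string, number> {
  const totals: Record<string, { total: number; rejected: number }> = {};
  for (const entry of logs) {
    const bucket = (totals[entry.documentType] ??= { total: 0, rejected: 0 });
    bucket.total += 1;
    if (!entry.schemaValid) bucket.rejected += 1;
  }
  const rates: Record<string, number> = {};
  for (const [type, { total, rejected }] of Object.entries(totals)) {
    rates[type] = rejected / total;
  }
  return rates;
}
```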
Common Pitfalls
- Using free-form text output
  - Problem: parsing natural language responses breaks as soon as formatting changes.
  - Fix: use `withStructuredOutput()` with a Zod schema and reject non-conforming output.
- Skipping provenance
  - Problem: compliance cannot verify where a field came from during an audit.
  - Fix: keep source file IDs, page numbers, hashes, and prompt versions with every extraction record.
- Letting the model infer missing values
  - Problem: hallucinated account numbers or dates create real operational risk.
  - Fix: instruct the model not to guess, validate required fields strictly, and route incomplete cases to human review.
A banking-grade document extraction agent is mostly about control. LangChain gives you the orchestration layer; your job is to make sure every output is typed, traceable, region-compliant, and safe enough to feed into regulated workflows.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.