How to Build a Document Extraction Agent Using LangChain in TypeScript for Fintech
A document extraction agent takes unstructured fintech documents like bank statements, invoices, loan applications, KYC forms, and trade confirmations, then turns them into validated structured data. That matters because the real value in fintech is not just reading documents: it is reducing manual ops, preventing bad data from entering downstream systems, and keeping every extracted field auditable for compliance.
Architecture
- Document ingestion layer
  - Accepts PDFs, images, or text exports from S3, blob storage, email attachments, or internal upload APIs.
  - Normalizes the input into text before extraction.
- OCR / text extraction layer
  - Uses a deterministic OCR service when documents are scanned or image-based.
  - Keeps the extraction agent focused on structuring data, not guessing from pixels.
- LangChain extraction chain
  - Uses `ChatOpenAI` with `StructuredOutputParser` or schema-driven output.
  - Extracts fields into a typed object instead of free-form text.
- Validation and policy layer
  - Checks required fields, formats, ranges, and cross-field consistency.
  - Applies fintech rules like currency codes, date validity, account number length, and jurisdiction constraints.
- Audit and persistence layer
  - Stores raw input hash, model output, validation results, and prompt version.
  - Supports traceability for disputes, internal audit, and regulatory reviews.
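The layers above compose into one typed pipeline. Here is a minimal sketch of that composition; the stage names and types are illustrative assumptions, not part of LangChain or any library:

```typescript
// Illustrative sketch only: these stage names and types are assumptions,
// not a library API. Each method maps to one layer described above.
type RawDocument = { source: string; text: string };
type CandidateFields = Record<string, unknown>;

interface PipelineStages<T> {
  ingest: (doc: RawDocument) => string;                // ingestion + OCR: normalize to text
  extract: (text: string) => Promise<CandidateFields>; // LangChain extraction chain
  validate: (fields: CandidateFields) => T;            // validation and policy layer (throws on bad data)
  persist: (record: T) => Promise<void>;               // audit and persistence layer
}

export async function runPipeline<T>(
  doc: RawDocument,
  stages: PipelineStages<T>
): Promise<T> {
  const text = stages.ingest(doc);
  const candidate = await stages.extract(text);
  const validated = stages.validate(candidate); // bad data never reaches persistence
  await stages.persist(validated);
  return validated;
}
```

The point of the shape is ordering: validation sits between extraction and persistence, so a thrown validation error stops the write.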
Implementation
1) Define the schema you want to extract
For fintech use cases, do not start with “extract everything.” Start with a strict schema that matches your downstream workflow. Here’s an example for invoice extraction.
```typescript
import { z } from "zod";

export const InvoiceSchema = z.object({
  vendorName: z.string().min(1),
  invoiceNumber: z.string().min(1),
  invoiceDate: z.string().min(1), // validate ISO format downstream
  currency: z.string().length(3),
  totalAmount: z.number(),
  dueDate: z.string().optional(),
  vatNumber: z.string().optional(),
});

export type Invoice = z.infer<typeof InvoiceSchema>;
```
This matters because fintech teams need predictable outputs. If you let the model return arbitrary JSON keys, your reconciliation pipeline will break the first time the model decides to rename totalAmount to amount_due.
2) Build a LangChain extractor with structured output
Use ChatOpenAI plus .withStructuredOutput() so the model returns typed data that matches your Zod schema. This is cleaner than hand-parsing raw JSON strings.
```typescript
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";
import { InvoiceSchema } from "./schema";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const extractor = llm.withStructuredOutput(InvoiceSchema);

export async function extractInvoice(text: string) {
  const prompt = `
You are extracting invoice data for a fintech AP workflow.
Return only fields present in the schema.
If a field is missing in the document, omit it or leave it null where appropriate.

Document:
${text}
`;

  const result = await extractor.invoke([new HumanMessage(prompt)]);
  return result;
}
```
That pattern uses actual LangChain APIs:
- `ChatOpenAI`
- `withStructuredOutput()`
- `invoke()`
- `HumanMessage`
If you need stronger control over field descriptions and format hints, add descriptions to the Zod schema or move to a parser-driven prompt. For most extraction agents in production, structured output is enough if your schema is tight.
3) Add validation and normalization before writing to your system of record
The LLM should not be your source of truth. Treat its output as candidate data that must pass deterministic checks before persistence.
```typescript
import { InvoiceSchema } from "./schema";
import { extractInvoice } from "./extractor"; // the extractor from step 2

function normalizeCurrency(currency: string) {
  return currency.toUpperCase();
}

export async function processInvoice(text: string) {
  const extracted = await extractInvoice(text);

  const parsed = InvoiceSchema.parse({
    ...extracted,
    currency: normalizeCurrency(extracted.currency),
    totalAmount: Number(extracted.totalAmount),
  });

  if (!/^[A-Z]{3}$/.test(parsed.currency)) {
    throw new Error(`Invalid currency code: ${parsed.currency}`);
  }

  return parsed;
}
```
For fintech workflows:
- Validate dates against business rules
- Check totals against line-item sums if available
- Reject unsupported currencies
- Flag missing tax IDs or registration numbers for manual review
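Those rules are plain deterministic code, not LLM calls. A sketch of cross-field checks for dates and line-item totals, assuming an invoice shape with optional line items (the `LineItem` type and the one-cent tolerance are illustrative choices):

```typescript
// Illustrative cross-field checks; returns human-readable errors
// instead of throwing, so callers can route to manual review.
type LineItem = { description: string; amount: number };

export function crossFieldChecks(invoice: {
  invoiceDate: string;
  dueDate?: string;
  totalAmount: number;
  lineItems?: LineItem[];
}): string[] {
  const errors: string[] = [];

  // Dates must parse, and the due date cannot precede the issue date.
  const issued = Date.parse(invoice.invoiceDate);
  if (Number.isNaN(issued)) errors.push("invoiceDate is not a valid date");
  if (invoice.dueDate) {
    const due = Date.parse(invoice.dueDate);
    if (Number.isNaN(due)) errors.push("dueDate is not a valid date");
    else if (!Number.isNaN(issued) && due < issued) errors.push("dueDate precedes invoiceDate");
  }

  // If line items are present, they must sum to the stated total (cent tolerance).
  if (invoice.lineItems?.length) {
    const sum = invoice.lineItems.reduce((acc, li) => acc + li.amount, 0);
    if (Math.abs(sum - invoice.totalAmount) > 0.01) {
      errors.push(`line items sum to ${sum.toFixed(2)}, not ${invoice.totalAmount.toFixed(2)}`);
    }
  }

  return errors;
}
```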
4) Store audit metadata alongside extracted fields
You need an evidence trail. In regulated environments, being able to show what was extracted, when it was extracted, and which model/prompt version did it is non-negotiable.
```typescript
import type { Invoice } from "./schema";

type AuditRecord = {
  documentHash: string;
  promptVersion: string;
  modelName: string;
  extractedAt: string;
};

export async function persistExtraction(
  invoice: Invoice,
  audit: AuditRecord
) {
  // Write both the invoice and its audit metadata to your database.
  console.log({ invoice, audit });
}
```
A practical production record should include:
- Raw document hash
- OCR text hash
- Model name and version
- Prompt version
- Validation errors
- Human override status
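A sketch of assembling such a record, using Node's standard `crypto` module for the hashes; the record shape and function name are illustrative:

```typescript
import { createHash } from "node:crypto";

// Illustrative builder for the audit fields listed above.
// The hashing is standard Node crypto; the record shape is an assumption.
export function buildAuditRecord(
  rawDocument: Buffer,
  ocrText: string,
  opts: { promptVersion: string; modelName: string }
) {
  return {
    documentHash: createHash("sha256").update(rawDocument).digest("hex"),
    ocrTextHash: createHash("sha256").update(ocrText, "utf8").digest("hex"),
    promptVersion: opts.promptVersion,
    modelName: opts.modelName,
    extractedAt: new Date().toISOString(),
    validationErrors: [] as string[], // filled in by the validation layer
    humanOverride: false,             // flipped when a reviewer edits a field
  };
}
```

Hashing both the raw bytes and the OCR text lets you later prove which exact input produced which output, even if the OCR service changes.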
Production Considerations
- Data residency
  - Route documents through region-specific infrastructure if you handle EU banking or local regulatory workloads.
  - Do not send sensitive documents to models hosted outside approved jurisdictions without legal approval.
- Monitoring
  - Track extraction accuracy by field, not just overall success rate.
  - Measure null rates for critical fields like invoice number, amount, IBAN, tax ID, and date.
- Guardrails
  - Block outputs that fail schema validation before they hit downstream finance systems.
  - Add deterministic checks for country-specific identifiers and known business rules.
- Auditability
  - Log prompt versions and model versions with every extraction.
  - Keep immutable records for disputes and compliance reviews.
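As one concrete monitoring example, per-field null rates can be computed with a few lines of plain TypeScript; the function below is an illustrative sketch, not a metrics-library API — in production you would feed these numbers into your metrics system:

```typescript
// Illustrative null-rate tracker: fraction of records where each
// critical field came back null, undefined, or empty.
export function fieldNullRates(
  records: Array<Record<string, unknown>>,
  fields: string[]
): Record<string, number> {
  const rates: Record<string, number> = {};
  for (const field of fields) {
    const missing = records.filter(
      (r) => r[field] === null || r[field] === undefined || r[field] === ""
    ).length;
    rates[field] = records.length ? missing / records.length : 0;
  }
  return rates;
}
```

A sudden jump in the null rate for one field usually means a document layout change or a prompt regression, long before anyone notices in aggregate accuracy.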
Common Pitfalls
- Using free-form text output instead of structured output
  - This creates brittle parsing logic and silent failures.
  - Fix it by using `withStructuredOutput()` plus Zod validation.
- Trusting OCR text without validation
  - OCR errors on account numbers, invoice totals, and dates will poison your pipeline.
  - Fix it with checksum-style checks where possible and human review for low-confidence cases.
- Skipping compliance metadata
  - If you cannot prove what was extracted and how it was processed, audit becomes expensive fast.
  - Fix it by persisting hashes, prompt versions, model IDs, timestamps, and reviewer actions.
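For the checksum-style checks mentioned above, one concrete example is the standard IBAN mod-97 validation (ISO 13616), which catches most OCR-corrupted account numbers deterministically:

```typescript
// Standard IBAN mod-97 check (ISO 13616). An OCR error in any
// character almost always breaks the checksum, so this rejects
// corrupted account numbers before they reach payment systems.
export function isValidIban(iban: string): boolean {
  const cleaned = iban.replace(/\s+/g, "").toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(cleaned)) return false;

  // Move the first four chars to the end, then map letters to numbers (A=10 … Z=35).
  const rearranged = cleaned.slice(4) + cleaned.slice(0, 4);
  const digits = rearranged.replace(/[A-Z]/g, (c) => String(c.charCodeAt(0) - 55));

  // Compute mod 97 digit by digit to avoid big-integer overflow.
  let remainder = 0;
  for (const d of digits) {
    remainder = (remainder * 10 + Number(d)) % 97;
  }
  return remainder === 1;
}
```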
A document extraction agent in fintech is not just an LLM wrapper. It is a controlled data pipeline with strict schemas, deterministic validation, traceable outputs, and regional compliance boundaries. Build it that way from day one if you want something operations can trust.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.