How to Build a Document Extraction Agent Using LlamaIndex in TypeScript for Retail Banking

By Cyprian Aarons. Updated 2026-04-21.

Tags: document-extraction, llamaindex, typescript, retail-banking

A document extraction agent for retail banking takes unstructured files like bank statements, payslips, utility bills, and KYC forms, then turns them into structured fields your systems can validate and store. It matters because onboarding, loan origination, and fraud checks all depend on getting those fields right, with an audit trail that can survive compliance review.

Architecture

•Document ingestion layer
  - Accept PDFs, images, and office documents from channels like branch upload, mobile app, or back-office queues.
  - Normalize files before extraction so the model sees consistent input.
•OCR and text parsing layer
  - Use OCR for scanned documents and a parser for digital PDFs.
  - Keep page-level metadata so you can trace every extracted field back to its source.
•LlamaIndex extraction layer
  - Use Document objects plus an LLM-backed extraction pipeline.
  - Structure outputs with schema-first parsing so downstream systems get typed JSON, not free text.
•Validation and policy layer
  - Enforce retail banking rules like name matching, address format checks, income thresholds, and document freshness.
  - Reject or flag low-confidence extractions for manual review.
•Audit and storage layer
  - Persist raw input hash, extracted payload, model version, prompt version, and timestamps.
  - Store results in a region that matches your data residency requirement.
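
The audit and storage layer above can be sketched as a small record builder. This is a minimal sketch under assumptions, not a definitive implementation: the `AuditRecord` shape, the `buildAuditRecord` helper, and the version strings are hypothetical names chosen for illustration. Only `createHash` from Node's built-in `node:crypto` module is a real API here.

```typescript
import { createHash } from "node:crypto";

// Hypothetical audit record shape; adapt the field names to your storage model.
interface AuditRecord {
  inputSha256: string; // hash of the raw file bytes, never the file itself
  extractedPayload: unknown;
  modelVersion: string;
  promptVersion: string;
  createdAt: string; // ISO 8601 timestamp
}

// Builds one audit record per extraction run.
function buildAuditRecord(
  rawInput: Buffer,
  extractedPayload: unknown,
  modelVersion: string,
  promptVersion: string,
  now: Date = new Date(),
): AuditRecord {
  const inputSha256 = createHash("sha256").update(rawInput).digest("hex");
  return {
    inputSha256,
    extractedPayload,
    modelVersion,
    promptVersion,
    createdAt: now.toISOString(),
  };
}

// Example values are placeholders, not recommendations.
const exampleRecord = buildAuditRecord(
  Buffer.from("sample statement text"),
  { fullName: "Jane Doe" },
  "gpt-4o-mini",
  "extract-v1",
);
console.log(exampleRecord);
```

Hashing the raw bytes rather than the parsed text means the record still identifies the original upload even if the OCR or parsing step changes later.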

Implementation

•Install the core dependencies

Use LlamaIndex’s TypeScript packages with an OpenAI-compatible model. In production banking systems, I prefer keeping the model behind a controlled provider wrapper so you can swap vendors without rewriting the extraction flow.

npm install llamaindex @llamaindex/openai zod
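
The provider wrapper mentioned above could look roughly like the following. It is a sketch under assumptions: `ExtractionModel`, `StubModel`, and `extractWith` are hypothetical names, not LlamaIndex APIs, and a real adapter would call your vendor's SDK inside `complete`.

```typescript
// Hypothetical vendor-neutral contract: the extraction flow depends only on this.
interface ExtractionModel {
  readonly name: string;
  complete(systemPrompt: string, documentText: string): Promise<string>;
}

// Stand-in implementation for local tests; a real adapter would wrap a vendor SDK.
class StubModel implements ExtractionModel {
  readonly name = "stub-model";
  private readonly cannedReply: string;

  constructor(cannedReply: string) {
    this.cannedReply = cannedReply;
  }

  async complete(_systemPrompt: string, _documentText: string): Promise<string> {
    return this.cannedReply;
  }
}

// The flow only sees the interface, so swapping vendors means swapping the adapter.
async function extractWith(
  model: ExtractionModel,
  documentText: string,
): Promise<{ model: string; payload: unknown }> {
  const reply = await model.complete(
    "Return the requested fields as a single JSON object.",
    documentText,
  );
  return { model: model.name, payload: JSON.parse(reply) };
}

extractWith(new StubModel('{"fullName":"Jane Doe"}'), "sample text").then((result) =>
  console.log(result),
);
```

Returning the model name alongside the payload keeps the audit trail intact when the provider behind the interface changes.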

•Define the extraction schema

Banking extraction works best when you force a strict shape up front. Here we define a schema for a common retail banking use case: proof of income or identity verification.

import { z } from "zod";

export const BankingDocumentSchema = z.object({
  documentType: z.enum(["bank_statement", "payslip", "utility_bill", "id_document"]),
  fullName: z.string(),
  documentDate: z.string(),
  accountNumber: z.string().optional(),
  address: z.string().optional(),
  employerName: z.string().optional(),
  grossIncome: z.number().optional(),
  currency: z.string().optional(),
});

•Load the document and extract structured data with LlamaIndex

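The key pattern is: load the file into a Document, send its text to the LLM with a JSON-only instruction, then validate the reply against the schema. For single-document extraction jobs, this is usually enough; you do not need an index or a complex RAG setup. The sketch assumes the OpenAI class from @llamaindex/openai and its JSON mode option; option names can differ between LlamaIndex.TS versions, so check them against the version you install.

import fs from "node:fs";
import path from "node:path";
import { Document } from "llamaindex";
import { OpenAI } from "@llamaindex/openai";
import { BankingDocumentSchema } from "./schema";

const SYSTEM_PROMPT =
  "Extract banking fields exactly as they appear in the document. " +
  "Reply with one JSON object using only these keys: documentType, fullName, " +
  "documentDate, accountNumber, address, employerName, grossIncome, currency. " +
  "documentType must be one of bank_statement, payslip, utility_bill, id_document. " +
  "Omit any optional field that is not present; never guess a value.";

async function run() {
  const filePath = path.resolve("./samples/bank-statement.txt");
  const rawText = fs.readFileSync(filePath, "utf8");

  // Wrap the text in a Document so source metadata travels with it.
  const doc = new Document({
    text: rawText,
    metadata: {
      sourceFile: path.basename(filePath),
      region: "eu-west-1",
      retentionClass: "kyc",
    },
  });

  const llm = new OpenAI({
    model: "gpt-4o-mini",
    apiKey: process.env.OPENAI_API_KEY,
    temperature: 0,
    // JSON mode: the reply must be a single JSON object.
    additionalChatOptions: { response_format: { type: "json_object" } },
  });

  const response = await llm.chat({
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: doc.text },
    ],
  });

  const content = response.message.content;
  if (typeof content !== "string") {
    throw new Error("Expected a plain-text JSON reply from the model");
  }

  // Validate the reply; a failed parse belongs in manual review, not in storage.
  const parsed = BankingDocumentSchema.safeParse(JSON.parse(content));
  if (!parsed.success) {
    throw new Error(`Schema validation failed: ${parsed.error.message}`);
  }

  console.log(JSON.stringify({ ...parsed.data, source: doc.metadata }, null, 2));
}

run().catch(console.error);

Validating the reply against the schema gives you typed output instead of hand-parsed model text. In retail banking, that matters because every downstream rule engine expects stable keys like fullName, documentDate, and accountNumber. Note that the schema marks missing fields as optional rather than nullable, so the prompt asks the model to omit absent fields instead of returning null.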

•Add validation and routing for manual review

Extraction alone is not enough. You need deterministic checks before anything reaches onboarding or credit decisioning.

type ExtractionResult = {
  fullName?: string;
  documentDate?: string;
  accountNumber?: string;
};

function needsReview(result: ExtractionResult): boolean {
  if (!result.fullName || !result.documentDate) return true;

  // Example policy checks
  if (result.accountNumber && result.accountNumber.length < 8) return true;

  return false;
}

Use this gate to route suspicious cases to ops staff. For example:

•Missing name on an ID document
•Statement date older than your allowed KYC window
•Account number format mismatch
•OCR confidence below threshold
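
The review cases listed above can be folded into a gate that returns reasons instead of a bare boolean, which gives ops staff something to act on. This is a sketch under stated assumptions: `ReviewInput`, `ReviewPolicy`, `examplePolicy`, and `reviewReasons` are hypothetical names, the thresholds are illustrative only, and `ocrConfidence` is assumed to be a 0 to 1 score supplied by your OCR layer.

```typescript
// Hypothetical input: extracted fields plus an OCR confidence score (0 to 1).
interface ReviewInput {
  documentType: string;
  fullName?: string;
  documentDate?: string; // expected as an ISO 8601 date, e.g. "2026-03-01"
  accountNumber?: string;
  ocrConfidence: number;
}

interface ReviewPolicy {
  maxAgeDays: number;
  minOcrConfidence: number;
  accountNumberPattern: RegExp;
}

// Illustrative values only; real thresholds come from your KYC policy.
const examplePolicy: ReviewPolicy = {
  maxAgeDays: 90,
  minOcrConfidence: 0.85,
  accountNumberPattern: /^[0-9]{8,18}$/,
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns every reason a document needs manual review; empty means it can proceed.
function reviewReasons(input: ReviewInput, policy: ReviewPolicy, today: Date): string[] {
  const reasons: string[] = [];
  if (!input.fullName) reasons.push("missing_name");

  const parsed = input.documentDate ? Date.parse(input.documentDate) : NaN;
  if (Number.isNaN(parsed)) {
    reasons.push("missing_or_invalid_date");
  } else if ((today.getTime() - parsed) / MS_PER_DAY > policy.maxAgeDays) {
    reasons.push("document_too_old");
  }

  if (input.accountNumber && !policy.accountNumberPattern.test(input.accountNumber)) {
    reasons.push("account_number_format");
  }
  if (input.ocrConfidence < policy.minOcrConfidence) {
    reasons.push("low_ocr_confidence");
  }
  return reasons;
}
```

Passing `today` in as a parameter keeps the freshness check deterministic, so the same input always produces the same routing decision in tests and in audit replays.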

Production Considerations

•Data residency
  - Keep source documents and extracted payloads in-region.
  - If you process EU retail customers, make sure both storage and model traffic follow your residency policy.
•Auditability
  - Log documentId, hash of original file, prompt version, model name, schema version, and final output.
  - Store enough context to reconstruct why a field was accepted or rejected during compliance review.
•Guardrails
  - Do not let the model invent missing values.
  - Treat optional fields as optional; force nulls or omissions rather than guessed data.
  - Add deterministic validators for dates, currency codes, IBAN/account formats, and address structure.
•Monitoring
  - Track extraction accuracy by document type.
  - Watch for drift when statement layouts change or OCR quality drops.
  - Alert on spikes in manual review rates; that usually means template drift or upstream scan issues.
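
The deterministic validators named under Guardrails can be small pure functions. The following is a minimal sketch, assuming dates arrive as YYYY-MM-DD strings; the function names are hypothetical. The IBAN check uses the standard mod-97 checksum but does not enforce country-specific lengths, and the currency check only tests the three-letter shape, not membership in the ISO 4217 list.

```typescript
// True only for real calendar dates written as YYYY-MM-DD.
function isIsoDate(value: string): boolean {
  const m = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (!m) return false;
  const year = Number(m[1]);
  const month = Number(m[2]);
  const day = Number(m[3]);
  const date = new Date(Date.UTC(year, month - 1, day));
  // Round-tripping rejects impossible dates such as 2026-02-30.
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}

// Shape check only: three uppercase letters, e.g. "EUR".
function hasCurrencyCodeShape(value: string): boolean {
  return /^[A-Z]{3}$/.test(value);
}

// Mod-97 checksum: move the first four characters to the end, map letters
// to 10..35, and require the resulting number to leave remainder 1.
function isValidIban(raw: string): boolean {
  const iban = raw.replace(/\s+/g, "").toUpperCase();
  if (!/^[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}$/.test(iban)) return false;
  const rearranged = iban.slice(4) + iban.slice(0, 4);
  let remainder = 0;
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i);
    const value = code >= 65 ? code - 55 : code - 48; // 'A' -> 10, '0' -> 0
    // Letters contribute two decimal digits, digits contribute one.
    remainder = (remainder * (value > 9 ? 100 : 10) + value) % 97;
  }
  return remainder === 1;
}
```

Running these checks after schema validation keeps the model out of the decision: a value either passes a fixed rule or goes to manual review.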

Common Pitfalls

•Using free-form prompts without a schema

This produces inconsistent JSON-like output that breaks downstream systems. Always bind extraction to a strict Zod schema or equivalent typed contract.

•Skipping OCR normalization

Scanned bank statements often contain broken line order and merged columns. If you feed raw OCR noise directly into the extractor, field accuracy drops fast.

•Ignoring compliance metadata

Teams often store only the extracted fields and lose source traceability. Keep document hashes, timestamps, region tags, and model versions so audit teams can reconstruct the decision path later.
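
For the OCR normalization pitfall above, a light cleanup pass before extraction might look like the following. This is a sketch, and `normalizeOcrText` is a hypothetical helper: it only handles character-level noise (line endings, control characters, whitespace runs). Broken line order and merged columns need layout-aware OCR output and are outside what a text-only pass can repair.

```typescript
// Cleans character-level OCR noise without touching the words themselves.
function normalizeOcrText(raw: string): string {
  return raw
    .normalize("NFKC") // fold ligatures and full-width characters
    .replace(/\r\n?/g, "\n") // unify line endings
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, "") // drop control characters, keep \n and \t
    .replace(/[ \t]+/g, " ") // collapse runs of spaces and tabs
    .split("\n")
    .map((line) => line.trim())
    .join("\n")
    .replace(/\n{3,}/g, "\n\n") // keep at most one blank line between blocks
    .trim();
}

const noisy = "Account  Holder:\tJane Doe\r\n\r\n\r\n\r\nBalance:   1,200.00 \u0007EUR  ";
console.log(normalizeOcrText(noisy));
```

The pass deliberately avoids rejoining hyphenated line breaks, since that could alter legitimately hyphenated customer names.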


By Cyprian Aarons, AI Consultant at Topiax.
