How to Build a Document Extraction Agent Using LlamaIndex in TypeScript for Banking

By Cyprian Aarons · Updated 2026-04-21
document-extraction · llamaindex · typescript · banking

A document extraction agent in banking reads unstructured files like loan applications, KYC packets, bank statements, and trade finance forms, then turns them into structured fields your systems can validate and route. It matters because most operational delay in banking comes not from model quality but from manual document handling, inconsistent extraction, and weak auditability.

Architecture

  • Document ingestion layer

    • Pulls PDFs, DOCX, scans, or email attachments from approved storage.
    • Normalizes files before extraction so the agent sees consistent input.
  • Text extraction and parsing layer

    • Uses LlamaIndex readers to load documents into Document objects.
    • Handles OCR upstream if the source is scanned images.
  • Extraction schema

    • Defines the fields you need: customer name, account number, loan amount, effective date, tax ID, etc.
    • Keeps output deterministic and easy to validate against banking rules.
  • LLM-powered extractor

    • Uses LlamaIndex’s structured prediction flow to map text into a typed object.
    • Produces JSON that downstream systems can validate and store.
  • Validation and policy layer

    • Checks extracted values against business rules.
    • Flags missing fields, mismatched totals, expired IDs, or unsupported jurisdictions.
  • Audit and persistence layer

    • Stores raw input references, extracted output, confidence signals, and model version.
    • Required for compliance review, dispute handling, and traceability.
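The layers above can be sketched as data contracts before any LlamaIndex code is written. This is a minimal sketch; the interface and field names here are illustrative choices, not part of LlamaIndex.

```typescript
// Illustrative contracts for the pipeline stages. Names are hypothetical;
// adapt them to your own storage and audit systems.

interface DocumentRef {
  uri: string;          // location in approved storage
  sourceSystem: string; // e.g. onboarding portal, email gateway
}

interface ExtractedFields {
  customerName: string;
  documentType: string;
  accountNumber?: string;
}

interface AuditRecord {
  documentUri: string;
  extracted: ExtractedFields;
  modelVersion: string;
  validationIssues: string[];
  extractedAt: string; // ISO timestamp
}

// The audit layer ties every output back to its input and model version.
function buildAuditRecord(
  ref: DocumentRef,
  extracted: ExtractedFields,
  issues: string[],
  modelVersion: string,
): AuditRecord {
  return {
    documentUri: ref.uri,
    extracted,
    modelVersion,
    validationIssues: issues,
    extractedAt: new Date().toISOString(),
  };
}

const record = buildAuditRecord(
  { uri: "s3://approved-bucket/kyc/123.pdf", sourceSystem: "onboarding_portal" },
  { customerName: "Amina Patel", documentType: "kyc" },
  [],
  "gpt-4o-mini",
);
console.log(record.documentUri);
```

Keeping these contracts explicit makes the later validation and audit steps mechanical rather than ad hoc.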

Implementation

1) Install dependencies and set environment variables

Use the TypeScript package for LlamaIndex and a provider-compatible LLM. For this example, I’m using OpenAI through the OpenAI class in the separate @llamaindex/openai package, so install that alongside the core package.

npm install llamaindex @llamaindex/openai zod dotenv

Set your key:

export OPENAI_API_KEY="your-key"

2) Define the extraction schema

Keep the schema tight. Banking workflows need predictable output, not free-form summaries.

import "dotenv/config";
import { z } from "zod";
import { Document } from "llamaindex";
import { OpenAI } from "@llamaindex/openai";

const BankingDocSchema = z.object({
  customerName: z.string(),
  documentType: z.enum(["kyc", "loan_application", "bank_statement", "invoice", "other"]),
  accountNumber: z.string().optional(),
  taxId: z.string().optional(),
  amount: z.number().optional(),
  currency: z.string().optional(),
  effectiveDate: z.string().optional(),
  expiryDate: z.string().optional(),
  riskFlags: z.array(z.string()).default([]),
});

type BankingDoc = z.infer<typeof BankingDocSchema>;

3) Load documents and run structured extraction

This pattern uses Document, OpenAI, and structuredPredict to turn raw text into typed JSON. In production you would load from approved storage after OCR/PDF parsing; here I’m using inline text so the pattern is clear.

import { Settings } from "llamaindex";

async function extractBankingFields(): Promise<BankingDoc> {
  const llm = new OpenAI({
    model: "gpt-4o-mini",
    temperature: 0,
  });

  Settings.llm = llm;

  const doc = new Document({
    text: `
      Customer Name: Amina Patel
      Document Type: KYC
      Account Number: 0091842231
      Tax ID: TIN-88421
      Effective Date: 2025-01-12
      Expiry Date: 2027-01-12
      Notes: Address verification completed. No sanctions hit found.
    `,
    metadata: {
      sourceSystem: "onboarding_portal",
      region: "eu-west-1",
      retentionClass: "regulated",
    },
  });

  const result = await llm.structuredPredict(BankingDocSchema, {
    prompt:
      "Extract banking document fields exactly as structured data. If a field is missing, omit it unless required by schema.",
    context: doc.text,
  });

  // Re-validate with zod so defaults (e.g. riskFlags: []) are applied and
  // malformed model output fails fast instead of flowing downstream.
  return BankingDocSchema.parse(result);
}

extractBankingFields()
  .then((data) => console.log(JSON.stringify(data, null, 2)))
  .catch((err) => console.error("extraction failed:", err));

4) Add validation before downstream write

Extraction is not the finish line. Banking systems should reject incomplete or suspicious records before they hit core workflows.

function validateExtraction(data: BankingDoc): string[] {
  const issues: string[] = [];

  if (!data.customerName?.trim()) issues.push("missing_customer_name");
  if (data.documentType === "kyc" && !data.taxId) issues.push("missing_tax_id");

  if (data.amount !== undefined && data.amount <= 0) {
    issues.push("invalid_amount");
  }

  // ISO 8601 date strings (YYYY-MM-DD) compare correctly as plain strings.
  if (data.expiryDate && data.effectiveDate && data.expiryDate < data.effectiveDate) {
    issues.push("expiry_before_effective");
  }

  return issues;
}

A practical flow is:

  1. Load document from approved storage.
  2. Extract structured fields with structuredPredict.
  3. Validate against business rules.
  4. Write both raw output and validation results to an audit store.
  5. Route exceptions to human review.
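The five steps above can be composed into one orchestration function. This is a sketch with the loader, extractor, and audit store stubbed out so the control flow is easy to see; in a real deployment the extract step calls structuredPredict and the audit step writes to durable storage.

```typescript
// Hypothetical orchestration of the five-step flow.

interface Extracted {
  customerName: string;
  documentType: string;
  taxId?: string;
}

type Route = "auto_approve" | "human_review";

// Step 3: business-rule validation (abbreviated).
function validate(data: Extracted): string[] {
  const issues: string[] = [];
  if (!data.customerName.trim()) issues.push("missing_customer_name");
  if (data.documentType === "kyc" && !data.taxId) issues.push("missing_tax_id");
  return issues;
}

async function processDocument(
  load: () => Promise<string>,                                 // step 1
  extract: (text: string) => Promise<Extracted>,               // step 2
  audit: (data: Extracted, issues: string[]) => Promise<void>, // step 4
): Promise<Route> {
  const text = await load();
  const data = await extract(text);
  const issues = validate(data);
  await audit(data, issues);
  // Step 5: anything with open issues goes to human review.
  return issues.length === 0 ? "auto_approve" : "human_review";
}

// Usage with stubs: a KYC record missing its tax ID is routed to review.
processDocument(
  async () => "Customer Name: Amina Patel\nDocument Type: kyc",
  async () => ({ customerName: "Amina Patel", documentType: "kyc" }),
  async () => {},
).then((route) => console.log(route)); // "human_review"
```

Passing the stages in as functions keeps the flow testable: you can exercise the routing logic without a model call.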

Production Considerations

  • Data residency

    • Keep ingestion and inference in-region when handling regulated customer data.
    • If your bank requires EU-only processing or local sovereign cloud boundaries, enforce that at the deployment layer before any document leaves the boundary.
  • Auditability

    • Persist the original document reference, extracted JSON, prompt version, model name, timestamp, and reviewer overrides.
    • Auditors will ask how a field was produced; you need a reproducible chain of custody.
  • Guardrails

    • Add schema validation with zod before writing to downstream systems.
    • Block high-risk actions like account creation or credit decisions until human review clears low-confidence or missing-field cases.
    • Redact PII in logs; never dump raw documents into application logs.
  • Monitoring

    • Track extraction failure rate by document type and source system.
    • Measure field-level accuracy on sampled reviews; overall success rate hides bad extractions on critical fields like account number or tax ID.
    • Alert on spikes in empty outputs or repeated validation failures after model upgrades.
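The log-redaction guardrail above can start as simple string masking applied before anything reaches a log sink. A minimal sketch; the two patterns (long digit runs for account numbers, TIN-prefixed tax IDs) are illustrative and should be tuned to your document formats:

```typescript
// Mask account-number-like digit runs and TIN-style identifiers before
// logging. Patterns are illustrative, not exhaustive.
function redactForLogs(text: string): string {
  return text
    .replace(/\b\d{6,}\b/g, (m) => m.slice(0, 2) + "*".repeat(m.length - 2))
    .replace(/\bTIN-\w+/g, "TIN-*****");
}

console.log(redactForLogs("Account Number: 0091842231, Tax ID: TIN-88421"));
// Account Number: 00********, Tax ID: TIN-*****
```

Route every log statement through a helper like this so redaction is enforced in one place rather than remembered at each call site.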

Common Pitfalls

  1. Using free-form prompts instead of a strict schema

    • This creates inconsistent output and brittle downstream parsing.
    • Avoid it by using zod schemas with structuredPredict so every response matches a known contract.
  2. Skipping OCR quality checks

    • Scanned banking documents often fail because the text layer is garbage before the model even sees it.
    • Fix this by running OCR upstream and rejecting low-confidence scans before extraction starts.
  3. Treating extracted data as final truth

    • LLMs can miss digits, confuse dates, or infer values that are not explicitly present.
    • Always validate against business rules and send edge cases to operations for review.
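Pitfall 2 can be enforced with a simple gate: drop pages whose OCR confidence falls below a threshold before any LLM call. A sketch, assuming your OCR engine reports per-page confidence (the field names here are hypothetical):

```typescript
// Hypothetical shape of OCR output; most engines report something similar.
interface OcrPage {
  pageNumber: number;
  text: string;
  confidence: number; // 0..1
}

// Split pages into those clean enough to extract from and those that
// need a rescan or manual handling.
function gateOcrPages(
  pages: OcrPage[],
  minConfidence = 0.85,
): { accepted: OcrPage[]; rejected: OcrPage[] } {
  return {
    accepted: pages.filter((p) => p.confidence >= minConfidence),
    rejected: pages.filter((p) => p.confidence < minConfidence),
  };
}

const { accepted, rejected } = gateOcrPages([
  { pageNumber: 1, text: "Customer Name: Amina Patel", confidence: 0.97 },
  { pageNumber: 2, text: "c u5t0m3r n m", confidence: 0.41 },
]);
console.log(accepted.length, rejected.length); // 1 1
```

Rejected pages should go back through rescanning or to operations, never silently into the extraction prompt.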

A banking document extraction agent should be boring in production. It should produce consistent JSON, leave an audit trail, respect residency constraints, and fail closed when the input is weak or ambiguous.



By Cyprian Aarons, AI Consultant at Topiax.
