How to Build a Document Extraction Agent Using LlamaIndex in TypeScript for Lending
A document extraction agent for lending reads borrower docs like payslips, bank statements, tax returns, and IDs, then turns them into structured fields your underwriting workflow can trust. It matters because lending decisions depend on speed, accuracy, and traceability; if you extract income, liabilities, and identity data incorrectly, you create credit risk and compliance risk at the same time.
Architecture
- Document intake layer
  - Accept PDFs, scanned images, and office docs from a secure upload service or object store.
  - Normalize filenames, enforce size limits, and attach request metadata like `applicationId` and `jurisdiction`.
- OCR / parsing layer
  - Convert scans into text before extraction.
  - For digital PDFs, extract text directly where possible to preserve layout fidelity.
- LlamaIndex ingestion pipeline
  - Use `Document`, `SentenceSplitter`, and an index/query layer to make the content searchable.
  - Keep chunking deterministic so field extraction is repeatable for audit.
- Extraction agent
  - Use LlamaIndex's TypeScript `QueryEngine` or `ChatEngine` over the indexed document set.
  - Return structured JSON for fields such as employer name, gross income, net income, account balances, and document dates.
- Validation and policy layer
  - Validate outputs against schemas and lending rules.
  - Flag missing evidence, conflicting values across documents, or stale documents older than policy thresholds.
- Audit and storage layer
  - Store source document references, extracted values, confidence notes, prompt/version metadata, and reviewer overrides.
  - This is mandatory if you need explainability during model risk review or regulator audits.
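The audit layer above can be sketched as a plain record type plus a content hash. The field names and helper below are illustrative assumptions, not a LlamaIndex API or a regulatory standard:

```typescript
import { createHash } from "node:crypto";

// Illustrative audit record shape; field names are assumptions.
interface ExtractionAuditRecord {
  applicationId: string;
  sourceDocumentId: string;
  sourceDocumentSha256: string; // hash of the exact text that was indexed
  extractedJson: Record<string, unknown>;
  modelVersion: string;
  promptVersion: string;
  retrievalParams: { similarityTopK: number };
  reviewerOverride: string | null;
}

function buildAuditRecord(
  applicationId: string,
  sourceDocumentId: string,
  sourceText: string,
  extractedJson: Record<string, unknown>,
): ExtractionAuditRecord {
  return {
    applicationId,
    sourceDocumentId,
    // Hashing the indexed text lets a reviewer verify the extraction
    // ran against an unmodified document.
    sourceDocumentSha256: createHash("sha256").update(sourceText).digest("hex"),
    extractedJson,
    modelVersion: "gpt-4o-mini",
    promptVersion: "extract-v1",
    retrievalParams: { similarityTopK: 4 },
    reviewerOverride: null,
  };
}
```

Persist one record per extraction run so every field in underwriting can be traced back to a specific document, model, and prompt version.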
Implementation
1) Install the TypeScript packages
Use the TypeScript SDK plus an LLM provider. The exact package names can vary by provider setup, but the LlamaIndex side stays consistent.
```shell
npm install llamaindex @llamaindex/openai zod dotenv
```
Set your environment variables:
```shell
export OPENAI_API_KEY="your-key"
```
2) Load lending documents as LlamaIndex Document objects
This pattern keeps each source file traceable. In lending, you want to preserve document boundaries so you can say exactly which page or file produced a field.
```typescript
import "dotenv/config";
import { Document } from "llamaindex";

type UploadedFile = {
  id: string;
  filename: string;
  text: string;
};

const files: UploadedFile[] = [
  {
    id: "paystub-001",
    filename: "paystub_march.pdf",
    text: `
Acme Corp
Employee: Jane Doe
Gross Pay: $8,500.00
Net Pay: $6,120.00
Pay Date: 2026-03-31
`,
  },
  {
    id: "bankstmt-001",
    filename: "bank_statement.pdf",
    text: `
Account Holder: Jane Doe
Ending Balance: $14,220.55
Statement Date: 2026-03-31
`,
  },
];

const documents = files.map(
  (file) =>
    new Document({
      id_: file.id,
      text: file.text,
      metadata: {
        filename: file.filename,
        sourceSystem: "borrower-upload",
        docType: file.filename.includes("paystub") ? "paystub" : "bank_statement",
      },
    }),
);
```
3) Build an index and query it with a strict extraction prompt
For extraction tasks in lending, keep the prompt narrow. Ask for only the fields you need and force JSON output so downstream validation is deterministic.
```typescript
import { VectorStoreIndex, Settings } from "llamaindex";
import { OpenAI } from "@llamaindex/openai";

Settings.llm = new OpenAI({
  model: "gpt-4o-mini",
});

const index = await VectorStoreIndex.fromDocuments(documents);

const queryEngine = index.asQueryEngine({
  similarityTopK: 4,
});

const response = await queryEngine.query({
  query: `Extract borrower financial facts for a lending application.
Return valid JSON with these keys:
borrower_name, employer_name, gross_pay_monthly_usd, net_pay_monthly_usd,
ending_bank_balance_usd, pay_date, statement_date.
If a field is missing or unclear, use null.`,
});

console.log(response.toString());
```
4) Validate the result before it reaches underwriting
Do not pass raw model output straight into decisioning. Parse it with a schema and reject anything that does not conform.
```typescript
import { z } from "zod";

const ExtractionSchema = z.object({
  borrower_name: z.string().nullable(),
  employer_name: z.string().nullable(),
  gross_pay_monthly_usd: z.number().nullable(),
  net_pay_monthly_usd: z.number().nullable(),
  ending_bank_balance_usd: z.number().nullable(),
  pay_date: z.string().nullable(),
  statement_date: z.string().nullable(),
});

function safeParseExtraction(rawText: string) {
  // Models sometimes wrap JSON in Markdown fences; strip them before parsing.
  const cleaned = rawText.replace(/^```(?:json)?\s*/i, "").replace(/```\s*$/, "").trim();
  const parsed = JSON.parse(cleaned);
  return ExtractionSchema.parse(parsed); // throws on non-conforming output
}

const extracted = safeParseExtraction(response.toString());
console.log(extracted);
```
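When parsing or validation fails, route the output to human review instead of crashing the pipeline. A minimal triage wrapper might look like the sketch below; the status names and the `validate` callback are illustrative, not part of LlamaIndex or zod:

```typescript
// Hypothetical triage wrapper; status names are illustrative.
type TriageResult =
  | { status: "accepted"; value: Record<string, unknown> }
  | { status: "needs_review"; reason: string; rawText: string };

function triageExtraction(
  rawText: string,
  validate: (value: unknown) => boolean,
): TriageResult {
  try {
    const value = JSON.parse(rawText) as Record<string, unknown>;
    if (!validate(value)) {
      // Well-formed JSON that breaks the schema still needs a human.
      return { status: "needs_review", reason: "schema_violation", rawText };
    }
    return { status: "accepted", value };
  } catch {
    // Malformed JSON never reaches underwriting; a reviewer sees it instead.
    return { status: "needs_review", reason: "malformed_json", rawText };
  }
}
```

In practice the `validate` callback would be `ExtractionSchema.safeParse` from the zod schema above, and the `needs_review` branch would push to a review queue rather than return.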
Production Considerations
- Auditability
  - Persist the original document hash, extracted JSON, model version, prompt version, and retrieval parameters.
  - For lending ops and regulators, you need to reconstruct why a field was accepted or rejected.
- Data residency
  - Keep borrower PII in-region if your lending book requires it.
  - If you use external LLM endpoints, verify where data is processed and whether retention is disabled.
- Guardrails
  - Reject outputs that fail schema validation or violate business rules like negative income or future-dated statements.
  - Add human review for low-confidence cases or when multiple documents disagree on identity or income.
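The guardrail rules above run after schema validation, on data that is already well-typed. A sketch, with rule names chosen for illustration rather than taken from any lending regulation:

```typescript
// Business-rule checks that run after schema validation.
type ValidatedExtraction = {
  gross_pay_monthly_usd: number | null;
  net_pay_monthly_usd: number | null;
  statement_date: string | null;
};

function businessRuleViolations(e: ValidatedExtraction, asOf: Date): string[] {
  const violations: string[] = [];
  if (e.gross_pay_monthly_usd !== null && e.gross_pay_monthly_usd < 0) {
    violations.push("negative_gross_income");
  }
  if (
    e.gross_pay_monthly_usd !== null &&
    e.net_pay_monthly_usd !== null &&
    e.net_pay_monthly_usd > e.gross_pay_monthly_usd
  ) {
    // Net pay above gross pay usually means a mis-read field.
    violations.push("net_exceeds_gross");
  }
  if (e.statement_date !== null && new Date(e.statement_date) > asOf) {
    violations.push("future_dated_statement");
  }
  return violations;
}
```

Any non-empty result should block the application from automated scoring and attach the violation list to the audit record.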
- Monitoring
  - Track extraction accuracy by doc type: payslips, bank statements, tax forms, IDs.
  - Monitor null-rate spikes; they usually mean OCR drift, template changes, or prompt regressions.
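Null-rate monitoring per doc type can be a small aggregation over extraction events. In this sketch the event shape and the 20% alert threshold are arbitrary examples; tune the threshold against your own baseline:

```typescript
// Minimal null-rate monitor per doc type.
type ExtractionEvent = {
  docType: string;
  fields: Record<string, unknown | null>;
};

function nullRateByDocType(events: ExtractionEvent[]): Map<string, number> {
  const totals = new Map<string, { nulls: number; fields: number }>();
  for (const event of events) {
    const entry = totals.get(event.docType) ?? { nulls: 0, fields: 0 };
    for (const value of Object.values(event.fields)) {
      entry.fields += 1;
      if (value === null) entry.nulls += 1;
    }
    totals.set(event.docType, entry);
  }
  const rates = new Map<string, number>();
  for (const [docType, { nulls, fields }] of totals) {
    rates.set(docType, fields === 0 ? 0 : nulls / fields);
  }
  return rates;
}

// Flag doc types whose null rate exceeds the alert threshold.
function docTypesOverThreshold(rates: Map<string, number>, threshold = 0.2): string[] {
  return [...rates.entries()]
    .filter(([, rate]) => rate > threshold)
    .map(([docType]) => docType);
}
```

A spike for one doc type but not the others usually points at a template change or OCR regression for that document family, not a model-wide problem.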
Common Pitfalls
- Using one prompt for every document type
  - Payslips and bank statements have different structures.
  - Avoid this by routing documents by type first and using doc-specific extraction prompts.
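Routing can start as a simple classifier plus a prompt table. The filename heuristics and prompt text below are illustrative only; production routing would classify on document content, not filenames:

```typescript
// Doc-type router with per-type prompts; prompt text is illustrative.
const promptsByDocType: Record<string, string> = {
  paystub:
    "Extract employer_name, gross_pay_monthly_usd, net_pay_monthly_usd, pay_date. Return valid JSON; use null for missing fields.",
  bank_statement:
    "Extract account_holder, ending_bank_balance_usd, statement_date. Return valid JSON; use null for missing fields.",
};

function classifyDocType(filename: string): string | null {
  const name = filename.toLowerCase();
  if (name.includes("paystub") || name.includes("payslip")) return "paystub";
  if (name.includes("statement")) return "bank_statement";
  return null; // unknown types go to manual triage, not a generic prompt
}

function promptFor(filename: string): string | null {
  const docType = classifyDocType(filename);
  return docType === null ? null : promptsByDocType[docType];
}
```

Returning `null` for unknown types is deliberate: falling back to a generic prompt silently reintroduces the one-prompt-for-everything pitfall.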
- Skipping schema validation
  - LLMs will happily return malformed JSON or hallucinated values.
  - Always validate with something like `zod` before the output touches underwriting logic.
- Losing source traceability
  - If you merge all docs into one blob without metadata, audit becomes painful.
  - Keep per-document `id_`, filename metadata, and store which source produced each extracted field.
- Ignoring stale or conflicting evidence
  - A three-month-old bank statement may be invalid for policy even if it parses cleanly.
  - Enforce recency checks and compare cross-document values before you score the application.
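Recency and cross-document consistency checks are a few lines each. In this sketch the 90-day window is an example policy threshold, not a regulatory value:

```typescript
// Example recency check; maxAgeDays is a policy knob, 90 is illustrative.
function isStale(dateIso: string, asOf: Date, maxAgeDays = 90): boolean {
  const ageMs = asOf.getTime() - new Date(dateIso).getTime();
  return ageMs > maxAgeDays * 24 * 60 * 60 * 1000;
}

// Compare names across documents after light normalization so "Jane Doe"
// and "JANE  DOE" do not raise a false conflict.
function namesConflict(a: string, b: string): boolean {
  const norm = (s: string) => s.trim().toLowerCase().replace(/\s+/g, " ");
  return norm(a) !== norm(b);
}
```

Run these checks on schema-validated output; a stale statement or a name conflict between the paystub and the bank statement should route the application to human review before scoring.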
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit