How to Build a Document Extraction Agent Using LlamaIndex in TypeScript for Wealth Management

By Cyprian Aarons · Updated 2026-04-21
document-extraction · llamaindex · typescript · wealth-management

A document extraction agent for wealth management reads client statements, account opening packets, tax forms, IPS documents, and advisor notes, then turns them into structured data you can validate and route into downstream systems. It matters because most operational risk in wealth workflows comes from manual rekeying, missed fields, and inconsistent interpretation of unstructured documents.

Architecture

  • Document ingestion layer

    • Pull PDFs, scans, and text files from secure storage or an internal upload service.
    • Keep source metadata: client ID, document type, timestamp, region, and retention policy.
  • Text extraction and chunking

    • Use LlamaIndex loaders to read the document content.
    • Split long documents into chunks so the extractor can handle statements and disclosures reliably.
  • Extraction schema

    • Define the fields you need: account number, beneficiary names, taxable events, contribution amounts, advisor comments, and risk profile changes.
    • Keep the schema explicit so compliance can review it.
  • LLM-powered parser

    • Use a LlamaIndex OpenAI LLM to extract structured output from chunks.
    • Constrain output to JSON so downstream systems do not depend on free-form text.
  • Validation and policy layer

    • Verify extracted values against business rules; a small sketch of such a rule follows this list.
    • Flag missing signatures, mismatched names, suspicious transfers, or jurisdiction-specific issues.
  • Audit trail and persistence

    • Store raw text references, model version, prompt version, extracted JSON, and validation results.
    • Wealth management teams need traceability for reviews and regulatory audits.
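
To make the validation and policy layer concrete, here is a minimal sketch of what two business rules might look like. The payload shape mirrors the extraction schema defined later in this guide; the rule names and confidence threshold are illustrative, not prescriptive.

type RuleResult = { rule: string; passed: boolean; detail?: string };

interface ExtractedPayload {
  accountNumber?: string;
  keyFields: { field: string; value: string; confidence: number }[];
}

// Rule: an account number must be present for account-level documents.
function checkAccountNumberPresent(payload: ExtractedPayload): RuleResult {
  const present = Boolean(payload.accountNumber?.trim());
  return {
    rule: "account-number-present",
    passed: present,
    detail: present ? undefined : "No account number extracted",
  };
}

// Rule: every extracted field must meet a minimum confidence threshold.
function checkFieldConfidence(payload: ExtractedPayload, threshold = 0.8): RuleResult {
  const weak = payload.keyFields.filter((f) => f.confidence < threshold);
  return {
    rule: "field-confidence",
    passed: weak.length === 0,
    detail:
      weak.length > 0
        ? `Low-confidence fields: ${weak.map((f) => f.field).join(", ")}`
        : undefined,
  };
}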

Implementation

1) Install dependencies and set up the project

You need LlamaIndex for TypeScript plus a runtime that can load PDFs or text files. For production, keep secrets in environment variables and never hardcode API keys.

npm install llamaindex zod
npm install -D typescript tsx @types/node

Set your OpenAI key:

export OPENAI_API_KEY="your-key"
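
Since the key lives in an environment variable, it helps to fail fast at startup if it is missing rather than partway through a batch. A minimal check, assuming a Node.js runtime:

// Fail fast if the key is missing, instead of failing mid-batch.
const apiKey = process.env.OPENAI_API_KEY;
if (!apiKey) {
  throw new Error("OPENAI_API_KEY is not set");
}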

2) Load the document with LlamaIndex

Use SimpleDirectoryReader for files on disk. In a real wealth management system this is often a staging folder behind an upload service or an S3 sync job.

import { SimpleDirectoryReader } from "llamaindex";

async function loadDocuments() {
  const reader = new SimpleDirectoryReader();
  const docs = await reader.loadData({
    directoryPath: "./client-documents",
  });

  return docs;
}

If your input is a mix of PDFs and OCR text exports, normalize them before extraction. The agent should operate on clean text whenever possible because statement layouts vary widely across custodians.
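
What "clean text" means depends on your custodians and OCR tooling, but a rough normalization pass might look like the sketch below; the specific replacements are assumptions you should adjust for your inputs.

// A rough normalization pass before extraction. The exact replacements are
// assumptions; tune them to your custodians and OCR tooling.
function normalizeText(raw: string): string {
  return raw
    .replace(/\r\n/g, "\n") // unify line endings
    .replace(/\f/g, "\n") // drop form feeds left over from PDF page breaks
    .replace(/[ \t]{2,}/g, " ") // collapse runs of spaces and tabs
    .replace(/\n{3,}/g, "\n\n") // limit consecutive blank lines
    .trim();
}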

3) Build a structured extraction chain

For wealth workflows you want deterministic output. Use the LlamaIndex OpenAI LLM with a strict JSON prompt and a schema validator like zod so the result is machine-safe.

import { OpenAI } from "llamaindex";
import { z } from "zod";

const ExtractionSchema = z.object({
  clientName: z.string(),
  accountNumber: z.string().optional(),
  documentType: z.enum(["statement", "kyc", "tax_form", "ips", "other"]),
  advisorName: z.string().optional(),
  effectiveDate: z.string().optional(),
  keyFields: z.array(
    z.object({
      field: z.string(),
      value: z.string(),
      confidence: z.number().min(0).max(1),
    })
  ),
});

type ExtractionResult = z.infer<typeof ExtractionSchema>;

const llm = new OpenAI({
  model: "gpt-4o-mini",
  temperature: 0, // keep extraction output as deterministic as the model allows
});

async function extractFromText(text: string): Promise<ExtractionResult> {
  const prompt = `
You are extracting structured data from a wealth management document.
Return only valid JSON matching this shape:
{
  "clientName": string,
  "accountNumber"?: string,
  "documentType": "statement" | "kye" | "tax_form" | "ips" | "other",
  "advisorName"?: string,
  "effectiveDate"?: string,
  "keyFields": [
    { "field": string, "value": string, "confidence": number }
  ]
}

Document:
${text}
`;

  const response = await llm.complete({ prompt });
  const parsed = JSON.parse(response.text);
  return ExtractionSchema.parse(parsed);
}

This pattern is simple but production-friendly. The model returns JSON only, then zod enforces the contract before anything hits your CRM or portfolio system.
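
If you prefer to route schema failures to review instead of throwing, zod's safeParse can turn them into a result you can branch on. This is a sketch of that pattern, separate from extractFromText above, and it reuses the ExtractionSchema already defined.

// A validation gate that reports schema failures instead of throwing,
// so a bad document can be routed to review rather than stopping a batch.
type GateResult =
  | { ok: true; data: ExtractionResult }
  | { ok: false; issues: string[] };

function validateExtraction(raw: unknown): GateResult {
  const result = ExtractionSchema.safeParse(raw);
  if (result.success) {
    return { ok: true, data: result.data };
  }
  return {
    ok: false,
    issues: result.error.issues.map((i) => `${i.path.join(".")}: ${i.message}`),
  };
}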

4) Run extraction over each document and persist results

Treat each file as an auditable unit of work. Save both the raw source reference and the extracted payload so compliance can reconstruct how a field was produced.

import fs from "node:fs/promises";
import path from "node:path";
import { Document } from "llamaindex";

async function processDocuments() {
  const docs = await loadDocuments();

  // Ensure the output directory exists before writing audit records.
  await fs.mkdir("./out", { recursive: true });

  const results: Array<{
    sourceId: string;
    metadata: Record<string, unknown>;
    extracted: ExtractionResult;
    model: string;
    extractedAt: string;
  }> = [];

  for (const doc of docs as Document[]) {
    const text = typeof doc.text === "string" ? doc.text : String(doc.text ?? "");
    const extracted = await extractFromText(text);

    const record = {
      sourceId: doc.id_,
      metadata: doc.metadata ?? {},
      extracted,
      model: "gpt-4o-mini",
      extractedAt: new Date().toISOString(),
    };

    results.push(record);
    await fs.writeFile(
      path.join("./out", `${doc.id_}.json`),
      JSON.stringify(record, null, 2)
    );
  }

  return results;
}

processDocuments().then(console.log).catch(console.error);

In practice you would add retries around transient model failures and queue jobs instead of processing synchronously. For regulated workflows that also means storing region-specific data in region-specific infrastructure.
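
As a sketch, a retry helper for transient failures could look like this; the attempt count and backoff values are illustrative and should be tuned to your rate limits.

// A minimal retry helper for transient model or network failures.
async function withRetries<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      const delayMs = 500 * 2 ** attempt; // exponential backoff
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// Usage: const extracted = await withRetries(() => extractFromText(text));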

Production Considerations

  • Data residency

    • Keep EU client data in EU-hosted infrastructure if your policy requires it.
    • Do not send documents across regions just to simplify processing.
  • Auditability

    • Log document hash, prompt version, model name, extraction timestamp, and validation outcome; a short sketch of such an audit entry follows this list.
    • Store enough provenance to explain why an extracted value was accepted or rejected.
  • Guardrails

    • Block extraction results that fail schema validation or violate business rules.
    • Add human review for high-risk fields like beneficiaries, transfer instructions, trust language, or tax elections.
  • Monitoring

    • Track field-level accuracy by document type.
    • Watch for drift when custodians change statement formats or when OCR quality drops.
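
Here is a sketch of an audit entry that includes a document hash. The promptVersion and validationOutcome fields are illustrative names, not part of the code earlier in this guide.

import { createHash } from "node:crypto";

// Build a provenance record for one extraction run.
function buildAuditEntry(rawText: string, extracted: unknown) {
  return {
    documentSha256: createHash("sha256").update(rawText).digest("hex"),
    model: "gpt-4o-mini",
    promptVersion: "v1",
    extractedAt: new Date().toISOString(),
    validationOutcome: "pending-review",
    extracted,
  };
}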

Common Pitfalls

  • Using free-form output instead of strict JSON

    This makes downstream automation brittle. Always parse into a schema with zod or equivalent validation before persisting anything.

  • Ignoring document type differences

    A tax form does not behave like an IPS or monthly statement. Route documents by type first so prompts stay narrow and extraction quality stays stable.

  • Skipping audit metadata

    If you cannot show which source produced which field, compliance will push back fast. Store source IDs, hashes, timestamps, model versioning, and validation status every time.

  • Assuming one-pass extraction is enough

    Wealth documents often contain tables, footnotes, and exceptions hidden in dense legal language. Use a second pass for critical fields or route low-confidence outputs to human review; a short routing sketch follows.
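
One way to express that routing rule, assuming the keyFields shape from the extraction schema; the high-risk field list and confidence threshold are placeholders you would agree with compliance.

// Route low-confidence or high-risk fields to human review.
const HIGH_RISK_FIELDS = ["beneficiary", "transfer", "trust", "tax_election"];

function needsHumanReview(
  keyFields: { field: string; value: string; confidence: number }[],
  threshold = 0.85
): boolean {
  return keyFields.some(
    (f) =>
      f.confidence < threshold ||
      HIGH_RISK_FIELDS.some((risk) => f.field.toLowerCase().includes(risk))
  );
}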


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
