How to Build a Document Extraction Agent Using LangChain in TypeScript for Wealth Management

By Cyprian Aarons
Updated 2026-04-21
Tags: document-extraction, langchain, typescript, wealth-management

A document extraction agent for wealth management takes unstructured files like account opening forms, IPS documents, statements, KYC packs, and transfer instructions, then turns them into structured data your downstream systems can trust. That matters because the business is full of high-value workflows where a missed field, wrong beneficiary name, or incomplete compliance check creates operational risk, audit pain, and client friction.

Architecture

A production agent for this use case needs a narrow, auditable pipeline:

  • Document ingestion layer

    • Accept PDFs, scans, and email attachments.
    • Normalize inputs before extraction.
    • Preserve source metadata like filename, upload time, and client ID.
  • Text extraction layer

    • Use OCR for scanned docs.
    • Extract text with page boundaries intact.
    • Keep raw text alongside normalized text for audit.
  • LLM extraction chain

    • Use LangChain to map document text into a strict schema.
    • Enforce structured output for fields like account number, tax residency, beneficial owner, and advisor notes.
  • Validation and policy layer

    • Validate required fields.
    • Flag missing compliance items.
    • Reject or route ambiguous outputs to human review.
  • Audit and storage layer

    • Store extracted JSON plus source references.
    • Persist model version, prompt version, and timestamps.
    • Keep an immutable trail for compliance review.
  • Human-in-the-loop review queue

    • Send low-confidence or policy-sensitive cases to operations staff.
    • Allow corrections before downstream booking or CRM updates.
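As a rough sketch of the first two layers above, the snippet below shows one way to carry source metadata and keep raw text next to normalized text, page by page. The type names (SourceMetadata, PageText, IngestedDocument) and the ingest helper are hypothetical and not part of LangChain; the whitespace-collapsing rule is only an assumed normalization.

```typescript
// Hypothetical shapes for the ingestion and text extraction layers.
interface SourceMetadata {
  filename: string;
  uploadedAt: string; // ISO 8601 timestamp
  clientId: string;
}

interface PageText {
  pageNumber: number; // 1-based, so it matches what reviewers see in a PDF viewer
  rawText: string; // untouched extractor or OCR output, kept for audit
  normalizedText: string; // what is sent to the extraction chain
}

interface IngestedDocument {
  source: SourceMetadata;
  pages: PageText[];
}

// Assumed normalization: collapse runs of whitespace and trim the ends.
function normalizeText(raw: string): string {
  return raw.replace(/\s+/g, " ").trim();
}

// Builds one record per page so page boundaries survive normalization.
function ingest(source: SourceMetadata, rawPages: string[]): IngestedDocument {
  return {
    source,
    pages: rawPages.map((rawText, index) => ({
      pageNumber: index + 1,
      rawText,
      normalizedText: normalizeText(rawText),
    })),
  };
}
```

Keeping both text variants on the same record means a reviewer can always compare what the model saw with what the extractor produced.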

Implementation

1. Install the right packages

Use LangChain’s TypeScript packages plus a PDF loader. For wealth management you want deterministic extraction, so avoid free-form chat responses.

npm install langchain @langchain/openai @langchain/core @langchain/community pdf-parse zod

Set your environment variables:

export OPENAI_API_KEY="your-key"

2. Define a strict schema for the extracted fields

In wealth management, schema design is not optional. If you do not constrain the output, you will end up parsing prose instead of records.

import { z } from "zod";

export const WealthDocSchema = z.object({
  documentType: z.enum([
    "account_opening",
    "kyc",
    "statement",
    "transfer_instruction",
    "investment_policy_statement",
    "other",
  ]),
  clientName: z.string().optional(),
  accountNumber: z.string().optional(),
  advisorName: z.string().optional(),
  taxResidency: z.array(z.string()).default([]),
  beneficialOwners: z.array(
    z.object({
      name: z.string(),
      ownershipPercent: z.number().optional(),
    })
  ).default([]),
  keyDates: z.array(
    z.object({
      label: z.string(),
      value: z.string(),
    })
  ).default([]),
  complianceFlags: z.array(z.string()).default([]),
});
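To see how the schema behaves on sparse or invalid input, here is a small sketch using Zod's safeParse. MiniWealthDocSchema is a trimmed, illustrative copy of the schema above so the snippet runs on its own; it is not a replacement for WealthDocSchema.

```typescript
import { z } from "zod";

// Trimmed, illustrative copy of the schema so this snippet is self-contained.
const MiniWealthDocSchema = z.object({
  documentType: z.enum(["account_opening", "kyc", "other"]),
  clientName: z.string().optional(),
  taxResidency: z.array(z.string()).default([]),
  complianceFlags: z.array(z.string()).default([]),
});

// A sparse record: only the required field is present.
const sparse = MiniWealthDocSchema.safeParse({ documentType: "kyc" });

// A record whose document type is outside the enum.
const invalid = MiniWealthDocSchema.safeParse({ documentType: "invoice" });

if (sparse.success) {
  // Array fields fall back to [] through .default([]).
  console.log(sparse.data.taxResidency.length); // prints 0
}
console.log(invalid.success); // prints false
```

Because optional strings stay undefined while arrays default to empty, downstream rules can test array length directly but must check optional fields for presence.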

3. Build the LangChain extraction chain

This pattern uses PDFLoader to load the file and ChatOpenAI with withStructuredOutput() to force valid JSON matching your Zod schema.

import fs from "node:fs";
import path from "node:path";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { ChatOpenAI } from "@langchain/openai";
import { SystemMessage } from "@langchain/core/messages";
import { WealthDocSchema } from "./schema";

async function loadPdfText(filePath: string) {
  const loader = new PDFLoader(filePath);
  const docs = await loader.load();
  return docs.map((d) => d.pageContent).join("\n\n");
}

async function extractWealthDocument(filePath: string) {
  const text = await loadPdfText(filePath);

  const model = new ChatOpenAI({
    model: "gpt-4o-mini",
    temperature: 0,
    apiKey: process.env.OPENAI_API_KEY,
  });

  const extractor = model.withStructuredOutput(WealthDocSchema);

  const result = await extractor.invoke([
    new SystemMessage(
      [
        "You extract structured data from wealth management documents.",
        "Return only fields supported by the schema.",
        "If a field is missing, leave it empty or omit it.",
        "Do not infer facts that are not explicitly present.",
        "Flag any compliance concerns in complianceFlags.",
      ].join(" ")
    ),
    {
      role: "user",
      content: `Extract the document data from this text:\n\n${text}`,
    },
  ]);

  return result;
}

(async () => {
  const filePath = path.resolve("./sample-account-opening.pdf");
  
  if (!fs.existsSync(filePath)) {
    throw new Error(`File not found: ${filePath}`);
  }

  const extracted = await extractWealthDocument(filePath);
  
  console.log(JSON.stringify(extracted, null, 2));
})();

The important part here is withStructuredOutput(WealthDocSchema). That gives you typed output and reduces brittle post-processing logic.

4. Add validation and routing for compliance-sensitive cases

Wealth workflows need deterministic escalation rules. If the document is missing tax residency or contains suspicious transfer language, do not auto-book it.

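import { z } from "zod";
import { WealthDocSchema } from "./schema";

type WealthDoc = z.infer<typeof WealthDocSchema>;

type ReviewDecision =
  | { status: "approved" }
  | { status: "needs_review"; reasons: string[] };

function decideReview(extracted: WealthDoc): ReviewDecision {
  const reasons: string[] = [];

  if (!extracted.clientName) reasons.push("Missing client name");
  // Every recognized document type is expected to carry an account number.
  if (!extracted.accountNumber && extracted.documentType !== "other") {
    reasons.push("Missing account number");
  }
  // Any compliance flag raised during extraction blocks auto-approval.
  if ((extracted.complianceFlags ?? []).length > 0) {
    reasons.push("Compliance flags present");
  }
  if ((extracted.taxResidency ?? []).length === 0) {
    reasons.push("Missing tax residency");
  }

  // Approve only when no rule fired; otherwise send the record to review.
  return reasons.length === 0
    ? { status: "approved" }
    : { status: "needs_review", reasons };
}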

Use that decision to route records into your CRM, document management system, or manual review queue. In practice this should be backed by durable storage and an immutable audit log.
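A minimal sketch of that routing step follows, assuming two hypothetical destinations ("downstream_sync" and "manual_review") and a hypothetical AuditEntry shape. The ReviewDecision type is repeated only so the snippet stands alone. Freezing the entry prevents accidental mutation in process; it does not replace append-only storage.

```typescript
type ReviewDecision =
  | { status: "approved" }
  | { status: "needs_review"; reasons: string[] };

// Hypothetical destinations; substitute your own queue or system names.
type Destination = "downstream_sync" | "manual_review";

interface AuditEntry {
  documentId: string;
  destination: Destination;
  reasons: readonly string[];
  modelName: string;
  promptVersion: string;
  decidedAt: string; // ISO 8601 timestamp
}

// Approved records go downstream; everything else goes to a human.
function routeRecord(decision: ReviewDecision): Destination {
  return decision.status === "approved" ? "downstream_sync" : "manual_review";
}

function buildAuditEntry(
  documentId: string,
  decision: ReviewDecision,
  modelName: string,
  promptVersion: string,
  now: Date = new Date()
): Readonly<AuditEntry> {
  // Copy the reasons so later changes to the decision cannot alter the entry.
  const reasons = decision.status === "needs_review" ? [...decision.reasons] : [];
  return Object.freeze({
    documentId,
    destination: routeRecord(decision),
    reasons: Object.freeze(reasons),
    modelName,
    promptVersion,
    decidedAt: now.toISOString(),
  });
}
```

Passing the clock in as a parameter keeps the function deterministic under test, which helps when the audit trail itself needs to be verified.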

Production Considerations

  • Data residency
    • Keep document processing in-region where required by client agreements or local regulation.
  • Auditability
    • Store raw input text, extracted JSON, prompt version, model name, timestamp, and operator overrides.
  • Guardrails
    • Reject auto-processing when mandatory KYC/AML fields are missing.
  • Monitoring
    • Track extraction accuracy by document type and field-level failure rates.

| Concern | What to log | Why it matters |
| --- | --- | --- |
| Compliance | Missing KYC fields, suspicious transfer terms | Prevents bad onboarding decisions |
| Data residency | Region of processing and storage | Supports regulatory obligations |
| Audit trail | Prompt version + model version + output | Makes reviews defensible |
| Human review | Override reason + reviewer ID | Required for control evidence |
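
For the monitoring point above, one possible way to compute field-level failure rates by document type is sketched below. It assumes operator corrections are used as the failure signal; the ReviewOutcome shape and the fieldFailureRates helper are hypothetical.

```typescript
interface ReviewOutcome {
  documentType: string;
  // Fields an operator had to correct during review (assumed failure signal).
  correctedFields: string[];
}

// For each document type, returns the share of documents in which each field was corrected.
function fieldFailureRates(
  outcomes: ReviewOutcome[]
): Record<string, Record<string, number>> {
  const totals: Record<string, number> = {};
  const failures: Record<string, Record<string, number>> = {};

  for (const outcome of outcomes) {
    const docType = outcome.documentType;
    totals[docType] = (totals[docType] ?? 0) + 1;
    const perField = failures[docType] ?? (failures[docType] = {});
    // De-duplicate so a field listed twice counts once per document.
    for (const field of Array.from(new Set(outcome.correctedFields))) {
      perField[field] = (perField[field] ?? 0) + 1;
    }
  }

  const rates: Record<string, Record<string, number>> = {};
  for (const docType of Object.keys(failures)) {
    rates[docType] = {};
    for (const field of Object.keys(failures[docType])) {
      rates[docType][field] = failures[docType][field] / totals[docType];
    }
  }
  return rates;
}
```

A rising rate for one field on one document type usually points at a template change or a prompt regression rather than a general model problem.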

Common Pitfalls

  • Using free-form generation instead of structured output

If you ask for “a summary”, you will get prose whose shape changes from run to run and that you then have to parse. Use withStructuredOutput() with Zod so the model must conform to your schema.

  • Ignoring page-level provenance

Wealth documents often need line-of-business review later. Keep page references or source offsets so ops teams can verify where each field came from.
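
One simple way to recover page references after extraction is to search the per-page text for each extracted value. The sketch below assumes you kept text per page; PageText and findSourcePages are hypothetical names, and plain substring matching will miss values the model reformatted (for example dates or numbers).

```typescript
interface PageText {
  pageNumber: number; // 1-based
  text: string;
}

// Lowercase and collapse whitespace so line breaks in the PDF do not hide a match.
function canonical(value: string): string {
  return value.toLowerCase().replace(/\s+/g, " ").trim();
}

// Returns the page numbers whose text contains the extracted value.
function findSourcePages(pages: PageText[], extractedValue: string): number[] {
  const needle = canonical(extractedValue);
  if (needle.length === 0) return []; // an empty value would match every page
  return pages
    .filter((page) => canonical(page.text).includes(needle))
    .map((page) => page.pageNumber);
}
```

An empty result is itself useful: a field with no matching page is a candidate for human review.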

  • Auto-approving low-confidence extractions

Do not send extracted data straight into booking or CRM sync without review rules. Route incomplete KYC packs, unusual transfer instructions, and contradictory identity data to humans.

  • Skipping regional controls

If your firm handles cross-border clients, make sure the deployment respects data residency requirements before any document leaves the approved region.
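
A residency check can be as small as a lookup performed before any document is sent for processing. The sketch below is illustrative only: the ResidencyPolicy shape, the jurisdiction codes, and the region names are all hypothetical, and real values would come from legal and compliance.

```typescript
// Hypothetical policy: for each client jurisdiction, the regions where
// processing is permitted.
type ResidencyPolicy = Record<string, readonly string[]>;

// Example values only; these region names are made up for illustration.
const examplePolicy: ResidencyPolicy = {
  CH: ["ch-zurich"],
  DE: ["eu-frankfurt", "eu-dublin"],
};

function isProcessingAllowed(
  policy: ResidencyPolicy,
  clientJurisdiction: string,
  processingRegion: string
): boolean {
  const allowed = policy[clientJurisdiction];
  // Jurisdictions missing from the policy are denied by default.
  return allowed !== undefined && allowed.includes(processingRegion);
}
```

Denying unknown jurisdictions by default means a gap in the policy blocks processing instead of silently allowing it.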



By Cyprian Aarons, AI Consultant at Topiax.
