How to Build a Document Extraction Agent Using LangChain in TypeScript for Wealth Management
A document extraction agent for wealth management takes unstructured files like account opening forms, IPS documents, statements, KYC packs, and transfer instructions, then turns them into structured data your downstream systems can trust. That matters because the business is full of high-value workflows where a missed field, wrong beneficiary name, or incomplete compliance check creates operational risk, audit pain, and client friction.
Architecture
A production agent for this use case needs a narrow, auditable pipeline:
- Document ingestion layer
  - Accept PDFs, scans, and email attachments.
  - Normalize inputs before extraction.
  - Preserve source metadata like filename, upload time, and client ID.
- Text extraction layer
  - Use OCR for scanned docs.
  - Extract text with page boundaries intact.
  - Keep raw text alongside normalized text for audit.
- LLM extraction chain
  - Use LangChain to map document text into a strict schema.
  - Enforce structured output for fields like account number, tax residency, beneficial owner, and advisor notes.
- Validation and policy layer
  - Validate required fields.
  - Flag missing compliance items.
  - Reject or route ambiguous outputs to human review.
- Audit and storage layer
  - Store extracted JSON plus source references.
  - Persist model version, prompt version, and timestamps.
  - Keep an immutable trail for compliance review.
- Human-in-the-loop review queue
  - Send low-confidence or policy-sensitive cases to operations staff.
  - Allow corrections before downstream booking or CRM updates.
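The ingestion and text extraction layers above can be sketched as a small envelope that carries source metadata together with per-page raw and normalized text. This is a minimal sketch, not a prescribed design: `SourceMetadata`, `PageText`, `DocumentEnvelope`, `buildEnvelope`, and `toPromptText` are hypothetical names, and the whitespace-only normalization is an assumption.

```typescript
// Hypothetical envelope types; the field names are illustrative, not from a library.
interface SourceMetadata {
  filename: string;
  uploadedAt: string; // ISO 8601 timestamp
  clientId: string;
}

interface PageText {
  page: number; // 1-based page number
  raw: string; // text exactly as extracted or OCR'd, kept for audit
  normalized: string; // cleaned text that is sent to the model
}

interface DocumentEnvelope {
  metadata: SourceMetadata;
  pages: PageText[];
}

// Assumed normalization: collapse whitespace runs and trim.
function normalizeText(raw: string): string {
  return raw.replace(/\s+/g, " ").trim();
}

// Build the envelope from one raw string per page, keeping raw text untouched.
function buildEnvelope(metadata: SourceMetadata, rawPages: string[]): DocumentEnvelope {
  return {
    metadata,
    pages: rawPages.map((raw, index) => ({
      page: index + 1,
      raw,
      normalized: normalizeText(raw),
    })),
  };
}

// Join pages with explicit markers so page boundaries survive into the prompt.
function toPromptText(envelope: DocumentEnvelope): string {
  return envelope.pages
    .map((p) => `[page ${p.page}]\n${p.normalized}`)
    .join("\n\n");
}
```

Keeping `raw` next to `normalized` means a reviewer can always compare what the model saw against what the extractor produced.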
Implementation
1. Install the right packages
Use LangChain’s TypeScript packages plus a PDF loader. For wealth management you want deterministic extraction, so avoid free-form chat responses.
npm install langchain @langchain/openai @langchain/core @langchain/community pdf-parse zod
Set your environment variables:
export OPENAI_API_KEY="your-key"
2. Define a strict schema for the extracted fields
In wealth management, schema design is not optional. If you do not constrain the output, you will end up parsing prose instead of records.
import { z } from "zod";

// Strict schema for the fields the extractor is allowed to return.
export const WealthDocSchema = z.object({
  documentType: z.enum([
    "account_opening",
    "kyc",
    "statement",
    "transfer_instruction",
    "investment_policy_statement",
    "other",
  ]),
  clientName: z.string().optional(),
  accountNumber: z.string().optional(),
  advisorName: z.string().optional(),
  taxResidency: z.array(z.string()).default([]),
  beneficialOwners: z.array(
    z.object({
      name: z.string(),
      ownershipPercent: z.number().optional(),
    })
  ).default([]),
  keyDates: z.array(
    z.object({
      label: z.string(),
      value: z.string(),
    })
  ).default([]),
  complianceFlags: z.array(z.string()).default([]),
});
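The schema constrains the shape of each field but not the relationships between fields. As one illustration, a plain TypeScript check can catch beneficial owner entries whose percentages are out of range or sum to more than 100. This is a sketch under assumptions: `checkBeneficialOwners` is a hypothetical helper, and the 0 to 100 range and the sum rule are assumed business rules, not something the schema above states.

```typescript
// Mirrors the beneficialOwners item shape from the schema above.
interface BeneficialOwner {
  name: string;
  ownershipPercent?: number;
}

// Hypothetical helper: returns human-readable problems, or an empty array when none are found.
function checkBeneficialOwners(owners: BeneficialOwner[]): string[] {
  const problems: string[] = [];
  let total = 0;

  for (const owner of owners) {
    if (owner.name.trim() === "") {
      problems.push("Beneficial owner with empty name");
    }
    const pct = owner.ownershipPercent;
    if (pct === undefined) continue; // percentage is optional in the schema
    if (!isFinite(pct) || pct < 0 || pct > 100) {
      problems.push(`Ownership percent out of range for ${owner.name}: ${pct}`);
      continue; // do not add invalid values to the total
    }
    total += pct;
  }

  // Small tolerance so floating-point rounding does not trigger a false alarm.
  if (total > 100 + 1e-9) {
    problems.push(`Ownership percentages sum to ${total}, above 100`);
  }
  return problems;
}
```

Any returned problem can be treated as a reason to send the document to human review.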
3. Build the LangChain extraction chain
This pattern uses PDFLoader to load the file and ChatOpenAI with withStructuredOutput() to force valid JSON matching your Zod schema.
import fs from "node:fs";
import path from "node:path";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";
import { WealthDocSchema } from "./schema";

// Load the PDF and join the text of all loaded documents into one string.
async function loadPdfText(filePath: string): Promise<string> {
  const loader = new PDFLoader(filePath);
  const docs = await loader.load();
  return docs.map((d) => d.pageContent).join("\n\n");
}

async function extractWealthDocument(filePath: string) {
  const text = await loadPdfText(filePath);

  // Temperature 0 keeps extraction as repeatable as the model allows.
  const model = new ChatOpenAI({
    model: "gpt-4o-mini",
    temperature: 0,
    apiKey: process.env.OPENAI_API_KEY,
  });

  // Bind the Zod schema so the model response is parsed into that shape.
  const extractor = model.withStructuredOutput(WealthDocSchema);

  const result = await extractor.invoke([
    new SystemMessage(
      [
        "You extract structured data from wealth management documents.",
        "Return only fields supported by the schema.",
        "If a field is missing, leave it empty or omit it.",
        "Do not infer facts that are not explicitly present.",
        "Flag any compliance concerns in complianceFlags.",
      ].join(" ")
    ),
    new HumanMessage(`Extract the document data from this text:\n\n${text}`),
  ]);

  return result;
}

(async () => {
  const filePath = path.resolve("./sample-account-opening.pdf");
  if (!fs.existsSync(filePath)) {
    throw new Error(`File not found: ${filePath}`);
  }
  const extracted = await extractWealthDocument(filePath);
  console.log(JSON.stringify(extracted, null, 2));
})().catch((err) => {
  // Report the failure and exit non-zero instead of leaving an unhandled rejection.
  console.error(err);
  process.exitCode = 1;
});
The important part here is withStructuredOutput(WealthDocSchema). That gives you typed output and reduces brittle post-processing logic.
4. Add validation and routing for compliance-sensitive cases
Wealth workflows need deterministic escalation rules. If tax residency is missing or the document contains suspicious transfer language, do not auto-book it.
import type { z } from "zod";
import { WealthDocSchema } from "./schema";

type WealthDoc = z.infer<typeof WealthDocSchema>;

type ReviewDecision =
  | { status: "approved" }
  | { status: "needs_review"; reasons: string[] };

function decideReview(extracted: WealthDoc): ReviewDecision {
  const reasons: string[] = [];

  if (!extracted.clientName) reasons.push("Missing client name");
  if (!extracted.accountNumber && extracted.documentType !== "other") {
    reasons.push("Missing account number");
  }
  // Any compliance flag raised during extraction forces a review.
  if ((extracted.complianceFlags ?? []).length > 0) {
    reasons.push("Compliance flags present");
  }
  if ((extracted.taxResidency ?? []).length === 0) {
    reasons.push("Missing tax residency");
  }

  // Approve only when no reason was collected.
  return reasons.length === 0
    ? { status: "approved" }
    : { status: "needs_review", reasons };
}
Use that decision to route records into your CRM, document management system, or manual review queue. In practice this should be backed by durable storage and an immutable audit log.
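A minimal sketch of that routing step, assuming two destinations: `Destination`, `RoutedRecord`, and `routeRecord` are hypothetical names, and a real system would write the result to a queue or database rather than return it.

```typescript
// Same shape as the ReviewDecision type used in the review step.
type ReviewDecision =
  | { status: "approved" }
  | { status: "needs_review"; reasons: string[] };

// Hypothetical destinations; replace with your own systems.
type Destination = "crm_sync" | "review_queue";

interface RoutedRecord {
  documentId: string;
  destination: Destination;
  reasons: string[]; // empty for approved records
  routedAt: string; // ISO 8601 timestamp
}

// Map a review decision to a destination. Only approved records go downstream.
function routeRecord(
  documentId: string,
  decision: ReviewDecision,
  now: Date = new Date()
): RoutedRecord {
  if (decision.status === "approved") {
    return {
      documentId,
      destination: "crm_sync",
      reasons: [],
      routedAt: now.toISOString(),
    };
  }
  return {
    documentId,
    destination: "review_queue",
    reasons: decision.reasons,
    routedAt: now.toISOString(),
  };
}
```

Passing `now` as a parameter keeps the function deterministic in tests and makes the routing timestamp explicit in the audit trail.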
Production Considerations
- Data residency
  - Keep document processing in-region where required by client agreements or local regulation.
- Auditability
  - Store raw input text, extracted JSON, prompt version, model name, timestamp, and operator overrides.
- Guardrails
  - Reject auto-processing when mandatory KYC/AML fields are missing.
- Monitoring
  - Track extraction accuracy by document type and field-level failure rates.
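For the monitoring item above, field-level failure rates can be computed from reviewer outcomes. The sketch below assumes each reviewed field is recorded as correct or incorrect; `FieldOutcome` and `fieldFailureRates` are hypothetical names, not part of LangChain.

```typescript
// One reviewed field: whether the extracted value matched what a reviewer confirmed.
interface FieldOutcome {
  documentType: string;
  field: string;
  correct: boolean;
}

// Returns the failure rate (0 to 1) keyed by "documentType.field".
function fieldFailureRates(outcomes: FieldOutcome[]): Record<string, number> {
  const counts: Record<string, { total: number; failed: number }> = {};

  for (const o of outcomes) {
    const key = `${o.documentType}.${o.field}`;
    const entry = counts[key] ?? { total: 0, failed: 0 };
    entry.total += 1;
    if (!o.correct) entry.failed += 1;
    counts[key] = entry;
  }

  const rates: Record<string, number> = {};
  for (const key of Object.keys(counts)) {
    const entry = counts[key];
    if (entry) rates[key] = entry.failed / entry.total;
  }
  return rates;
}
```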
| Concern | What to log | Why it matters |
|---|---|---|
| Compliance | Missing KYC fields, suspicious transfer terms | Prevents bad onboarding decisions |
| Data residency | Region of processing and storage | Supports regulatory obligations |
| Audit trail | Prompt version + model version + output | Makes reviews defensible |
| Human review | Override reason + reviewer ID | Required for control evidence |
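The items in the table can be captured in a single audit record per extraction. The sketch below is an in-memory, append-only log that freezes each entry on write. It only illustrates the record shape and the no-mutation rule: `AuditEntry`, `OperatorOverride`, and `AuditLog` are hypothetical names, and a production trail would need durable, tamper-evident storage, which this does not provide.

```typescript
interface OperatorOverride {
  reviewerId: string;
  reason: string;
}

// Hypothetical audit record covering the fields listed in the table above.
interface AuditEntry {
  documentId: string;
  rawText: string;
  extractedJson: string; // serialized extraction output, stored verbatim
  promptVersion: string;
  modelName: string;
  recordedAt: string; // ISO 8601 timestamp
  override: OperatorOverride | null;
}

class AuditLog {
  private readonly entries: AuditEntry[] = [];

  // Store a frozen copy so neither the caller nor later code can change it.
  append(entry: AuditEntry): AuditEntry {
    const stored = Object.freeze({
      ...entry,
      override: entry.override ? Object.freeze({ ...entry.override }) : null,
    });
    this.entries.push(stored);
    return stored;
  }

  // Return a copy of the list so callers cannot remove or reorder entries.
  list(): AuditEntry[] {
    return this.entries.slice();
  }
}
```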
Common Pitfalls
- Using free-form generation instead of structured output
  If you ask for "a summary", you will get prose whose shape changes from run to run. Use withStructuredOutput() with Zod so the model must conform to your schema.
- Ignoring page-level provenance
  Wealth documents often need line-of-business review later. Keep page references or source offsets so ops teams can verify where each field came from.
- Auto-approving low-confidence extractions
  Do not send extracted data straight into booking or CRM sync without review rules. Route incomplete KYC packs, unusual transfer instructions, and contradictory identity data to humans.
- Skipping regional controls
  If your firm handles cross-border clients, make sure the deployment respects data residency requirements before any document leaves the approved region.
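For the page-level provenance pitfall, one simple approach is to look up each extracted value in the per-page text and record the pages where it appears. This is a sketch under assumptions: `PageSource` and `findSourcePages` are hypothetical names, and the match is a case-insensitive, whitespace-insensitive substring search, which will miss values the model reformatted.

```typescript
// Text of a single page, as kept by the text extraction layer.
interface PageSource {
  page: number; // 1-based page number
  text: string;
}

// Lowercase and collapse whitespace so layout differences do not block a match.
function squash(s: string): string {
  return s.replace(/\s+/g, " ").trim().toLowerCase();
}

// Return the page numbers whose text contains the extracted value.
function findSourcePages(pages: PageSource[], value: string): number[] {
  const needle = squash(value);
  if (needle === "") return [];
  return pages
    .filter((p) => squash(p.text).indexOf(needle) !== -1)
    .map((p) => p.page);
}
```

An empty result is itself useful: a value that cannot be found on any page is a candidate for human review.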
By Cyprian Aarons, AI Consultant at Topiax.