How to Build a Document Extraction Agent Using CrewAI in TypeScript for Wealth Management

By Cyprian Aarons
Updated 2026-04-21
Tags: document-extraction, crewai, typescript, wealth-management

A document extraction agent for wealth management takes client documents (statements, KYC packs, account opening forms, and advisory letters, usually as PDFs) and turns them into structured data your downstream systems can trust. That matters because wealth firms depend on accuracy, auditability, and turnaround time: bad extraction slows onboarding, breaks suitability workflows, and creates compliance risk.

Architecture

Build this agent with a narrow, auditable pipeline:

  • Document intake layer

    • Accept PDFs, scans, and text files from secure storage or an internal upload service.
    • Enforce tenant isolation and region-specific storage for data residency.
  • Pre-processing layer

    • OCR scanned pages.
    • Split documents into logical chunks by page or section.
    • Normalize text before the LLM sees it.
  • CrewAI orchestration layer

    • Use a Crew with one or more Agent instances focused on extraction, validation, and compliance checks.
    • Keep tasks small and deterministic.
  • Schema validation layer

    • Convert extracted output into typed TypeScript objects.
    • Reject output that is missing required fields such as client name, account number, risk profile, or advisor signature date.
  • Audit and lineage layer

    • Persist input hashes, prompt versions, model versions, and output diffs.
    • Store every extraction event for review by operations and compliance.
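The audit and lineage layer can be sketched as a small record builder. This is a minimal sketch, not a definitive implementation: the `ExtractionAuditRecord` shape and the `sha256Hex` and `buildAuditRecord` names are assumptions made for illustration, and where the record gets persisted is left open.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of one audit entry; field names are illustrative.
interface ExtractionAuditRecord {
  inputSha256: string;
  promptVersion: string;
  modelVersion: string;
  extractedAt: string; // ISO 8601 timestamp
  output: unknown;
}

// Hash the exact bytes sent for extraction so the record can be matched
// to the stored source document later.
function sha256Hex(input: Uint8Array | string): string {
  return createHash("sha256").update(input).digest("hex");
}

// Assemble one audit entry. `now` is injectable so the function is testable.
function buildAuditRecord(
  input: Uint8Array | string,
  promptVersion: string,
  modelVersion: string,
  output: unknown,
  now: Date = new Date(),
): ExtractionAuditRecord {
  return {
    inputSha256: sha256Hex(input),
    promptVersion,
    modelVersion,
    extractedAt: now.toISOString(),
    output,
  };
}
```

Hashing the raw input rather than the normalized text keeps the record tied to the document as it was received, which is usually what a reviewer needs to reproduce an extraction.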

Implementation

1) Install dependencies and define the data model

You want strict types before you touch any LLM output. For wealth management documents, the schema usually includes identity fields, account metadata, advisor details, and compliance markers.

npm install @crewai/core zod pdf-parse

// schema.ts
import { z } from "zod";

export const WealthDocSchema = z.object({
  clientName: z.string().min(1),
  accountNumber: z.string().min(1),
  documentType: z.enum(["KYC", "Statement", "AccountOpening", "AdvisoryLetter"]),
  advisorName: z.string().optional(),
  effectiveDate: z.string().min(1),
  riskProfile: z.enum(["Conservative", "Balanced", "Growth"]).optional(),
  jurisdiction: z.string().min(1),
});

export type WealthDoc = z.infer<typeof WealthDocSchema>;
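The schema above accepts any non-empty string for effectiveDate, so a value like "sometime in March" would pass. A stricter check might look like the sketch below. It assumes dates are normalized to YYYY-MM-DD before validation, which the original does not specify, and the `isIsoCalendarDate` name is hypothetical.

```typescript
// Returns true only for real calendar dates written as YYYY-MM-DD.
// Assumption: the pipeline normalizes dates to this format before validation.
export function isIsoCalendarDate(value: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (!match) return false;

  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);

  // Date.UTC rolls nonexistent dates over (Feb 30 becomes Mar 1), so
  // round-trip the components to detect dates that do not exist.
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}
```

With Zod, a predicate like this could be attached to effectiveDate through .refine(), so malformed dates are rejected at the same point as missing fields.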

2) Extract text from the document

For production use, keep OCR outside the agent if possible. The agent should reason over clean text, not raw PDFs.

import fs from "node:fs/promises";
import pdfParse from "pdf-parse";

export async function extractTextFromPdf(path: string): Promise<string> {
  const buffer = await fs.readFile(path);
  const parsed = await pdfParse(buffer);
  return parsed.text;
}
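Text pulled out of a PDF usually carries inconsistent line endings and runs of whitespace, and long documents may need to be split before they reach the model. The sketch below shows one way to handle both. The `normalizeText` and `chunkByParagraph` names and the 4000-character default are assumptions, and paragraph-based chunking stands in here for the page or section splitting described in the architecture.

```typescript
// Collapse whitespace noise without changing the words themselves.
export function normalizeText(raw: string): string {
  return raw
    .replace(/\r\n?/g, "\n") // unify line endings
    .replace(/[ \t]+/g, " ") // collapse runs of spaces and tabs
    .replace(/ ?\n ?/g, "\n") // drop spaces around line breaks
    .replace(/\n{3,}/g, "\n\n") // keep at most one blank line
    .trim();
}

// Group paragraphs into chunks of at most maxChars characters.
// A single paragraph longer than maxChars becomes its own oversized chunk.
export function chunkByParagraph(text: string, maxChars = 4000): string[] {
  const chunks: string[] = [];
  let current = "";

  for (const paragraph of text.split("\n\n")) {
    if (paragraph === "") continue;
    const candidate = current === "" ? paragraph : `${current}\n\n${paragraph}`;
    if (candidate.length > maxChars && current !== "") {
      chunks.push(current);
      current = paragraph;
    } else {
      current = candidate;
    }
  }

  if (current !== "") chunks.push(current);
  return chunks;
}
```

The normalizer deliberately avoids rejoining hyphenated words or rewriting punctuation, since account numbers and dates in financial documents should reach the model exactly as printed.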

3) Build CrewAI agents and tasks in TypeScript

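This is the core pattern: one agent extracts structured fields, another validates against wealth-management rules. CrewAI is distributed primarily as a Python library, so treat the @crewai/core package name and the imports below as assumptions and check them against the TypeScript binding you actually install. The pattern itself relies on four constructs: Agent, Task, Crew, and Process.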

// crew.ts
import { Agent, Task, Crew, Process } from "@crewai/core";

const extractor = new Agent({
  role: "Document Extraction Specialist",
  goal: "Extract structured wealth management fields from client documents",
  backstory:
    "You extract only what is present in the source text. You do not invent missing fields.",
});

const validator = new Agent({
  role: "Compliance Validation Specialist",
  goal: "Check extracted fields for completeness and wealth management compliance",
  backstory:
    "You verify that outputs are suitable for downstream onboarding and audit review.",
});

export async function runExtraction(documentText: string) {
  const extractionTask = new Task({
    description: `
Extract these fields from the document:
- clientName
- accountNumber
- documentType
- advisorName if present
- effectiveDate
- riskProfile if present
- jurisdiction

Return valid JSON only.
Document:
${documentText}
`,
    expectedOutput: "JSON object matching the required schema",
    agent: extractor,
  });

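  // Assumption: kickoff() resolves to the final task's output, as in the Python
  // CrewAI API, so this task must return the JSON that step 4 parses.
  const validationTask = new Task({
    description:
      "Validate the extracted JSON for completeness and wealth-management suitability. Set any missing or suspicious value to null rather than guessing. Do not add fields.",
    expectedOutput: "The validated JSON object only, with no commentary",
    agent: validator,
    context: [extractionTask],
  });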

  const crew = new Crew({
    agents: [extractor, validator],
    tasks: [extractionTask, validationTask],
    process: Process.sequential,
  });

  const result = await crew.kickoff();

  return result;
}

4) Parse output and enforce schema before persistence

Never write raw model output directly to your database. Validate it first with Zod or your preferred schema library.

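// Module paths below are assumptions; point them at wherever steps 2 and 3 live.
import { extractTextFromPdf } from "./extractText";
import { runExtraction } from "./crew";
import { WealthDocSchema, type WealthDoc } from "./schema";

export async function handleDocument(path: string): Promise<WealthDoc> {
  const text = await extractTextFromPdf(path);
  const crewResult = await runExtraction(text);

  // Depending on your CrewAI version this may be a string or a structured object.
  let rawJson: unknown;
  try {
    rawJson =
      typeof crewResult === "string" ? JSON.parse(crewResult) : crewResult;
  } catch (err) {
    // JSON.parse throws on prose or fenced output; report it as an extraction failure.
    throw new Error(
      `Extraction output is not valid JSON: ${(err as Error).message}`,
    );
  }

  const parsed = WealthDocSchema.safeParse(rawJson);

  if (!parsed.success) {
    throw new Error(`Invalid extraction output: ${parsed.error.message}`);
  }

  return parsed.data;
}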
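Even when a prompt says "Return valid JSON only", model output often arrives wrapped in a code fence or a sentence of explanation, and calling JSON.parse on it directly throws. A tolerant pre-parse step could look like the sketch below. The `extractJsonObject` name is hypothetical, and the approach of taking the span from the first "{" to the last "}" is a simple heuristic, not a full parser.

```typescript
// Parse the span from the first "{" to the last "}" in a model response.
// Returns null instead of throwing so the caller can route the document
// to manual review.
export function extractJsonObject(raw: string): Record<string, unknown> | null {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  try {
    // The slice starts with "{", so a successful parse is always an object.
    return JSON.parse(raw.slice(start, end + 1)) as Record<string, unknown>;
  } catch {
    return null;
  }
}
```

The result still has to pass WealthDocSchema.safeParse. This helper only recovers the JSON text and makes no claim about whether the fields are correct.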

Production Considerations

  • Data residency

    • Keep documents in-region if you operate across multiple jurisdictions.
    • Pin model endpoints and storage to approved regions for EU/UK/US policy separation.
  • Auditability

    • Log the document hash, prompt version, task IDs, model name, timestamp, and final structured payload.
  • Compliance guardrails

    • Add deterministic checks for required KYC fields before routing to onboarding.
    • Flag documents that mention trusts, POAs, sanctions language, or discretionary mandates for manual review.
  • Monitoring

    • Track field-level accuracy by document type.
    • Alert on spikes in null accountNumber, malformed dates, or unusually high manual override rates.
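The manual-review guardrail above can be kept fully deterministic with a plain pattern scan over the document text. The sketch below is illustrative only: the `REVIEW_TRIGGERS` list and the `findReviewTriggers` name are assumptions, and a real trigger list would come from your compliance team.

```typescript
// Illustrative trigger patterns; none use the global flag, so test() is stateless.
const REVIEW_TRIGGERS: ReadonlyArray<{ label: string; pattern: RegExp }> = [
  { label: "trust", pattern: /\btrust(s|ee|ees)?\b/i },
  { label: "power_of_attorney", pattern: /\b(power of attorney|POA)\b/i },
  { label: "sanctions", pattern: /\bsanction(s|ed)?\b/i },
  {
    label: "discretionary_mandate",
    pattern: /\bdiscretionary (mandate|management)\b/i,
  },
];

// Return the labels of every trigger found in the text, in list order.
export function findReviewTriggers(text: string): string[] {
  return REVIEW_TRIGGERS.filter(({ pattern }) => pattern.test(text)).map(
    ({ label }) => label,
  );
}
```

A scan like this will produce false positives (for example, "trust" inside a bank's name), which is acceptable when the only consequence is a human looking at the document.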

Common Pitfalls

  • Letting the agent infer missing data

    If a statement does not contain a risk profile or jurisdiction explicitly, do not ask the model to “fill it in.” Require nulls plus a review flag instead of fabricated values.

  • Skipping schema validation

    CrewAI will give you generated content; your app must decide whether it is acceptable. Use strict parsing with Zod before anything reaches CRM, portfolio systems, or onboarding workflows.

  • Mixing extraction with business decisions

    Keep extraction separate from suitability assessment or approval logic. The agent should extract facts; downstream rules engines should decide whether a case passes compliance thresholds.
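The first pitfall, requiring nulls plus a review flag, can be enforced in code instead of relying on the prompt alone. The sketch below assumes the extractor emits null for anything it cannot find. The `REQUIRED_FIELDS` list mirrors the required fields in the schema from step 1, and the `checkRequiredFields` name is hypothetical.

```typescript
// Hypothetical list mirroring the non-optional fields of the step 1 schema.
const REQUIRED_FIELDS = [
  "clientName",
  "accountNumber",
  "documentType",
  "effectiveDate",
  "jurisdiction",
] as const;

type RequiredField = (typeof REQUIRED_FIELDS)[number];

interface ReviewDecision {
  needsReview: boolean;
  missingFields: RequiredField[];
}

// A field counts as missing when it is absent, null, or a blank string.
export function checkRequiredFields(
  extracted: Record<string, unknown>,
): ReviewDecision {
  const missingFields = REQUIRED_FIELDS.filter((field) => {
    const value = extracted[field];
    return (
      value === undefined ||
      value === null ||
      (typeof value === "string" && value.trim() === "")
    );
  });

  return { needsReview: missingFields.length > 0, missingFields };
}
```

Running a check like this before strict schema validation lets incomplete documents go to a review queue with a list of what is missing, instead of failing with a generic validation error.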


By Cyprian Aarons, AI Consultant at Topiax.
