How to Build a Document Extraction Agent Using CrewAI in Python for Wealth Management

By Cyprian Aarons · Updated 2026-04-21

document-extraction, crewai, python, wealth-management

A document extraction agent for wealth management takes client PDFs, statements, KYC forms, suitability reports, and account opening packets, then turns them into structured data your downstream systems can trust. It matters because advisors and operations teams spend too much time rekeying data, and mistakes here create compliance risk, onboarding delays, and bad client records.

Architecture

•Document intake layer
  - Accepts PDFs, scans, and text files from secure storage or an internal upload service.
  - Enforces file type checks and size limits before processing.
•Extraction agent
  - Uses CrewAI Agent to read documents and extract named entities, account details, dates, holdings, risk profiles, and compliance fields.
  - Produces structured JSON instead of free-form summaries.
•Validation layer
  - Checks extracted values against schema rules.
  - Flags missing mandatory fields like client name, account number, advisor name, and document date.
•Audit trail store
  - Persists raw input metadata, extracted output, model version, timestamps, and reviewer actions.
  - Required for wealth management auditability and regulatory review.
•Human review step
  - Routes low-confidence or high-risk documents to an operations analyst.
  - Prevents silent failures on sensitive onboarding or suitability documents.
•Downstream integration
  - Pushes clean data into CRM, portfolio accounting, onboarding workflow tools, or compliance systems.
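
As an illustration of the intake layer's file type and size checks, here is a minimal sketch using only the standard library. The allowed extensions, the 20 MB limit, and the name `check_intake` are assumptions made for this example, not values from any particular platform. An extension check is only a first filter; it does not inspect file contents.

```python
from pathlib import Path

# Hypothetical policy values; replace with your own intake rules.
ALLOWED_EXTENSIONS = {".pdf", ".txt", ".png", ".jpg", ".jpeg", ".tiff"}
MAX_SIZE_BYTES = 20 * 1024 * 1024  # 20 MB, an assumed limit


def check_intake(filename: str, size_bytes: int) -> list[str]:
    """Return rejection reasons; an empty list means the file may proceed."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or 'none'}")
    if size_bytes <= 0:
        problems.append("file is empty")
    elif size_bytes > MAX_SIZE_BYTES:
        problems.append("file exceeds size limit")
    return problems


if __name__ == "__main__":
    print(check_intake("kyc_form.pdf", 48_000))
    print(check_intake("macro.xlsm", 48_000))
```

Returning a list of reasons rather than a boolean makes it easier to record why a file was rejected.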

Implementation

•Install dependencies and define the extraction schema

Use CrewAI for orchestration and Pydantic for structured output validation. In production you want a strict schema so the agent cannot drift into narrative answers.

pip install crewai pydantic

from pydantic import BaseModel, Field
from typing import List, Optional

class WealthDocExtraction(BaseModel):
    client_name: str = Field(..., description="Full legal name of the client")
    document_type: str = Field(..., description="KYC form, statement, suitability report, etc.")
    account_number: Optional[str] = Field(None, description="Account or reference number")
    document_date: Optional[str] = Field(None, description="ISO date if available")
    advisor_name: Optional[str] = Field(None, description="Assigned advisor or relationship manager")
    holdings: List[str] = Field(default_factory=list)
    risk_profile: Optional[str] = None
    missing_fields: List[str] = Field(default_factory=list)
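
The schema marks most fields as optional, so a payload can pass validation while still lacking fields the validation layer treats as mandatory. The sketch below shows one way to compute that list deterministically instead of relying on the model to fill in `missing_fields`. The `MANDATORY_FIELDS` tuple and the `find_missing_fields` helper are hypothetical names for this example; the field list is taken from the validation-layer description above.

```python
# Assumed mandatory fields, mirroring the validation-layer description.
MANDATORY_FIELDS = ("client_name", "account_number", "advisor_name", "document_date")


def find_missing_fields(payload: dict) -> list[str]:
    """Return mandatory field names that are absent, None, or blank strings."""
    missing = []
    for name in MANDATORY_FIELDS:
        value = payload.get(name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    sample = {"client_name": "Sarah Chen", "account_number": "", "advisor_name": None}
    print(find_missing_fields(sample))
```

With Pydantic v2, the same function can be applied to `model_dump()` output from a validated `WealthDocExtraction` instance.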

•Create the CrewAI agent with a narrow role

Keep the role specific. A general-purpose assistant is a bad fit for regulated extraction work because it increases hallucination risk.

from crewai import Agent

document_extractor = Agent(
    role="Wealth Management Document Extraction Specialist",
    goal="Extract structured data from wealth management documents with high accuracy",
    backstory=(
        "You extract client and account data from financial documents. "
        "You return only verified fields from the source text and flag missing data."
    ),
    verbose=True,
    allow_delegation=False,
)

•Build the task with explicit output requirements

The task should force structured output. For wealth management workflows you want the agent to return exactly what downstream systems expect.

from crewai import Task

extraction_task = Task(
    description=(
        "Extract key fields from the wealth management document text below.\n"
        "Return client_name, document_type, account_number if present, document_date if present,\n"
        "advisor_name if present, holdings as a list of strings, risk_profile if present,\n"
        "and missing_fields for any required field not found.\n"
        "Do not invent values.\n\n"
        # {document_text} is filled in from the inputs passed to crew.kickoff().
        "Document text:\n{document_text}"
    ),
    expected_output=(
        "A valid JSON object matching the WealthDocExtraction schema, with no surrounding prose."
    ),
    agent=document_extractor,
)

•Run the crew and validate the result

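CrewAI’s Crew class coordinates execution. In this pattern you pass the OCR/text output from your document pipeline through kickoff(inputs=...). CrewAI interpolates each input into the matching {placeholder} in the task description, so the description must contain {document_text} for the document to reach the agent.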

from crewai import Crew
from pydantic import ValidationError

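def extract_document(document_text: str) -> WealthDocExtraction:
    crew = Crew(
        agents=[document_extractor],
        tasks=[extraction_task],
        verbose=True,
    )

    # kickoff() interpolates inputs into {placeholders} in the task description.
    result = crew.kickoff(inputs={"document_text": document_text})

    # Recent CrewAI versions return a CrewOutput whose .raw holds the model's text;
    # fall back to str(result) if that attribute is not available in your version.
    raw = getattr(result, "raw", None) or str(result)

    # Raises ValidationError if the text is not valid JSON or does not match the schema.
    parsed = WealthDocExtraction.model_validate_json(raw)
    return parsed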

sample_text = """
Client Name: Sarah Chen
Document Type: Advisory Agreement
Account Number: WM-88421
Document Date: 2025-03-12
Advisor Name: Michael Grant
Holdings: US Treasury Bond Fund; Global Equity ETF
Risk Profile: Moderate
"""

try:
    extracted = extract_document(sample_text)
    print(extracted.model_dump())
except ValidationError as e:
    print("Validation failed:", e)
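
Validation fails if the model wraps its JSON in a Markdown code fence or adds a sentence before or after it, which language models often do even when told not to. The sketch below is one tolerant pre-parsing step you could run on the raw text before schema validation. `extract_json_object` is a hypothetical helper, not part of CrewAI or Pydantic, and the brace-matching fallback is a simple heuristic rather than a full parser.

```python
import json
import re

# Built from single backticks so this example contains no literal fence marker.
FENCE = "`" * 3
FENCE_PATTERN = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_json_object(raw: str) -> dict:
    """Pull a single JSON object out of model output that may include extra text."""
    text = raw.strip()
    fenced = FENCE_PATTERN.search(text)
    if fenced:
        # Prefer the contents of the first fenced block.
        text = fenced.group(1)
    else:
        # Heuristic fallback: take everything from the first "{" to the last "}".
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in model output")
        text = text[start : end + 1]
    payload = json.loads(text)  # raises json.JSONDecodeError (a ValueError) if malformed
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object at the top level")
    return payload


if __name__ == "__main__":
    noisy = 'Here is the result: {"client_name": "Sarah Chen", "holdings": []}'
    print(extract_json_object(noisy))
```

The resulting dict can then be passed to `WealthDocExtraction.model_validate(...)`, so the schema check still decides whether the payload is accepted.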

Production Considerations

•Audit everything
  - Store raw document hashes, extracted JSON, model name/version, prompt version, timestamp, and reviewer override decisions.
  - Wealth management teams need traceability for disputes and regulator requests.
•Add confidence-based routing
  - Send documents with missing mandatory fields or ambiguous values to human review.
  - Use stricter thresholds for onboarding forms than for low-risk internal statements.
•Control data residency
  - Keep document processing in approved regions only.
  - If client documents contain PII or investment data subject to jurisdictional constraints, do not send them to unmanaged endpoints.
•Put guardrails around extraction scope
  - Restrict the agent to extraction only; no advice generation.
  - Block it from inferring suitability conclusions or recommending products.
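
To make the audit items concrete, here is a minimal sketch of an audit record built with the standard library. The field names, the `build_audit_record` helper, and the example model and prompt identifiers are all assumptions for this example; where the record is persisted is left to your own storage layer.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def build_audit_record(
    document_bytes: bytes,
    extracted: dict,
    model_name: str,
    prompt_version: str,
    reviewer_action: Optional[str] = None,
) -> dict:
    """Assemble a JSON-serializable audit entry for one extraction run."""
    return {
        # Hash of the raw input, so the record can be matched to the source file
        # without storing the document itself in the audit table.
        "document_sha256": hashlib.sha256(document_bytes).hexdigest(),
        "extracted": extracted,
        "model_name": model_name,
        "prompt_version": prompt_version,
        "reviewer_action": reviewer_action,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    record = build_audit_record(
        b"Client Name: Sarah Chen",
        {"client_name": "Sarah Chen"},
        model_name="example-model-v1",        # placeholder identifier
        prompt_version="extraction-prompt-1",  # placeholder identifier
    )
    print(json.dumps(record, indent=2))
```

Recording the timestamp in UTC avoids ambiguity when records from several regions are reviewed together.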

Common Pitfalls

•Using free-form LLM output in production
  - Mistake: letting the agent return paragraphs that another service has to parse.
  - Fix: require a strict schema like WealthDocExtraction and reject invalid payloads immediately.
•Skipping OCR/text normalization
  - Mistake: feeding raw scanned PDFs directly into the agent.
  - Fix: run OCR first and normalize line breaks, headers, footers, and page numbers before extraction.
•Ignoring compliance exceptions
  - Mistake: treating every missing field as a routine failure.
  - Fix: classify exceptions by severity. Missing account numbers on onboarding docs should trigger escalation; missing optional advisor names may not.
•Not separating extraction from business logic
  - Mistake: asking the agent to both extract data and decide whether a client is suitable.
  - Fix: limit the agent to extraction. Suitability checks belong in rule engines or compliance workflows that run after validation.
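
Severity classification can live in plain code outside the agent. The sketch below assumes a hypothetical policy table keyed by document category; the categories, the field sets, and the three outcome labels are illustrative and would need to come from your compliance team.

```python
# Hypothetical policy: fields whose absence forces escalation, per document category.
ESCALATE_IF_MISSING = {
    "onboarding": {"client_name", "account_number", "document_date"},
    "statement": {"client_name", "account_number"},
}
# Assumed fallback for categories not listed above.
DEFAULT_ESCALATE = {"client_name"}


def route_document(document_category: str, missing_fields: list[str]) -> str:
    """Return 'escalate', 'accept_with_warning', or 'accept' for one document."""
    critical = ESCALATE_IF_MISSING.get(document_category, DEFAULT_ESCALATE)
    missing = set(missing_fields)
    if missing & critical:
        return "escalate"
    if missing:
        # Something is missing, but nothing this policy treats as critical.
        return "accept_with_warning"
    return "accept"


if __name__ == "__main__":
    print(route_document("onboarding", ["account_number"]))
    print(route_document("onboarding", ["advisor_name"]))
```

Keeping the policy in a table means compliance can change thresholds without touching the prompt or the agent.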


By Cyprian Aarons, AI Consultant at Topiax.
