How to Build a Document Extraction Agent Using LlamaIndex in Python for Wealth Management

By Cyprian Aarons · Updated 2026-04-21
document-extraction · llamaindex · python · wealth-management

A document extraction agent for wealth management takes client PDFs, statements, KYC packs, IPS documents, and onboarding forms, then turns them into structured fields your downstream systems can trust. That matters because most operational risk in wealth firms comes from manual rekeying, inconsistent interpretation, and missing audit trails across client servicing, compliance, and reporting.

Architecture

A production-grade extraction agent for wealth management needs these pieces:

  • Document ingestion layer

    • Accept PDFs, scans, and office documents from secure storage or internal upload portals.
    • Normalize file access before parsing.
  • Document parser

    • Use LlamaIndex readers/loaders to turn files into Document objects.
    • Preserve source metadata like filename, page numbers, and timestamps.
  • Extraction schema

    • Define the fields you want back: client name, account number, tax residency, risk profile, beneficial owner details, etc.
    • Keep the schema stable so downstream systems do not break.
  • LLM extraction engine

    • Use LlamaIndex’s structured prediction path to map unstructured text into typed Python models.
    • This is where the agent converts narrative content into machine-readable output.
  • Validation and compliance layer

    • Check required fields, format constraints, and policy rules.
    • Flag missing KYC data or suspicious mismatches for human review.
  • Audit logging layer

    • Store source document IDs, extracted values, model version, prompt version, and reviewer decisions.
    • Wealth management teams need traceability for internal audit and regulators.
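
The layers above can be wired together as a plain pipeline. Here is a minimal sketch, assuming stage functions you supply yourself (`parse`, `extract`, `validate`, and `log_audit` are placeholders, not LlamaIndex APIs):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ExtractionPipeline:
    # Each layer is a plain callable so stages can be swapped or tested alone.
    parse: Callable[[str], str]                   # ingestion + parsing -> raw text
    extract: Callable[[str], dict]                # LLM extraction -> structured fields
    validate: Callable[[dict], List[str]]         # compliance checks -> error list
    log_audit: Callable[[dict, List[str]], None]  # audit trail writer

    def run(self, path: str) -> dict:
        text = self.parse(path)
        fields = self.extract(text)
        errors = self.validate(fields)
        self.log_audit(fields, errors)
        return {"fields": fields, "errors": errors}

# Wire it up with stub stages to see the flow end to end.
pipeline = ExtractionPipeline(
    parse=lambda p: f"text of {p}",
    extract=lambda t: {"client_name": "Acme Trust"},
    validate=lambda f: [] if f.get("client_name") else ["client_name is required"],
    log_audit=lambda f, e: None,
)
outcome = pipeline.run("onboarding.pdf")
```

Keeping each stage as an independent callable makes it easy to replace the stub extractor with the LlamaIndex call built in the steps below.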

Implementation

1. Install dependencies and set up the LLM

Use LlamaIndex with a model that supports structured output well. In practice, OpenAI models are common for extraction workflows because they handle schema-constrained responses reliably.

pip install llama-index llama-index-llms-openai pydantic python-dotenv

import os
from dotenv import load_dotenv

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

load_dotenv()

Settings.llm = OpenAI(
    model="gpt-4o-mini",
    temperature=0,
    api_key=os.environ["OPENAI_API_KEY"],
)

2. Define the wealth management extraction schema

Keep the schema explicit. If your operations team needs it in a case management system or CRM, define exactly what they need up front.

from typing import List, Optional
from pydantic import BaseModel, Field

class BeneficialOwner(BaseModel):
    full_name: str = Field(..., description="Beneficial owner's full legal name")
    ownership_percentage: Optional[float] = Field(None, description="Ownership percentage if stated")

class WealthClientProfile(BaseModel):
    client_name: str = Field(..., description="Primary client or entity name")
    account_number: Optional[str] = Field(None, description="Account or portfolio number")
    tax_residency: List[str] = Field(default_factory=list)
    risk_profile: Optional[str] = Field(None, description="Conservative, balanced, growth, etc.")
    kyc_status: Optional[str] = Field(None, description="KYC status stated in the document")
    beneficial_owners: List[BeneficialOwner] = Field(default_factory=list)
    source_document_type: Optional[str] = Field(None)
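
Before wiring the schema into an LLM call, it helps to sanity-check that Pydantic enforces it the way your operations team expects. A small self-contained demo (the payloads are invented, and the classes are a trimmed copy of the schema above):

```python
from typing import List, Optional
from pydantic import BaseModel, Field, ValidationError

# Trimmed copy of the step-2 schema, just enough for a self-contained demo.
class BeneficialOwner(BaseModel):
    full_name: str
    ownership_percentage: Optional[float] = None

class WealthClientProfile(BaseModel):
    client_name: str
    tax_residency: List[str] = Field(default_factory=list)
    beneficial_owners: List[BeneficialOwner] = Field(default_factory=list)

# A well-formed payload parses into typed objects.
profile = WealthClientProfile.model_validate({
    "client_name": "Alexandra Meyer Family Trust",
    "tax_residency": ["CH", "DE"],
    "beneficial_owners": [
        {"full_name": "Alexandra Meyer", "ownership_percentage": 60.0}
    ],
})

# A payload missing the required client_name is rejected outright.
rejected = False
try:
    WealthClientProfile.model_validate({"tax_residency": ["CH"]})
except ValidationError:
    rejected = True
```

This is the same validation the structured-prediction path relies on, so any field the LLM cannot populate correctly surfaces as a typed error rather than silent bad data.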

3. Load documents and run structured extraction with LlamaIndex

For local files or uploaded PDFs, use SimpleDirectoryReader. Then pass the resulting Document objects to structured_predict, which is the core pattern for extraction. (Inside async applications, use the awaitable astructured_predict variant instead.)

from llama_index.core import SimpleDirectoryReader
from llama_index.core.prompts import PromptTemplate

docs = SimpleDirectoryReader(
    input_dir="./wealth_docs",
    recursive=True,
).load_data()

extraction_prompt = PromptTemplate(
    """
You are extracting structured data from wealth management documents.
Return only fields supported by the schema.

Document text:
{context_str}
"""
)

result = Settings.llm.structured_predict(
    WealthClientProfile,
    prompt=extraction_prompt,
    context_str="\n\n".join(doc.text for doc in docs[:3]),
)

print(result.model_dump())

That pattern is enough for a first pass when documents are short or already OCR’d. For longer files, chunk first and extract per section so you do not exceed context limits or lose page-level provenance.
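
One way to sketch that chunk-then-extract loop in plain Python (the character-window splitter below is a stand-in; LlamaIndex's SentenceSplitter is the better choice in practice because it respects sentence boundaries, and `extract` is whatever structured-prediction call you run per chunk):

```python
from typing import Callable, List

def chunk_text(text: str, max_chars: int = 4000, overlap: int = 200) -> List[str]:
    # Naive sliding character window with overlap so entities that straddle
    # a boundary still appear whole in at least one chunk.
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks

def extract_per_chunk(text: str, extract: Callable[[str], dict]) -> List[dict]:
    # Tag each result with its chunk index so page-level provenance
    # can be reconstructed when results are merged downstream.
    return [{"chunk": i, **extract(c)} for i, c in enumerate(chunk_text(text))]

# Stub extractor standing in for the structured_predict call above.
results = extract_per_chunk("a" * 9000, lambda c: {"chars_seen": len(c)})
```

Merging per-chunk results back into one profile (e.g. preferring non-null values, or escalating conflicts to review) is a policy decision worth making explicit rather than leaving to the model.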

4. Add validation and audit logging

Extraction without validation is how bad data enters portfolio systems. Validate critical fields before writing anything downstream.

from datetime import datetime, timezone
import json

def validate_profile(profile: WealthClientProfile) -> list[str]:
    errors = []
    if not profile.client_name:
        errors.append("client_name is required")
    if not profile.tax_residency:
        errors.append("tax_residency is required")
    if profile.risk_profile and profile.risk_profile.lower() not in {
        "conservative", "balanced", "growth", "aggressive"
    }:
        errors.append(f"unexpected risk_profile: {profile.risk_profile}")
    return errors

errors = validate_profile(result)

audit_record = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "document_count": len(docs),
    "extracted": result.model_dump(),
    "validation_errors": errors,
}

with open("audit_log.jsonl", "a") as f:
    f.write(json.dumps(audit_record) + "\n")

Production Considerations

  • Data residency

    • Keep client documents in-region if your firm operates under local banking secrecy or cross-border privacy rules.
    • If you use a hosted LLM API, confirm where prompts and outputs are processed and retained.
  • Auditability

    • Log document hashes, extraction outputs, prompt versions, model versions, and reviewer overrides.
    • In wealth management, you need to show why a field was extracted a certain way during an audit or dispute.
  • Human-in-the-loop review

    • Route low-confidence or policy-sensitive cases to operations staff.
    • Examples: politically exposed persons (PEP) references, tax residency conflicts, missing beneficial ownership data.
  • Guardrails

    • Reject writes when required compliance fields are missing.
    • Never auto-update client records from unverified extractions without approval gates.
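
The auditability and guardrail points can be combined into a small write gate in front of your CRM or portfolio system. A sketch, where REQUIRED_FIELDS and the `approved` flag are assumptions about your firm's policy and review workflow:

```python
import hashlib
from datetime import datetime, timezone

# Assumed compliance minimum; adjust to your firm's policy.
REQUIRED_FIELDS = {"client_name", "tax_residency"}

def document_hash(raw_bytes: bytes) -> str:
    # A content hash ties every extracted value back to one exact source file.
    return hashlib.sha256(raw_bytes).hexdigest()

def gated_write(extracted: dict, source_bytes: bytes, approved: bool) -> dict:
    missing = [f for f in REQUIRED_FIELDS if not extracted.get(f)]
    decision = {
        "doc_sha256": document_hash(source_bytes),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "missing_fields": missing,
        "written": False,
    }
    # Reject the write when compliance fields are missing or no
    # human approval gate has been passed.
    if not missing and approved:
        decision["written"] = True  # here you would call your CRM writer
    return decision
```

Every decision record, written or rejected, belongs in the audit log alongside the extraction output so reviewers can reconstruct why a record was or was not updated.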

Common Pitfalls

  1. Using raw OCR text without page/source metadata

    • You lose traceability when compliance asks where a field came from.
    • Fix it by storing page numbers and original filenames alongside every extracted value.
  2. Treating all documents the same

    • A quarterly statement is not an onboarding pack.
    • Use document-type routing so your prompts and schemas match the source material.
  3. Skipping validation before persistence

    • LLMs will occasionally infer missing values or normalize names incorrectly.
    • Always validate against business rules before pushing data into CRM, portfolio accounting, or KYC systems.
  4. Ignoring model drift and prompt drift

    • A small change in prompt wording can alter extraction quality.
    • Version prompts like code and keep regression tests on representative wealth-management documents.
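
The document-type routing from pitfall 2 can be as simple as a lookup from classified type to a versioned prompt. A sketch with a filename-based stand-in classifier (the prompt names are invented, and production systems usually classify with an LLM or layout model rather than filename heuristics):

```python
from typing import Dict

# Hypothetical routing table: each document type gets its own
# versioned prompt (and, by extension, its own schema).
ROUTES: Dict[str, str] = {
    "quarterly_statement": "statement_extraction_prompt_v3",
    "onboarding_pack": "onboarding_extraction_prompt_v2",
    "kyc_form": "kyc_extraction_prompt_v5",
}

def classify(filename: str) -> str:
    # Stand-in classifier for illustration only.
    name = filename.lower()
    if "statement" in name:
        return "quarterly_statement"
    if "kyc" in name:
        return "kyc_form"
    return "onboarding_pack"

def route(filename: str) -> str:
    return ROUTES[classify(filename)]
```

Versioning the prompt names in the routing table also gives the audit log a stable identifier for exactly which prompt produced each extraction.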

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
