How to Build a KYC Verification Agent Using LangChain in Python for Wealth Management
A KYC verification agent for wealth management collects client identity data, checks it against policy, flags missing or inconsistent information, and produces an auditable decision trail. That matters because onboarding high-net-worth clients is not just a UX problem; it is a compliance gate tied to AML, suitability, source-of-funds checks, and jurisdiction-specific obligations.
Architecture
- Input normalization layer
  - Takes raw client data from CRM forms, PDFs, email extracts, or onboarding portals.
  - Converts it into a consistent schema: name, DOB, address, nationality, tax residency, beneficial owners, source of wealth.
- Policy and rules engine
  - Encodes firm-specific KYC requirements by product type and jurisdiction.
  - Decides whether the case is complete, needs manual review, or must be rejected.
- LLM orchestration layer
  - Uses LangChain to classify documents, extract entities, and summarize gaps.
  - Keeps the model on a tight leash with structured outputs.
- Evidence retrieval layer
  - Pulls internal policy docs, onboarding checklists, and regulatory playbooks from a vector store.
  - Grounds decisions in approved firm content instead of free-form model memory.
- Audit logging layer
  - Stores every prompt, retrieved document reference, extracted field, and final decision.
  - Gives compliance teams a defensible trail for reviews and regulator requests.
- Human review handoff
  - Routes borderline cases to ops or compliance analysts.
  - Prevents the agent from making final decisions on ambiguous or high-risk cases.
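The layered flow above can be sketched end to end in a few lines. Everything here is an illustrative stand-in, not a LangChain API: the function and class names (`normalize`, `apply_rules`, `run_case`, `CaseResult`) are assumptions for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class CaseResult:
    status: str                      # "PASS" | "REVIEW" | "REJECT"
    audit_trail: list = field(default_factory=list)

def normalize(raw: dict) -> dict:
    # Input normalization layer: coerce raw CRM/portal keys into one schema.
    return {k.strip().lower().replace(" ", "_"): v for k, v in raw.items()}

def apply_rules(profile: dict) -> str:
    # Policy and rules engine: deterministic hard stops before any LLM call.
    if profile.get("sanctions_hit"):
        return "REJECT"
    if profile.get("pep_status") or not profile.get("full_name"):
        return "REVIEW"
    return "PASS"

def run_case(raw: dict) -> CaseResult:
    profile = normalize(raw)
    status = apply_rules(profile)
    result = CaseResult(status=status)
    # Audit logging layer: record every step taken for this case.
    result.audit_trail.append({"step": "rules", "status": status})
    return result
```

The LLM orchestration and retrieval layers slot in between normalization and the rules engine; the sections below build them with LangChain.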
Implementation
1. Define the KYC schema and load policy context
For wealth management, you want structured extraction first. Do not let the model “chat” its way through onboarding; force it into fields you can validate.
```python
from typing import List, Optional

from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser

class KYCProfile(BaseModel):
    full_name: str = Field(description="Client legal full name")
    date_of_birth: str = Field(description="Date of birth in ISO format YYYY-MM-DD")
    address: str = Field(description="Residential address")
    nationality: str
    tax_residency: List[str]
    source_of_wealth: Optional[str] = None
    source_of_funds: Optional[str] = None
    pep_status: bool = False
    sanctions_hit: bool = False
    missing_fields: List[str] = []

parser = PydanticOutputParser(pydantic_object=KYCProfile)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You extract KYC fields for wealth management onboarding. "
               "Return only data that appears in the input."),
    ("human", "{client_packet}\n\n{format_instructions}")
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
chain = prompt | llm | parser
```
This pattern uses ChatPromptTemplate, ChatOpenAI, and PydanticOutputParser to keep output structured. That is the baseline you want before adding retrieval or automation.
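You can tighten the schema further so bad values fail at parse time rather than downstream. A minimal sketch, assuming Pydantic v2 (`field_validator`); the model name `KYCProfileStrict` is illustrative:

```python
from datetime import date
from pydantic import BaseModel, Field, field_validator

class KYCProfileStrict(BaseModel):
    full_name: str
    date_of_birth: str = Field(description="ISO format YYYY-MM-DD")

    @field_validator("date_of_birth")
    @classmethod
    def dob_must_be_iso(cls, v: str) -> str:
        # date.fromisoformat raises ValueError on anything non-ISO,
        # which Pydantic surfaces as a validation error.
        date.fromisoformat(v)
        return v
```

With this in place, a packet where the model extracts `11/04/1982` is rejected by the parser instead of silently flowing into the rules engine.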
2. Add retrieval over internal KYC policy documents
The agent should not invent policy. Use RetrievalQA-style patterns or LCEL composition with a retriever backed by your approved policy corpus.
```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

policy_docs = [
    Document(page_content="High-risk jurisdictions require enhanced due diligence."),
    Document(page_content="PEP cases require manual compliance approval."),
    Document(page_content="Source of wealth evidence must be retained for seven years.")
]

vectorstore = FAISS.from_documents(policy_docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

def assess_case(client_packet: str):
    extracted = chain.invoke({
        "client_packet": client_packet,
        "format_instructions": parser.get_format_instructions()
    })
    relevant_policies = retriever.invoke(client_packet)
    policy_text = "\n".join(doc.page_content for doc in relevant_policies)
    decision_prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a KYC reviewer for a wealth management firm. "
                   "Use only provided policy text."),
        ("human", "Client profile:\n{profile}\n\nPolicy:\n{policy}\n\n"
                  "Decide if this case is PASS, REVIEW, or REJECT. "
                  "Explain briefly and list missing evidence.")
    ])
    decision_chain = decision_prompt | llm
    return extracted, decision_chain.invoke({
        "profile": extracted.model_dump(),
        "policy": policy_text
    })
```
This gives you grounded decisions based on internal material. In regulated workflows, that matters more than model fluency.
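To make those grounded decisions citable, give each policy snippet a stable ID. With LangChain you would carry the ID in `Document(metadata={...})`; the sketch below uses plain dicts and an assumed `policy_id` key just to show the shape that feeds the audit record:

```python
# Illustrative policy corpus; the policy_id values are made up.
policy_corpus = [
    {"policy_id": "KYC-EDD-001",
     "page_content": "High-risk jurisdictions require enhanced due diligence."},
    {"policy_id": "KYC-PEP-002",
     "page_content": "PEP cases require manual compliance approval."},
]

def cited_policy_ids(retrieved: list) -> list:
    # Collect the IDs of whatever snippets the retriever returned,
    # ready to drop into the audit record alongside the decision.
    return [doc["policy_id"] for doc in retrieved]
```

Storing IDs instead of raw snippet text keeps audit records small and lets reviewers pull the exact policy version that was in force.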
3. Add deterministic rules before any final outcome
Wealth management KYC has hard stops. If sanctions are hit or mandatory fields are missing, do not ask the LLM to “reason it out.”
```python
MANDATORY_FIELDS = ["full_name", "date_of_birth", "address", "nationality"]

def rule_check(profile: KYCProfile) -> str:
    missing = [f for f in MANDATORY_FIELDS if not getattr(profile, f)]
    if profile.sanctions_hit:
        return "REJECT"
    if profile.pep_status:
        return "REVIEW"
    if missing:
        return "REVIEW"
    return "PASS"
```
```python
client_packet = """
Full name: Sarah Malik
DOB: 1982-04-11
Address: 12 King Street, London
Nationality: British
Tax residency: UK
Source of wealth: Family office distributions
PEP status: false
Sanctions hit: false
"""

extracted_profile = chain.invoke({
    "client_packet": client_packet,
    "format_instructions": parser.get_format_instructions()
})
rule_result = rule_check(extracted_profile)
print(rule_result)
print(extracted_profile.model_dump())
```
This hybrid approach is what you want in production. Rules handle mandatory controls; LangChain handles extraction and explanation.
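One way to wire the two halves together is to let the rules decide the status and spend an LLM call only on cases a human will read. A self-contained sketch over plain dicts; `hard_stops` mirrors `rule_check` above, and `explain` is a stand-in for the decision chain:

```python
def hard_stops(profile: dict) -> str:
    # Deterministic mirror of rule_check: sanctions are a hard stop;
    # PEP status or missing mandatory fields force manual review.
    mandatory = ["full_name", "date_of_birth", "address", "nationality"]
    if profile.get("sanctions_hit"):
        return "REJECT"
    if profile.get("pep_status") or any(not profile.get(f) for f in mandatory):
        return "REVIEW"
    return "PASS"

def final_disposition(profile: dict, explain=None) -> dict:
    # Rules set the status; the LLM (passed in as `explain`) only writes
    # a narrative for REVIEW cases, never for hard REJECTs.
    status = hard_stops(profile)
    narrative = explain(profile) if status == "REVIEW" and explain else None
    return {"status": status, "explanation": narrative}
```

This ordering guarantees the LLM can never upgrade a REJECT, and it also cuts cost: clean PASS cases and hard REJECTs never hit the model at all.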
4. Emit audit records for compliance review
Every case needs traceability. Store the input hash, model version, retrieved policy IDs, output JSON, and final disposition.
```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(client_packet: str, profile: KYCProfile, decision) -> dict:
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "input_sha256": hashlib.sha256(client_packet.encode()).hexdigest(),
        "profile": profile.model_dump(),
        "decision_text": decision.content if hasattr(decision, "content") else str(decision),
        "model": llm.model_name,
        "status": rule_check(profile),
    }

extracted_profile, decision = assess_case(client_packet)
record = audit_record(client_packet, extracted_profile, decision)
print(json.dumps(record, indent=2))
```
In a real system this record goes to immutable storage or your GRC platform. Compliance teams will ask for it eventually.
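If you do not yet have WORM storage or a GRC integration, an append-only JSONL log with hash chaining is a cheap local stand-in that at least makes tampering detectable. A minimal sketch; `append_audit` and the chaining convention are assumptions, not a LangChain feature:

```python
import hashlib
import json

def append_audit(path: str, record: dict, prev_hash: str = "0" * 64) -> str:
    # Chain each record to the previous record's hash so any later edit
    # to an earlier line invalidates every hash after it.
    record = dict(record, prev_hash=prev_hash)
    line = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256(line.encode()).hexdigest()
    with open(path, "a") as f:
        f.write(line + "\n")
    return digest  # feed this into the next append_audit call
```

A verifier can replay the file, recompute each line's hash, and compare it with the `prev_hash` stored in the following record.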
Production Considerations
- Data residency
  - Keep client packets and embeddings inside the required jurisdiction.
  - For cross-border wealth platforms, separate EU/UK/US stores and avoid sending raw PII across regions.
- Auditability
  - Log prompt templates, retrieved policy snippets, model version IDs, and final dispositions.
  - Make sure reviewers can reconstruct why a case was marked PASS or REVIEW.
- Guardrails
  - Use deterministic rules for sanctions hits, expired documents, missing mandatory fields, and age thresholds.
  - Never let the LLM override hard compliance controls.
- Monitoring
  - Track false positives on PEP/sanctions classification and manual review rates by region.
  - Watch for drift when onboarding patterns change after product launches or regulatory updates.
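The manual review rate is the single most useful drift signal, and it falls straight out of the audit log. A minimal sketch, assuming you can extract `(region, status)` pairs from stored records; the function name is illustrative:

```python
from collections import Counter

def review_rate_by_region(dispositions) -> dict:
    # dispositions: iterable of (region, status) pairs from the audit log.
    totals, reviews = Counter(), Counter()
    for region, status in dispositions:
        totals[region] += 1
        if status == "REVIEW":
            reviews[region] += 1
    return {r: reviews[r] / totals[r] for r in totals}
```

A sudden jump in one region's review rate after a product launch or rule change is usually the first sign that extraction or policy retrieval has drifted.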
Common Pitfalls
- Letting the model decide compliance outcomes end-to-end
  - Avoid this by splitting extraction from decisioning.
  - Use rules for hard stops and reserve LLM output for classification and summarization.
- Using generic prompts without firm policy grounding
  - A generic “is this client okay?” prompt will drift fast.
  - Retrieve internal KYC policies with `as_retriever()` so decisions reflect your actual operating model.
- Skipping structured outputs
  - Free-text responses are painful to validate and impossible to automate safely.
  - Use `PydanticOutputParser` so downstream code can enforce required fields before routing a case.
If you build it this way, LangChain becomes the orchestration layer around controlled compliance logic instead of a black box making risky calls on client onboarding. For wealth management teams handling sensitive PII and regulated decisions across jurisdictions, that is the difference between a useful agent and an audit problem.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit