How to Build a KYC Verification Agent Using LangChain in Python for Investment Banking

By Cyprian Aarons · Updated 2026-04-21

Tags: kyc-verification, langchain, python, investment-banking

A KYC verification agent automates the first pass of client due diligence: it extracts identity data from documents, checks it against policy rules, flags mismatches, and produces an audit-friendly decision trail. In investment banking, that matters because onboarding delays cost revenue, weak KYC creates regulatory exposure, and every decision needs to be explainable to compliance and audit teams.

Architecture

  • Document ingestion layer

    • Accepts passports, utility bills, corporate registration docs, source-of-funds statements, and beneficial ownership forms.
    • Normalizes PDFs, images, and text into a single text payload for downstream processing.
  • Extraction chain

    • Uses LangChain to pull structured fields like name, DOB, address, nationality, entity type, UBOs, and document expiry.
    • Outputs strict JSON so the rest of the workflow can validate deterministically.
  • Policy/rules engine

    • Applies bank-specific KYC rules: required fields present, sanction-sensitive geographies, expired documents, PEP indicators, mismatch thresholds.
    • Keeps hard compliance logic outside the model.
  • Risk classification step

    • Assigns a risk level: low, medium, high.
    • Explains why the case was escalated using evidence from the extracted data.
  • Audit logging layer

    • Stores prompts, model outputs, rule decisions, timestamps, document hashes, and reviewer overrides.
    • This is non-negotiable for investment banking audits.
  • Human review handoff

    • Routes exceptions to compliance analysts when confidence is low or policy violations are detected.
    • Prevents the agent from making final onboarding decisions on its own.
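
The ingestion layer above can be sketched in a few lines. This is an illustrative stub, not a library API: `normalize_document` is a hypothetical helper, and the PDF/OCR branches are deliberately left as extension points.

```python
from pathlib import Path

def normalize_document(path: Path) -> str:
    """Collapse heterogeneous inputs into a single text payload.

    Plain text passes through; PDF and image extraction are stubbed
    out here -- in practice you would plug in a parser (e.g. pypdf)
    or an OCR service behind these branches.
    """
    suffix = path.suffix.lower()
    if suffix in {".txt", ".md"}:
        return path.read_text(encoding="utf-8")
    if suffix == ".pdf":
        raise NotImplementedError("plug in a PDF text extractor here")
    if suffix in {".png", ".jpg", ".jpeg"}:
        raise NotImplementedError("plug in an OCR step here")
    raise ValueError(f"unsupported document type: {suffix}")
```

Keeping this boundary explicit means every downstream step (extraction, rules, audit) only ever sees text, which simplifies hashing and logging.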

Implementation

1) Install the core dependencies

Use LangChain with a model provider that supports structured output. For this example I’ll use OpenAI via langchain-openai, plus Pydantic for schema validation.

pip install langchain langchain-openai pydantic python-dotenv

Set your API key in the environment:

export OPENAI_API_KEY="your-key"

2) Define the KYC schema and extraction chain

The key pattern here is ChatPromptTemplate + with_structured_output(). That gives you typed output instead of brittle free-form text parsing.

from typing import List, Optional
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

class KYCProfile(BaseModel):
    full_name: str = Field(..., description="Customer or entity name")
    date_of_birth: Optional[str] = Field(None, description="YYYY-MM-DD")
    nationality: Optional[str] = None
    address: Optional[str] = None
    document_type: Optional[str] = None
    document_number: Optional[str] = None
    expiry_date: Optional[str] = None
    pep_indicator: bool = False
    sanctions_risk_countries: List[str] = Field(default_factory=list)
    beneficial_owners: List[str] = Field(default_factory=list)
    notes: str = ""

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract KYC fields from the provided customer onboarding document. "
               "Return only data supported by the text. If unknown, use null."),
    ("human", "{document_text}")
])

kyc_extractor = prompt | llm.with_structured_output(KYCProfile)

3) Add deterministic compliance checks

This is where you keep regulatory logic out of the model. The LLM extracts; Python decides.

from datetime import date

HIGH_RISK_COUNTRIES = {"IR", "KP", "SY"}
REQUIRED_FIELDS = ["full_name", "document_type", "document_number"]

def evaluate_kyc(profile: KYCProfile) -> dict:
    issues = []

    for field in REQUIRED_FIELDS:
        if not getattr(profile, field):
            issues.append(f"missing_required_field:{field}")

    if profile.sanctions_risk_countries:
        risky = [c for c in profile.sanctions_risk_countries if c in HIGH_RISK_COUNTRIES]
        if risky:
            issues.append(f"sanctions_country_match:{','.join(risky)}")

    if profile.pep_indicator:
        issues.append("pep_indicator_true")

    if profile.expiry_date:
        try:
            if date.fromisoformat(profile.expiry_date) < date.today():
                issues.append("document_expired")
        except ValueError:
            issues.append("invalid_expiry_date_format")

    risk_level = "low"
    if any(i.startswith("sanctions_country_match") or i == "pep_indicator_true" for i in issues):
        risk_level = "high"
    elif issues:
        risk_level = "medium"

    return {
        "risk_level": risk_level,
        "issues": issues,
        "needs_human_review": risk_level != "low",
    }
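
The same pattern extends to the cross-document "mismatch thresholds" mentioned in the architecture. A hedged sketch of a name-mismatch rule, using stdlib fuzzy matching; the 0.85 threshold and normalization are illustrative only, not regulatory guidance, and a real deployment would use a vetted matching library:

```python
from difflib import SequenceMatcher

def name_mismatch_issue(declared: str, extracted: str, threshold: float = 0.85) -> list:
    """Flag a mismatch when the declared name and the name extracted
    from the document diverge beyond a similarity threshold."""
    # Normalize case and collapse whitespace before comparing.
    a = " ".join(declared.lower().split())
    b = " ".join(extracted.lower().split())
    ratio = SequenceMatcher(None, a, b).ratio()
    if ratio < threshold:
        return [f"name_mismatch:ratio={ratio:.2f}"]
    return []
```

Because this returns the same `issue-code` strings as `evaluate_kyc`, its output can be appended to the `issues` list and feed the same risk-level logic.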

4) Run the agent end-to-end and persist an audit trail

In production, each case should produce a reproducible record. At minimum, store the input hash, extracted output, rule results, a timestamp, and the final disposition.

import json
import hashlib
from datetime import datetime, timezone
from pathlib import Path

AUDIT_DIR = Path("./audit_logs")
AUDIT_DIR.mkdir(exist_ok=True)

def hash_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def process_kyc_case(document_text: str) -> dict:
    profile = kyc_extractor.invoke({"document_text": document_text})
    assessment = evaluate_kyc(profile)

    record = {
        "input_hash": hash_text(document_text),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "profile": profile.model_dump(),
        "assessment": assessment,
        "final_decision": "review" if assessment["needs_human_review"] else "approve",
    }

    path = AUDIT_DIR / f"{record['input_hash']}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record

sample_doc = """
Customer Name: Amina Khan
Date of Birth: 1990-04-11
Nationality: GB
Address: 10 King Street London SW1A 1AA
Document Type: Passport
Document Number: UK1234567
Expiry Date: 2028-06-30
PEP Indicator: No
Beneficial Owners: None disclosed
"""

result = process_kyc_case(sample_doc)
print(result)

If you want a cleaner orchestration layer later, wrap these steps in a LangChain RunnableSequence and attach tracing through LangSmith. The important part is that extraction remains model-driven while compliance decisions stay deterministic.

Production Considerations

  • Deploy in-region for data residency

    • Investment banks often require customer data to stay within specific jurisdictions.
    • Keep model endpoints, vector stores, logs, and audit artifacts in approved regions only.
  • Trace everything

    • Enable LangSmith tracing or equivalent internal observability.
    • Log prompt versioning, model versioning, output schema versioning, and reviewer overrides so audit can reconstruct every decision.
  • Add hard guardrails

    • Never let the LLM approve onboarding directly.
    • Use allowlisted outputs via Pydantic schemas and block any response that fails validation or contains unsupported claims.
  • Separate PII from analytics

    • Redact or tokenize sensitive fields before sending them to non-production tools.
    • Limit access by role; compliance analysts need case details but engineers do not need raw passport numbers in dashboards.
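
A minimal sketch of tokenizing a sensitive field before it reaches analytics or dashboards. The `tokenize_pii` helper and inline salt are illustrative; production systems would pull the key from a managed secret store and use a vetted pseudonymization scheme:

```python
import hashlib
import hmac

def tokenize_pii(value: str, salt: bytes) -> str:
    """Replace a sensitive value with a stable, non-reversible token.

    HMAC-SHA256 keeps tokens consistent across records (so analytics
    can still join on them) without exposing the raw value; the salt
    must never be deployed to the analytics environment.
    """
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
```

The same document number always maps to the same token, so engineers can debug pipelines and count cases without ever seeing a raw passport number.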

Common Pitfalls

  1. Using the model as the decision-maker

    • Mistake: asking the LLM to decide “approve” or “reject” without rules.
    • Fix: make it extract facts only; let Python policy code decide based on bank-approved thresholds.
  2. Accepting free-form output

    • Mistake: parsing plain text with regex after the fact.
    • Fix: use with_structured_output() with a Pydantic schema so invalid responses fail fast.
  3. Ignoring audit requirements

    • Mistake: storing only the final result.
    • Fix: persist input hashes, extracted fields, rule hits, timestamps, reviewer actions, and model/version metadata.
  4. Sending sensitive data to uncontrolled services

    • Mistake: routing KYC docs through random third-party tools or unapproved regions.
    • Fix: enforce approved infrastructure only and align storage/processing with compliance and data residency requirements.
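
To show pitfall 2's fix in miniature: a fail-fast check that rejects a response missing required keys instead of regex-scraping it afterwards. This stdlib stand-in illustrates the principle that `with_structured_output()` plus a Pydantic schema gives you for free; `parse_or_fail` is a hypothetical helper, not a library function:

```python
import json

REQUIRED_KEYS = {"full_name", "document_type", "document_number"}

def parse_or_fail(raw: str) -> dict:
    """Parse a model response as JSON and reject it unless every
    required field is present -- fail fast, don't patch up later."""
    data = json.loads(raw)  # raises ValueError on non-JSON output
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data
```

An invalid response dies loudly at the extraction boundary, where it is cheap to retry, rather than surfacing as a silently incomplete KYC record downstream.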

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
