How to Build a KYC Verification Agent Using LangChain in Python for Investment Banking
A KYC verification agent automates the first pass of client due diligence: it extracts identity data from documents, checks it against policy rules, flags mismatches, and produces an audit-friendly decision trail. In investment banking, that matters because onboarding delays cost revenue, weak KYC creates regulatory exposure, and every decision needs to be explainable to compliance and audit teams.
Architecture
- Document ingestion layer
  - Accepts passports, utility bills, corporate registration docs, source-of-funds statements, and beneficial ownership forms.
  - Normalizes PDFs, images, and text into a single text payload for downstream processing.
- Extraction chain
  - Uses LangChain to pull structured fields such as name, DOB, address, nationality, entity type, UBOs, and document expiry.
  - Outputs strict JSON so the rest of the workflow can validate deterministically.
- Policy/rules engine
  - Applies bank-specific KYC rules: required fields present, sanction-sensitive geographies, expired documents, PEP indicators, mismatch thresholds.
  - Keeps hard compliance logic outside the model.
- Risk classification step
  - Assigns a risk level: low, medium, or high.
  - Explains why the case was escalated using evidence from the extracted data.
- Audit logging layer
  - Stores prompts, model outputs, rule decisions, timestamps, document hashes, and reviewer overrides.
  - This is non-negotiable for investment banking audits.
- Human review handoff
  - Routes exceptions to compliance analysts when confidence is low or policy violations are detected.
  - Prevents the agent from making final onboarding decisions on its own.
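The layers above can be sketched as a thin, typed pipeline before any LangChain code enters the picture. This is an illustrative skeleton, not part of any library: the names `IngestedDocument`, `CaseResult`, and `run_pipeline` are placeholders, and the model-driven extraction and rule checks are passed in as plain callables.

```python
from dataclasses import dataclass, field

@dataclass
class IngestedDocument:
    text: str  # normalized text payload from the ingestion layer

@dataclass
class CaseResult:
    risk_level: str                       # "low" | "medium" | "high"
    issues: list = field(default_factory=list)
    needs_human_review: bool = False

def classify(issues: list) -> CaseResult:
    """Risk classification step: deterministic rules decide, not the model."""
    if any(i.startswith("sanctions") or i.startswith("pep") for i in issues):
        level = "high"
    elif issues:
        level = "medium"
    else:
        level = "low"
    return CaseResult(risk_level=level, issues=issues,
                      needs_human_review=(level != "low"))

def run_pipeline(doc: IngestedDocument, extract, check) -> CaseResult:
    """extract: model-driven field extraction; check: policy rules -> issue list."""
    profile = extract(doc.text)
    return run_checks_and_classify(profile, check)

def run_checks_and_classify(profile, check) -> CaseResult:
    return classify(check(profile))
```

Keeping the stages as separate functions makes each one swappable and testable in isolation, which is exactly what the human-review handoff relies on.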
Implementation
1) Install the core dependencies
Use LangChain with a model provider that supports structured output. For this example I’ll use OpenAI via langchain-openai, plus Pydantic for schema validation.
```bash
pip install langchain langchain-openai pydantic python-dotenv
```

Set your API key in the environment:

```bash
export OPENAI_API_KEY="your-key"
```
2) Define the KYC schema and extraction chain
The key pattern here is `ChatPromptTemplate` + `with_structured_output()`. That gives you typed output instead of brittle free-form text parsing.
```python
from typing import List, Optional

from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

class KYCProfile(BaseModel):
    full_name: str = Field(..., description="Customer or entity name")
    date_of_birth: Optional[str] = Field(None, description="YYYY-MM-DD")
    nationality: Optional[str] = None
    address: Optional[str] = None
    document_type: Optional[str] = None
    document_number: Optional[str] = None
    expiry_date: Optional[str] = None
    pep_indicator: bool = False
    sanctions_risk_countries: List[str] = []
    beneficial_owners: List[str] = []
    notes: str = ""

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Extract KYC fields from the provided customer onboarding document. "
     "Return only data supported by the text. If unknown, use null."),
    ("human", "{document_text}"),
])

kyc_extractor = prompt | llm.with_structured_output(KYCProfile)
```
3) Add deterministic compliance checks
This is where you keep regulatory logic out of the model. The LLM extracts; Python decides.
```python
from datetime import date

HIGH_RISK_COUNTRIES = {"IR", "KP", "SY"}
REQUIRED_FIELDS = ["full_name", "document_type", "document_number"]

def evaluate_kyc(profile: KYCProfile) -> dict:
    issues = []

    for field in REQUIRED_FIELDS:
        if not getattr(profile, field):
            issues.append(f"missing_required_field:{field}")

    if profile.sanctions_risk_countries:
        risky = [c for c in profile.sanctions_risk_countries if c in HIGH_RISK_COUNTRIES]
        if risky:
            issues.append(f"sanctions_country_match:{','.join(risky)}")

    if profile.pep_indicator:
        issues.append("pep_indicator_true")

    if profile.expiry_date:
        try:
            exp_year, exp_month, exp_day = map(int, profile.expiry_date.split("-"))
            if date(exp_year, exp_month, exp_day) < date.today():
                issues.append("document_expired")
        except ValueError:
            issues.append("invalid_expiry_date_format")

    risk_level = "low"
    if any(i.startswith("sanctions_country_match") or i == "pep_indicator_true" for i in issues):
        risk_level = "high"
    elif issues:
        risk_level = "medium"

    return {
        "risk_level": risk_level,
        "issues": issues,
        "needs_human_review": risk_level != "low",
    }
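As a side note, the manual year/month/day split in the expiry check can be replaced with the standard library's `date.fromisoformat`, which enforces the `YYYY-MM-DD` format and raises `ValueError` on anything else, matching the behavior the manual parser is aiming for:

```python
from datetime import date

def is_expired(expiry: str) -> bool:
    """True if the ISO-format expiry date is in the past.
    Raises ValueError for malformed dates, mirroring the manual parser."""
    return date.fromisoformat(expiry) < date.today()
```

Either approach works; the one-liner just has fewer places for an off-by-one bug to hide.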
4) Run the agent end-to-end and persist an audit trail
For production you want each case to produce a reproducible record. At minimum store input hash, extracted output, rule results, and final disposition.
```python
import json
import hashlib
from pathlib import Path

AUDIT_DIR = Path("./audit_logs")
AUDIT_DIR.mkdir(exist_ok=True)

def hash_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def process_kyc_case(document_text: str) -> dict:
    profile = kyc_extractor.invoke({"document_text": document_text})
    assessment = evaluate_kyc(profile)

    record = {
        "input_hash": hash_text(document_text),
        "profile": profile.model_dump(),
        "assessment": assessment,
        "final_decision": "review" if assessment["needs_human_review"] else "approve",
    }

    path = AUDIT_DIR / f"{record['input_hash']}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record

sample_doc = """
Customer Name: Amina Khan
Date of Birth: 1990-04-11
Nationality: GB
Address: 10 King Street London SW1A 1AA
Document Type: Passport
Document Number: UK1234567
Expiry Date: 2028-06-30
PEP Indicator: No
Beneficial Owners: None disclosed
"""

result = process_kyc_case(sample_doc)
print(result)
```
If you want a cleaner orchestration layer later, wrap these steps in a LangChain `RunnableSequence` and attach tracing through LangSmith. The important part is that extraction remains model-driven while compliance decisions stay deterministic.
Production Considerations
- Deploy in-region for data residency
  - Investment banks often require customer data to stay within specific jurisdictions.
  - Keep model endpoints, vector stores, logs, and audit artifacts in approved regions only.
- Trace everything
  - Enable LangSmith tracing or equivalent internal observability.
  - Log prompt versioning, model versioning, output schema versioning, and reviewer overrides so audit can reconstruct every decision.
- Add hard guardrails
  - Never let the LLM approve onboarding directly.
  - Use allowlisted outputs via Pydantic schemas and block any response that fails validation or contains unsupported claims.
- Separate PII from analytics
  - Redact or tokenize sensitive fields before sending them to non-production tools.
  - Limit access by role; compliance analysts need case details, but engineers do not need raw passport numbers in dashboards.
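One minimal way to tokenize sensitive fields before they reach analytics tools is a keyed hash. This is a stdlib sketch only: the `SENSITIVE_FIELDS` set and the inline `SECRET_KEY` are illustrative, and a real deployment would pull the key from a KMS/HSM rather than code.

```python
import hmac
import hashlib

# Illustrative field list; adapt to your own schema.
SENSITIVE_FIELDS = {"document_number", "date_of_birth", "address"}
# Placeholder only: in production this key lives in a KMS/HSM, never in code.
SECRET_KEY = b"replace-with-kms-managed-key"

def tokenize(value: str) -> str:
    """Deterministic, non-reversible token for a sensitive value."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def redact_for_analytics(record: dict) -> dict:
    """Replace sensitive string fields with tokens; leave everything else untouched."""
    return {
        k: (tokenize(v) if k in SENSITIVE_FIELDS and isinstance(v, str) else v)
        for k, v in record.items()
    }
```

Because the tokens are deterministic, dashboards can still group and count by customer document without ever holding the raw value.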
Common Pitfalls
- Using the model as the decision-maker
  - Mistake: asking the LLM to decide "approve" or "reject" without rules.
  - Fix: make it extract facts only; let Python policy code decide based on bank-approved thresholds.
- Accepting free-form output
  - Mistake: parsing plain text with regex after the fact.
  - Fix: use `with_structured_output()` with a Pydantic schema so invalid responses fail fast.
- Ignoring audit requirements
  - Mistake: storing only the final result.
  - Fix: persist input hashes, extracted fields, rule hits, timestamps, reviewer actions, and model/version metadata.
- Sending sensitive data to uncontrolled services
  - Mistake: routing KYC docs through random third-party tools or unapproved regions.
  - Fix: enforce approved infrastructure only and align storage/processing with compliance and data residency requirements.
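The "fail fast" idea can be enforced at the workflow boundary even without the model in the loop. This is a hypothetical gate function (the name `validate_or_block` and the field tuple are illustrative) that rejects any payload missing required fields instead of letting a partial record flow downstream:

```python
REQUIRED_KYC_FIELDS = ("full_name", "document_type", "document_number")

def validate_or_block(payload: dict) -> dict:
    """Raise ValueError for any payload missing a required field,
    so incomplete records never reach the policy engine silently."""
    missing = [f for f in REQUIRED_KYC_FIELDS if not payload.get(f)]
    if missing:
        raise ValueError(f"blocked: missing fields {missing}")
    return payload
```

An exception here is a feature, not a bug: it forces the case into the human-review path rather than producing a quiet partial approval.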
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit