How to Build a KYC Verification Agent Using LangChain in Python for Retail Banking

By Cyprian Aarons · Updated 2026-04-21
Tags: kyc-verification, langchain, python, retail-banking

A KYC verification agent in retail banking collects customer identity data, checks it against policy and external sources, and decides whether the case is clear, needs manual review, or must be rejected. It matters because onboarding speed is a business metric, but compliance quality is a regulatory requirement.

Architecture

  • Customer intake layer

    • Accepts structured inputs: name, DOB, address, national ID, phone, email.
    • Normalizes formats before any model call.
  • Document extraction layer

    • Pulls text from uploaded IDs, utility bills, bank statements, or selfies with OCR/vision tooling.
    • Converts unstructured documents into machine-readable fields.
  • Policy and rules engine

    • Encodes KYC rules: required fields, country-specific checks, sanctions escalation thresholds.
    • Keeps deterministic decisions out of the LLM.
  • LangChain orchestration layer

    • Uses ChatPromptTemplate with Runnable (LCEL) composition to classify cases and generate review notes.
    • Calls tools for sanctions screening, address validation, and internal customer lookup.
  • Audit and evidence store

    • Persists input payloads, model outputs, tool results, timestamps, and final decisions.
    • Supports regulator review and internal QA.
  • Human review queue

    • Routes ambiguous or high-risk cases to an analyst.
    • Captures analyst override reasons for retraining and policy tuning.
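The intake layer's normalization step can be sketched as a plain function that runs before anything touches a model. This is a minimal sketch: `normalize_intake`, the accepted DOB formats, and the phone handling are assumptions to adapt to your actual channels.

```python
import re
from datetime import datetime

def normalize_intake(raw: dict) -> dict:
    """Normalize customer intake fields before any model call.

    Assumes DOB arrives as YYYY-MM-DD or DD/MM/YYYY; extend for your channels.
    """
    normalized = dict(raw)
    # Collapse repeated whitespace in the name
    normalized["full_name"] = " ".join(raw.get("full_name", "").split())
    # Canonicalize DOB to ISO 8601; leave it untouched if unparseable
    dob = raw.get("date_of_birth", "").strip()
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            normalized["date_of_birth"] = datetime.strptime(dob, fmt).date().isoformat()
            break
        except ValueError:
            continue
    # Keep only digits, preserving a leading + for international numbers
    phone = raw.get("phone", "")
    digits = re.sub(r"\D", "", phone)
    normalized["phone"] = ("+" + digits) if phone.strip().startswith("+") else digits
    return normalized
```

Normalizing first means the rules engine and the LLM both see one canonical shape, so a format quirk can never masquerade as a missing field.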

Implementation

1) Define the KYC schema and deterministic checks

Keep the output structured. In banking, free-text answers are a liability because you need traceability and stable downstream processing.

from pydantic import BaseModel, Field
from typing import Literal, Optional

class KYCInput(BaseModel):
    full_name: str
    date_of_birth: str
    address: str
    country: str
    id_number: str
    source_of_funds: Optional[str] = None

class KYCDecision(BaseModel):
    status: Literal["approved", "review", "rejected"]
    reason: str
    risk_score: int = Field(ge=0, le=100)
    missing_fields: list[str] = []

def basic_kyc_rules(data: KYCInput) -> list[str]:
    missing = []
    if not data.full_name.strip():
        missing.append("full_name")
    if not data.date_of_birth.strip():
        missing.append("date_of_birth")
    if not data.address.strip():
        missing.append("address")
    if not data.id_number.strip():
        missing.append("id_number")
    return missing

2) Build a LangChain prompt that enforces policy-driven output

Use ChatPromptTemplate plus structured output. For production banking workflows, do not ask the model to “decide freely”; constrain it to your schema.

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a KYC verification assistant for a retail bank. "
     "Follow the bank policy strictly. "
     "If required fields are missing or risk indicators are present, return review or rejected. "
     "Never invent facts."),
    ("human",
     "Customer data:\n"
     "{customer_json}\n\n"
     "Policy notes:\n"
     "- Missing mandatory identity fields => review\n"
     "- Suspicious source of funds language => review\n"
     "- Clearly complete low-risk profile => approved\n"
     "- Output must match the schema exactly.")
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm = llm.with_structured_output(KYCDecision)
chain = prompt | structured_llm
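Structured output can still fail at runtime (schema validation errors, provider outages), and a failed model call must never auto-approve. A fail-closed wrapper can be sketched independently of LangChain so it composes with any invoke callable; `safe_decide` is a hypothetical helper, and the fallback is a plain dict here only to keep the sketch standalone where the real pipeline would return a KYCDecision.

```python
def safe_decide(invoke, payload, fallback_reason="Model output failed validation"):
    """Call the decision chain; on any error, fail closed to manual review."""
    try:
        return invoke(payload)
    except Exception as exc:  # deliberate catch-all: unknown failure => review
        return {
            "status": "review",
            "reason": f"{fallback_reason}: {exc}",
            "risk_score": 60,
            "missing_fields": [],
        }
```

Wrapping `chain.invoke` this way routes transient API failures into the analyst queue instead of dropping or approving the case.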

3) Add tools for screening and wrap everything in an orchestrator

This pattern uses a deterministic pre-check first, then LangChain for judgment on ambiguous cases. That keeps obvious failures out of the model path.

import json
from langchain_core.runnables import RunnableLambda

def screen_case(customer: KYCInput) -> KYCDecision:
    missing = basic_kyc_rules(customer)
    if missing:
        return KYCDecision(
            status="review",
            reason=f"Missing mandatory fields: {', '.join(missing)}",
            risk_score=70,
            missing_fields=missing,
        )

    # Example of rule-based escalation before LLM decision
    if customer.country.lower() in {"high-risk-country-x", "sanctioned-jurisdiction-y"}:
        return KYCDecision(
            status="rejected",
            reason="Country requires enhanced due diligence / blocked by policy",
            risk_score=95,
            missing_fields=[],
        )

    result = chain.invoke({"customer_json": customer.model_dump_json(indent=2)})
    return result

kyc_agent = RunnableLambda(screen_case)

sample = KYCInput(
    full_name="Amina Yusuf",
    date_of_birth="1991-04-18",
    address="12 Cedar Road, Lagos",
    country="NG",
    id_number="NIN12345678901",
    source_of_funds="Salary from employment"
)

decision = kyc_agent.invoke(sample)
print(decision.model_dump())
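The human review queue from the architecture can consume these decisions through a small deterministic router. A sketch, assuming a risk threshold of 40 and placeholder queue names; both are assumptions to tune against analyst outcomes.

```python
REVIEW_RISK_THRESHOLD = 40  # assumed; calibrate against analyst override rates

def route_case(status: str, risk_score: int) -> str:
    """Route a decided case to its downstream queue.

    Even an 'approved' case above the risk threshold goes to an analyst,
    so the model can never single-handedly clear a risky profile.
    """
    if status == "rejected":
        return "closure_workflow"
    if status == "review" or risk_score >= REVIEW_RISK_THRESHOLD:
        return "analyst_queue"
    return "auto_onboarding"
```

Keeping routing out of the LLM means the escalation policy is testable line by line.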

4) Persist an audit trail with every decision

Retail banking needs evidence. Store the input snapshot, model version, prompt version, rule outcomes, and final decision in an immutable log or append-only store.

from datetime import datetime, timezone

def audit_record(customer: KYCInput, decision: KYCDecision) -> dict:
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "customer": customer.model_dump(),
        "decision": decision.model_dump(),
        "model": "gpt-4o-mini",
        "workflow": "kyc-verification-v1",
        "compliance_tags": ["kyc", "retail-banking", "audit-trail"],
    }

record = audit_record(sample, decision)
print(json.dumps(record, indent=2))

Production Considerations

  • Keep PII inside your controlled boundary

    • Redact unnecessary fields before sending text to the LLM.
    • Use region-bound deployment for data residency requirements.
  • Separate rules from model judgment

    • Sanctions hits, mandatory field checks, age thresholds, and jurisdiction blocks should be deterministic.
    • The LLM should handle classification and explanation only.
  • Log everything needed for audit

    • Store prompt version, model version, tool outputs, final status, analyst overrides.
    • Make logs immutable and searchable by case ID.
  • Add human-in-the-loop routing

    • Any mismatch between document OCR and user-entered data should go to manual review.
    • High-risk geographies or unusual source-of-funds language should never auto-approve.
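The PII-boundary point can be made concrete with an allowlist-based redaction step run before prompt construction. A sketch only: `PROMPT_FIELDS`, the field choices, and the masking format are assumptions to align with your data-protection policy.

```python
# Hypothetical allowlist: only these fields ever reach the model prompt
PROMPT_FIELDS = {"country", "source_of_funds", "missing_fields"}

def redact_for_prompt(customer: dict) -> dict:
    """Keep allowlisted fields; mask identifiers that must still be referenced."""
    redacted = {k: v for k, v in customer.items() if k in PROMPT_FIELDS}
    if "id_number" in customer:
        idnum = str(customer["id_number"])
        # Keep first 3 and last 2 characters so analysts can cross-reference
        redacted["id_number_masked"] = idnum[:3] + "*" * max(len(idnum) - 5, 0) + idnum[-2:]
    return redacted
```

An allowlist fails safe: a new field added upstream stays out of prompts until someone deliberately adds it.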

Common Pitfalls

  • Letting the LLM make final compliance decisions

    • Avoid this by using policy gates before model invocation.
    • The model should not override sanctions logic or mandatory field checks.
  • Sending raw sensitive documents into prompts

    • Strip out irrelevant PII and tokenize identifiers where possible.
    • Only pass extracted fields needed for the decision.
  • Skipping version control on prompts and policies

    • If you cannot reproduce a decision later, you do not have an auditable system.
    • Version prompts like code and tie each decision to a specific policy bundle.
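Tying each decision to a specific prompt and policy bundle can be as simple as a content hash recorded in the audit log. A sketch; `prompt_fingerprint` is a hypothetical helper, and truncating the digest to 16 hex characters is an arbitrary choice for log readability.

```python
import hashlib

def prompt_fingerprint(prompt_text: str, policy_bundle: str) -> str:
    """Deterministic ID binding a decision to exact prompt + policy text."""
    h = hashlib.sha256()
    h.update(prompt_text.encode("utf-8"))
    h.update(b"\x00")  # separator so (a, bc) and (ab, c) hash differently
    h.update(policy_bundle.encode("utf-8"))
    return h.hexdigest()[:16]
```

Store this fingerprint in every audit record; if either the prompt or the policy bundle changes, the fingerprint changes, and each historical decision remains reproducible against its exact inputs.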

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

