How to Build a KYC Verification Agent Using LangChain in Python for Fintech

By Cyprian Aarons · Updated 2026-04-21
Tags: kyc-verification · langchain · python · fintech

A KYC verification agent automates the first pass of customer due diligence: it extracts identity data from submitted documents, checks completeness, flags mismatches, and routes risky cases for manual review. For fintech, this matters because onboarding speed and compliance are in direct tension, and a good agent reduces false positives without weakening auditability or regulatory controls.

Architecture

  • Document ingestion layer

    • Accepts PDFs, images, and structured form payloads.
    • Normalizes them into text and metadata for downstream checks.
  • Extraction chain

    • Uses an LLM to extract KYC fields like full name, DOB, address, document number, and expiry date.
    • Returns structured output, not free-form text.
  • Policy validation layer

    • Compares extracted fields against business rules.
    • Checks mandatory fields, document freshness, country restrictions, and format constraints.
  • Risk scoring / decision router

    • Assigns a decision: approve, manual_review, or reject.
    • Routes edge cases to human ops instead of forcing an automated answer.
  • Audit logging layer

    • Stores input hashes, model outputs, prompts, policy decisions, timestamps, and reviewer actions.
    • Supports compliance review and post-incident analysis.
  • Human-in-the-loop review queue

    • Handles uncertain cases.
    • Keeps the final decision under controlled operational oversight.
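
The ingestion layer's job can be sketched as a thin dispatch boundary. The `Document` type and `normalize` helper below are illustrative assumptions, not part of any real ingestion library; a production version would call a PDF/OCR library in the non-text branches:

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Document:
    """Normalized output of the ingestion layer (illustrative)."""
    text: str
    metadata: dict = field(default_factory=dict)


def normalize(payload: bytes, content_type: str) -> Document:
    """Dispatch by content type; only plain text is implemented here.
    PDF and image branches would call a PDF/OCR library in practice."""
    doc_hash = hashlib.sha256(payload).hexdigest()
    if content_type == "text/plain":
        return Document(
            text=payload.decode("utf-8"),
            metadata={"content_type": content_type, "sha256": doc_hash},
        )
    raise NotImplementedError(f"no handler for {content_type}")


doc = normalize(b"Passport No: X1234567", "text/plain")
print(doc.metadata["content_type"])
```

Recording the content hash at ingestion time means every downstream audit event can be tied back to the exact bytes the customer submitted.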

Implementation

1) Install dependencies and define the KYC schema

Use LangChain’s structured output so the model returns validated fields. In production you want typed outputs because downstream compliance logic should not parse prose.

pip install langchain langchain-openai pydantic

from datetime import date
from typing import Literal, Optional

from pydantic import BaseModel, Field


class KYCResult(BaseModel):
    full_name: str = Field(description="Customer legal name")
    date_of_birth: str = Field(description="Date of birth in YYYY-MM-DD")
    document_type: Literal["passport", "national_id", "driver_license"]
    document_number: str
    expiry_date: Optional[str] = Field(default=None, description="YYYY-MM-DD if present")
    country_of_issue: str
    address: Optional[str] = None
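
Because the schema is a Pydantic model, a malformed extraction fails loudly instead of flowing silently into compliance logic. A quick standalone check (the class is repeated here so the snippet runs on its own):

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError


class KYCResult(BaseModel):  # same schema as above, repeated to run standalone
    full_name: str = Field(description="Customer legal name")
    date_of_birth: str = Field(description="Date of birth in YYYY-MM-DD")
    document_type: Literal["passport", "national_id", "driver_license"]
    document_number: str
    expiry_date: Optional[str] = Field(default=None, description="YYYY-MM-DD if present")
    country_of_issue: str
    address: Optional[str] = None


# A valid record parses cleanly.
ok = KYCResult(
    full_name="Amina Yusuf",
    date_of_birth="1991-04-18",
    document_type="passport",
    document_number="X1234567",
    country_of_issue="Kenya",
)
print(ok.document_type)  # passport

# An out-of-vocabulary document_type is rejected by the Literal constraint.
try:
    KYCResult(
        full_name="Amina Yusuf",
        date_of_birth="1991-04-18",
        document_type="visa",  # not an allowed value
        document_number="X1234567",
        country_of_issue="Kenya",
    )
except ValidationError:
    print("rejected invalid document_type")
```

This is exactly the failure mode you want: an explicit exception at the service boundary, not a bad value deep inside the policy engine.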

2) Build the extraction chain with ChatOpenAI and with_structured_output

This is the core pattern. The model extracts fields from raw KYC text into a Pydantic object that your policy engine can consume directly.

import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    api_key=os.environ["OPENAI_API_KEY"],
)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a KYC extraction engine for a fintech onboarding workflow. "
     "Extract only factual data from the provided document text. "
     "Do not infer missing values."),
    ("human", "Document text:\n\n{document_text}")
])

structured_llm = llm.with_structured_output(KYCResult)
extract_chain = prompt | structured_llm

sample_text = """
Passport No: X1234567
Name: Amina Yusuf
DOB: 1991-04-18
Nationality: KE
Country of Issue: Kenya
Expiry Date: 2029-08-31
Address: 14 Riverside Drive, Nairobi
"""

result = extract_chain.invoke({"document_text": sample_text})
print(result.model_dump())
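
Structured output can still fail at runtime (a malformed tool call, a refusal, a network error), and the onboarding flow should degrade to manual review rather than crash. The `safe_extract` helper below is an illustrative pattern, not a LangChain API, and the failing stub stands in for `extract_chain.invoke` so the sketch runs offline:

```python
from typing import Callable


def safe_extract(chain_invoke: Callable[[dict], object], payload: dict) -> dict:
    """Call the extraction chain; on any failure, return a manual_review
    routing decision instead of raising into the onboarding flow."""
    try:
        return {"status": "ok", "result": chain_invoke(payload)}
    except Exception as exc:  # parse, API, or validation errors
        return {"status": "manual_review", "error": type(exc).__name__}


# Stub standing in for extract_chain.invoke, to keep the sketch runnable offline.
def failing_invoke(_payload: dict) -> object:
    raise ValueError("model returned unparseable output")


print(safe_extract(failing_invoke, {"document_text": "..."}))
```

In the real pipeline you would pass `extract_chain.invoke` as `chain_invoke` and feed the `manual_review` status into the same decision router as the policy checks.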

3) Add deterministic policy checks outside the model

Do not ask the LLM to decide compliance. Keep policy rules in Python so they are testable and auditable.

from datetime import date, datetime


def validate_kyc(result: KYCResult) -> dict:
    issues = []

    required_fields = [
        result.full_name,
        result.date_of_birth,
        result.document_type,
        result.document_number,
        result.country_of_issue,
    ]
    if any(not field for field in required_fields):
        issues.append("missing_required_field")

    try:
        dob = datetime.strptime(result.date_of_birth, "%Y-%m-%d").date()
        if dob > date.today():
            issues.append("invalid_dob")
    except ValueError:
        issues.append("dob_format_invalid")

    if result.expiry_date:
        try:
            expiry = datetime.strptime(result.expiry_date, "%Y-%m-%d").date()
            if expiry <= date.today():
                issues.append("document_expired")
        except ValueError:
            issues.append("expiry_format_invalid")

    decision = "approve"
    if issues:
        decision = "manual_review" if len(issues) <= 2 else "reject"

    return {
        "decision": decision,
        "issues": issues,
        "kyc_record": result.model_dump(),
    }


policy_result = validate_kyc(result)
print(policy_result)
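
The same pattern extends to the other rules mentioned in this guide, such as country restrictions. A sketch, where the restricted-country set is a placeholder for illustration only; real deployments load a current sanctions/restriction list from a compliance data source:

```python
# Placeholder for illustration only, not a real sanctions list.
RESTRICTED_COUNTRIES = {"Examplestan"}


def check_country(country_of_issue: str) -> list:
    """Return policy issues for the issuing country, if any."""
    if country_of_issue.strip() in RESTRICTED_COUNTRIES:
        return ["restricted_country_of_issue"]
    return []


print(check_country("Kenya"))        # []
print(check_country("Examplestan"))  # ['restricted_country_of_issue']
```

Each rule returning a list of issue codes keeps the checks composable: `validate_kyc` can simply extend its `issues` list with the output of every rule.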

4) Wrap it into a runnable onboarding function with audit-friendly outputs

This gives you a clean service boundary. The function returns both machine-readable output and enough metadata for logging.

import hashlib
import json


def kyc_pipeline(document_text: str) -> dict:
    doc_hash = hashlib.sha256(document_text.encode("utf-8")).hexdigest()

    extracted = extract_chain.invoke({"document_text": document_text})
    validation = validate_kyc(extracted)

    audit_event = {
        "document_hash": doc_hash,
        "model": llm.model_name,  # read from the configured client to avoid drift
        "decision": validation["decision"],
        "issues": validation["issues"],
        "record": validation["kyc_record"],
    }

    return audit_event


event = kyc_pipeline(sample_text)
print(json.dumps(event, indent=2))
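
To persist those events, a minimal append-only JSONL writer with a UTC timestamp is enough to start with. This is a sketch; the field names are assumptions matching the audit layer described above:

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_audit_event(event: dict, log_path: Path) -> dict:
    """Append one audit event per line (JSONL) with a UTC timestamp."""
    record = {**event, "logged_at": datetime.now(timezone.utc).isoformat()}
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


# Example usage with a temporary file:
with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "kyc_audit.jsonl"
    append_audit_event({"decision": "approve", "document_hash": "abc123"}, path)
    print(path.read_text().strip())
```

Append-only storage matters here: reviewers and regulators need the sequence of decisions as they happened, not the latest state.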

Production Considerations

  • Keep PII inside your residency boundary

    • If your regulator requires regional storage or processing, pin deployment to approved regions.
    • Redact sensitive fields before sending anything to external services where possible.
  • Log everything needed for audit

    • Store prompt version, model version, input hash, extracted fields, policy outcome, and reviewer overrides.
    • Never rely on raw LLM output alone as evidence of compliance.
  • Add guardrails around confidence and fallback

    • Route low-confidence extractions to manual review.
    • Use deterministic validators for dates, expiry windows, sanctioned-country lists, and document formats.
  • Monitor drift in document quality

    • OCR quality changes fast across mobile uploads.
    • Track extraction failure rates by device type, region, language, and document template.
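
Masking PII before a record reaches the logs can be as simple as a field-level redaction helper. Which fields count as sensitive is a policy decision; the set below is illustrative:

```python
# Illustrative sensitive-field list; align with your data-handling policy.
SENSITIVE_FIELDS = {"full_name", "date_of_birth", "document_number", "address"}


def mask_for_logging(record: dict) -> dict:
    """Replace sensitive string values with a masked form that keeps only
    the last two characters, enough to correlate without exposing the value."""
    masked = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS and isinstance(value, str) and value:
            masked[key] = "*" * max(len(value) - 2, 0) + value[-2:]
        else:
            masked[key] = value
    return masked


print(mask_for_logging({"full_name": "Amina Yusuf", "country_of_issue": "Kenya"}))
```

Run every record through a helper like this at the logging boundary, so no code path can accidentally write a raw passport number into application logs.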

Common Pitfalls

  1. Letting the LLM make compliance decisions

    • Bad pattern: “approve this applicant if everything looks fine.”
    • Fix: use the model for extraction only; keep approval logic in deterministic code.
  2. Parsing free-form text instead of structured output

    • Bad pattern: regex over chat responses.
    • Fix: use with_structured_output() with Pydantic models so failures are explicit.
  3. Ignoring auditability and data handling rules

    • Bad pattern: logging full passports into application logs.
    • Fix: hash documents, mask PII in logs, version prompts/models, and store reviewer actions separately.

A KYC agent that works in fintech is not just an LLM wrapper. It is an extraction service plus policy engine plus audit trail plus human review path. If you keep those boundaries clean with LangChain’s structured output and deterministic validation outside the model, you get something regulators can inspect and operations can actually run.


By Cyprian Aarons, AI Consultant at Topiax.
