How to Build a KYC Verification Agent Using LangChain in Python for Insurance

By Cyprian Aarons · Updated 2026-04-21

A KYC verification agent for insurance collects customer identity data, checks it against policy and regulatory rules, flags missing or inconsistent information, and produces an auditable decision trail. That matters because insurers need to onboard customers fast without weakening AML/KYC controls, privacy requirements, or underwriting quality.

Architecture

  • Input capture layer

    • Accepts structured customer data: name, DOB, address, government ID number, policy type, jurisdiction.
    • Normalizes fields before they hit the agent.
  • Document extraction layer

    • Pulls text from uploaded IDs, proof of address, incorporation docs, or beneficiary forms.
    • In production this usually comes from OCR or document AI before LangChain sees the text.
  • KYC rules engine

    • Encodes deterministic checks: completeness, format validation, jurisdiction-specific required fields.
    • Keeps hard compliance rules out of the LLM.
  • LangChain reasoning layer

    • Uses ChatOpenAI with a structured output schema to classify risk, identify missing evidence, and explain the decision.
    • Produces machine-readable results for downstream workflow systems.
  • Audit and evidence store

    • Persists inputs, extracted facts, model output, timestamps, and rule decisions.
    • Required for insurance auditability and internal review.
  • Human review queue

    • Routes borderline or high-risk cases to compliance analysts.
    • Prevents automatic approval when confidence is low or documents conflict.
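The layers above can be sketched as a thin orchestrator. This is a minimal sketch, not the article's implementation: each layer is injected as a plain callable, and the stand-in signatures for extraction, rules, classification, and audit are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class KYCPipeline:
    # Each layer is injected as a callable so the orchestrator stays testable.
    extract: Callable[[bytes], str]              # document extraction (OCR output)
    run_rules: Callable[[dict], list]            # deterministic KYC rules engine
    classify: Callable[[dict, list, str], dict]  # LangChain reasoning layer
    audit: Callable[[dict], None]                # audit and evidence store

    def process(self, customer: dict, raw_doc: bytes) -> dict:
        doc_text = self.extract(raw_doc)
        issues = self.run_rules(customer)
        result = self.classify(customer, issues, doc_text)
        # Persist everything before routing, so the trail exists even if routing fails.
        self.audit({"customer": customer, "issues": issues, "result": result})
        # Hard-rule hits or a non-pass verdict always route to the human review queue.
        if issues or result.get("status") != "pass":
            result["route"] = "human_review"
        return result
```

Injecting the layers as callables keeps hard compliance rules testable in isolation from the LLM, which is the point of separating them in the first place.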

Implementation

1. Define the KYC schema and deterministic checks

Keep your compliance rules explicit. The LLM should interpret evidence and summarize gaps; it should not decide whether a missing passport number is acceptable in a regulated flow.

from typing import List, Literal
from pydantic import BaseModel, Field

class KYCResult(BaseModel):
    status: Literal["pass", "review", "fail"] = Field(..., description="Overall KYC outcome")
    risk_level: Literal["low", "medium", "high"] = Field(..., description="Assessed risk tier")
    missing_fields: List[str] = Field(default_factory=list, description="Required fields absent from the record")
    issues_found: List[str] = Field(default_factory=list, description="Inconsistencies or red flags in the evidence")
    summary: str = Field(..., description="Short human-readable rationale for the decision")

REQUIRED_FIELDS = ["full_name", "date_of_birth", "address", "government_id"]

def deterministic_checks(customer: dict) -> list[str]:
    issues = []
    for field in REQUIRED_FIELDS:
        if not customer.get(field):
            issues.append(f"Missing required field: {field}")
    if customer.get("country") == "US" and not customer.get("ssn_last4"):
        issues.append("Missing SSN last4 for US customer")
    return issues
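As a quick sanity check, an incomplete US record surfaces both the missing base fields and the jurisdiction-specific one. The definitions are repeated here so the snippet runs standalone:

```python
REQUIRED_FIELDS = ["full_name", "date_of_birth", "address", "government_id"]

def deterministic_checks(customer: dict) -> list[str]:
    issues = []
    for field in REQUIRED_FIELDS:
        if not customer.get(field):
            issues.append(f"Missing required field: {field}")
    if customer.get("country") == "US" and not customer.get("ssn_last4"):
        issues.append("Missing SSN last4 for US customer")
    return issues

# address, government_id, and ssn_last4 are all absent from this record.
incomplete = {"full_name": "Jane Doe", "date_of_birth": "1990-01-01", "country": "US"}
print(deterministic_checks(incomplete))
# → ['Missing required field: address', 'Missing required field: government_id',
#    'Missing SSN last4 for US customer']
```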

2. Build the LangChain chain with structured output

This pattern uses ChatOpenAI plus with_structured_output() so you get typed JSON back instead of free-form text. That makes it easier to wire into underwriting or onboarding workflows.

import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    api_key=os.environ["OPENAI_API_KEY"],
)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a KYC verification assistant for an insurance company. "
     "Use only the provided customer data and extracted document text. "
     "Do not invent facts. Return a compliant assessment."),
    ("human",
     """Customer record:
{customer_json}

Deterministic issues:
{deterministic_issues}

Extracted document text:
{document_text}

Assess whether this case passes KYC, needs human review, or fails.
Explain any missing fields or inconsistencies clearly.""")
])

kyc_chain = prompt | llm.with_structured_output(KYCResult)

3. Run the agent on a real case

In production you would feed this chain from your intake API after OCR and validation. The key is that the model sees both the raw record and the rule-based findings.

import json

customer = {
    "full_name": "Amina Yusuf",
    "date_of_birth": "1988-02-14",
    "address": "12 Park Lane, Lagos",
    "government_id": "ID12345678",
    "country": "NG",
    "policy_type": "life_insurance"
}

document_text = """
National Identity Card
Name: Amina Yusuf
DOB: 1988-02-14
Address: 12 Park Lane, Lagos
ID No: ID12345678
"""

deterministic_issues = deterministic_checks(customer)

result = kyc_chain.invoke({
    "customer_json": json.dumps(customer),
    "deterministic_issues": "\n".join(deterministic_issues) or "None",
    "document_text": document_text,
})

print(result.model_dump())
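What happens next depends on `result.status`. A minimal routing sketch, under the assumption of three downstream queues (the queue names here are illustrative, not part of LangChain or the article's stack):

```python
def route_case(status: str, deterministic_issues: list) -> str:
    # Hard rule hits always force human review, regardless of the model's verdict.
    if deterministic_issues:
        return "human_review"
    if status == "pass":
        return "auto_approve"
    if status == "review":
        return "human_review"
    # "fail" still gets a compliance sign-off downstream rather than a silent drop.
    return "reject_queue"

print(route_case("pass", []))                                     # → auto_approve
print(route_case("pass", ["Missing SSN last4 for US customer"]))  # → human_review
```

Note the asymmetry: the model can only demote a case toward review, never promote a rule failure to an approval.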

4. Add audit logging for compliance

Insurance teams will ask who approved what, when, with which evidence. Store the input payloads and model output together so you can reconstruct every decision during internal audit or regulator review.

from datetime import datetime, timezone
from pathlib import Path

def write_audit_record(customer: dict, doc_text: str, result: KYCResult) -> None:
    record = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "customer": customer,
        "document_text": doc_text,
        "kyc_result": result.model_dump(),
    }
    # Append-only JSONL keeps the trail easy to tail and hard to rewrite in place.
    with Path("audit_log.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

write_audit_record(customer, document_text, result)

Production Considerations

  • Data residency

    • Keep PII in-region if your insurer operates under local residency rules.
    • If you use hosted models, confirm where prompts and logs are processed and retained.
  • Monitoring

    • Track pass/review/fail rates by country, product line, and channel.
    • Watch for drift when document templates change or new jurisdictions are added.
  • Guardrails

    • Never let the LLM override hard compliance checks.
    • Use structured output only; reject malformed responses before they hit workflow systems.
  • Auditability

    • Persist raw input, extracted evidence, deterministic rule outputs, model output, and reviewer overrides.
    • Make records immutable enough for compliance teams to trust them later.
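On the "reject malformed responses" point: `with_structured_output()` already returns a typed object, but a defensive re-validation at the workflow boundary is cheap insurance against upstream changes. A stdlib-only sketch (the field names mirror the `KYCResult` schema; this is not part of LangChain):

```python
import json

ALLOWED_STATUS = {"pass", "review", "fail"}

def validate_llm_output(raw: str) -> dict:
    """Reject malformed model output before it reaches workflow systems."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Non-JSON model output: {exc}") from exc
    if data.get("status") not in ALLOWED_STATUS:
        raise ValueError(f"Invalid status: {data.get('status')!r}")
    if not isinstance(data.get("missing_fields", []), list):
        raise ValueError("missing_fields must be a list")
    return data
```

Failing closed here means a schema drift upstream becomes a loud error in monitoring, not a silent bad approval.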

Common Pitfalls

  1. Letting the model make legal decisions

    • Don’t ask the LLM to decide if a case is “compliant” in isolation.
    • Use it to summarize evidence and flag uncertainty; keep approval logic in code.
  2. Skipping jurisdiction-specific rules

    • KYC requirements differ across countries and product types.
    • Encode those differences in configuration tables or policy services instead of hardcoding one global flow.
  3. Not storing enough evidence

    • If you only store the final answer, you cannot defend the decision later.
    • Save source documents, extracted text snippets, rule hits, timestamps, and reviewer actions in your audit trail.
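One way to encode those per-jurisdiction differences is a plain configuration table that the rules engine consults. The table below is hypothetical; the actual required fields and ID types come from your compliance team, not this sketch:

```python
# Hypothetical per-jurisdiction rule table; values are illustrative only.
JURISDICTION_RULES = {
    "US": {"extra_required": ["ssn_last4"], "id_types": ["passport", "drivers_license"]},
    "NG": {"extra_required": ["bvn"], "id_types": ["national_id", "passport"]},
    "DEFAULT": {"extra_required": [], "id_types": ["passport"]},
}

def required_fields_for(country: str) -> list:
    base = ["full_name", "date_of_birth", "address", "government_id"]
    rules = JURISDICTION_RULES.get(country, JURISDICTION_RULES["DEFAULT"])
    return base + rules["extra_required"]

print(required_fields_for("US"))
# → ['full_name', 'date_of_birth', 'address', 'government_id', 'ssn_last4']
```

Keeping this in data rather than code means adding a jurisdiction is a config change with a review trail, not a redeploy of the rules engine.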

By Cyprian Aarons, AI Consultant at Topiax.