How to Build a KYC Verification Agent Using LlamaIndex in Python for Fintech

By Cyprian Aarons · Updated 2026-04-21
Tags: kyc-verification, llamaindex, python, fintech

A KYC verification agent ingests customer documents, extracts identity signals, checks them against policy and reference data, and returns a structured decision with evidence. In fintech, that matters because onboarding speed, fraud prevention, and compliance all sit on the same workflow, and you need an agent that can explain every decision to auditors and ops teams.

Architecture

  • Document ingestion layer

    • Accepts PDFs, scans, bank statements, passports, utility bills, and corporate docs.
    • Normalizes files into text with metadata like customer_id, jurisdiction, and source_system.
  • LlamaIndex retrieval layer

    • Uses VectorStoreIndex to search internal KYC policy, AML rules, and country-specific onboarding requirements.
    • Retrieves only the relevant policy chunks for the current applicant.
  • Extraction and verification layer

    • Uses LLM-backed structured extraction to pull fields like name, DOB, address, document number, issue date.
    • Compares extracted values across documents for consistency.
  • Decision engine

    • Applies deterministic checks for completeness, expiry, sanctions flags, residency rules, and mismatches.
    • Produces approve, reject, or manual_review.
  • Audit trail layer

    • Stores retrieved policy snippets, extracted fields, model outputs, timestamps, and versioned prompts.
    • This is what makes the system defensible during compliance review.
  • Human review interface

    • Escalates ambiguous cases with evidence packets.
    • Keeps analysts in control when confidence is low or regulation requires manual approval.

Implementation

1) Load KYC policy docs into a LlamaIndex index

Use your internal policy documents as the source of truth. For fintech work, keep these docs versioned by jurisdiction so the agent can retrieve the right rule set.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Load policy documents from disk
docs = SimpleDirectoryReader(
    input_dir="./kyc_policy_docs",
    recursive=True
).load_data()

# Chunk for retrieval
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=80)
nodes = splitter.get_nodes_from_documents(docs)

# Build index
index = VectorStoreIndex(nodes)

# Create a retriever
retriever = index.as_retriever(similarity_top_k=3)
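If each jurisdiction's policies live in their own subdirectory (e.g. ./kyc_policy_docs/NG/), you can tag every document at load time via SimpleDirectoryReader's `file_metadata` callable. A sketch, assuming that directory layout (the helper name is mine):

```python
from pathlib import Path

def jurisdiction_metadata(file_path: str) -> dict:
    """Derive a jurisdiction tag from a path like kyc_policy_docs/NG/onboarding.pdf."""
    parts = Path(file_path).parts
    try:
        idx = parts.index("kyc_policy_docs")
        # the first subdirectory (if any) names the jurisdiction
        jurisdiction = parts[idx + 1] if idx + 1 < len(parts) - 1 else "global"
    except ValueError:
        jurisdiction = "global"
    return {"jurisdiction": jurisdiction}
```

Pass it as `SimpleDirectoryReader(input_dir="./kyc_policy_docs", recursive=True, file_metadata=jurisdiction_metadata)`; the tag then travels with each node, so retrieval can be scoped to the applicant's jurisdiction with metadata filters.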

2) Extract KYC fields from an applicant document

For production KYC flows you want structured output rather than free text. LlamaIndex supports this through structured prediction against a Pydantic schema. The exact model choice depends on your deployment constraints; in regulated environments I usually point this at a private model endpoint.

from pydantic import BaseModel, Field
from typing import Optional
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

class KYCProfile(BaseModel):
    full_name: str = Field(description="Customer legal full name")
    date_of_birth: str = Field(description="Date of birth in YYYY-MM-DD")
    address: str = Field(description="Residential address")
    document_type: str = Field(description="Type of identity document")
    document_number: Optional[str] = Field(default=None)
    expiry_date: Optional[str] = Field(default=None)

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)

# Example text extracted from OCR or uploaded PDF parsing
applicant_text = """
Name: Amina Yusuf
DOB: 1991-04-18
Address: 22 Kingfisher Road, Lagos
Document: National ID Card
ID No: NIN12345678901
Expiry: 2029-08-31
"""

from llama_index.core import PromptTemplate

extract_prompt = PromptTemplate(
    "Extract the applicant KYC profile from this text. "
    "Return only structured data matching the schema.\n\n"
    "Text:\n{applicant_text}"
)

# structured_predict parses and validates the output against KYCProfile
profile = Settings.llm.structured_predict(
    KYCProfile, extract_prompt, applicant_text=applicant_text
)
print(profile.model_dump())

Because structured_predict coerces the response into a KYCProfile instance, the agent emits machine-readable fields before any compliance decision happens. If the model output does not match the schema, validation fails loudly; wrap the call in a try/except that routes failures to manual review rather than letting free text leak into the decision engine.

3) Retrieve policy context and make a decision

This is where LlamaIndex becomes useful beyond extraction. The agent retrieves only relevant policy snippets for the applicant’s jurisdiction and then combines that with deterministic checks.


def kyc_decision(profile: dict, jurisdiction: str):
    query = f"KYC onboarding requirements for {jurisdiction} identity verification"
    results = retriever.retrieve(query)

    policy_context = "\n\n".join([r.node.get_content() for r in results])

    missing_fields = []
    for field in ["full_name", "date_of_birth", "address", "document_type"]:
        if not profile.get(field):
            missing_fields.append(field)

    if missing_fields:
        return {
            "decision": "manual_review",
            "reason": f"Missing required fields: {missing_fields}",
            "policy_context": policy_context,
        }

    if profile.get("expiry_date") is None:
        return {
            "decision": "manual_review",
            "reason": "Document expiry date not found",
            "policy_context": policy_context,
        }

    return {
        "decision": "approve",
        "reason": "Required KYC fields present and no blocking issues detected",
        "policy_context": policy_context,
    }

profile = {
    "full_name": "Amina Yusuf",
    "date_of_birth": "1991-04-18",
    "address": "22 Kingfisher Road, Lagos",
    "document_type": "National ID Card",
    "document_number": "NIN12345678901",
    "expiry_date": "2029-08-31",
}

result = kyc_decision(profile, jurisdiction="Nigeria")
print(result)

That pattern gives you two important properties:

  • Retrieval grounds the agent in current policy.
  • Deterministic checks keep final decisions predictable.

4) Store an audit record for compliance

For fintech you need to reconstruct every step later. Save the input doc metadata, retrieved nodes, extracted fields, prompt version, model version, and final decision.

import json
from datetime import datetime, timezone

def build_audit_record(customer_id: str, profile: dict, result: dict):
    return {
        "customer_id": customer_id,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "extracted_profile": profile,
        "decision": result["decision"],
        "reason": result["reason"],
        "policy_context": result["policy_context"],
        # Settings.llm.metadata is a property, not a method
        "model_version": getattr(Settings.llm.metadata, "model_name", None),
        "prompt_version": "kyc-extract-v1",
    }

audit_record = build_audit_record("cust_001", profile, result)

with open("kyc_audit_log.jsonl", "a") as f:
    f.write(json.dumps(audit_record) + "\n")

Production Considerations

  • Data residency

    • Keep customer PII inside the required region.
    • If your regulator requires local storage or processing boundaries, do not send raw identity docs to external endpoints outside that region.
  • Monitoring

    • Track approval rate by country, manual review rate, false reject rate, and extraction failure rate.
    • Alert on drift when one document type or one jurisdiction starts failing more often than baseline.
  • Guardrails

    • Never let the model make unsupervised sanctions or PEP decisions without deterministic backstops.
    • Use allowlists for supported jurisdictions and block unsupported flows early.
  • Auditability

    • Version prompts like code.
    • Persist retrieved policy chunks so compliance can see exactly which rule informed a decision.

Common Pitfalls

  1. Using the LLM as the final authority

    • Bad pattern: “model says approved.”
    • Fix it by making the model extract fields and summarize evidence while your rules engine makes the final call.
  2. Skipping jurisdiction-specific policy retrieval

    • A single global KYC rule set will fail fast in real banking workflows.
    • Fix it by indexing policies per country or region and retrieving only what applies to that customer.
  3. Not storing evidence for manual review

    • If an analyst cannot see why a case was escalated, throughput drops and compliance gets noisy.
    • Fix it by logging extracted fields, the policy snippets returned by retriever.retrieve(), and the exact decision reason.
  4. Ignoring PII handling controls

    • Raw passport images in logs are a security incident waiting to happen.
    • Fix it by redacting sensitive values in logs and keeping encrypted storage with strict access control.
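A minimal redaction helper for pitfall 4 (the field names and keep-last-four rule are my choices; adapt them to your logging policy):

```python
def mask(value: str, keep: int = 4) -> str:
    """Mask all but the last `keep` characters of a sensitive string."""
    if len(value) <= keep:
        return "*" * len(value)
    return "*" * (len(value) - keep) + value[-keep:]

SENSITIVE_FIELDS = {"document_number", "date_of_birth", "address"}

def redact_profile(profile: dict) -> dict:
    """Return a copy of the profile that is safe to write into application logs."""
    return {
        k: mask(str(v)) if k in SENSITIVE_FIELDS and v else v
        for k, v in profile.items()
    }
```

Run every profile through redact_profile before it touches structured logs or error trackers; only the encrypted audit store should hold unmasked values.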

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.
