How to Build a KYC Verification Agent Using LlamaIndex in Python for Lending

By Cyprian Aarons · Updated 2026-04-21
kyc-verification · llamaindex · python · lending

A KYC verification agent for lending takes borrower documents, extracts identity and business signals, checks them against policy, and returns a decision-ready summary with evidence. For lenders, this matters because onboarding speed is tied directly to conversion, but every automated decision still has to survive compliance review, audit requests, and model risk scrutiny.

Architecture

  • Document ingestion layer

    • Accepts PDFs, scans, bank statements, utility bills, passports, and incorporation documents.
    • Normalizes files into text and metadata before they hit the agent (see the sketch after this list).
  • LlamaIndex retrieval layer

    • Uses VectorStoreIndex for semantic lookup across policy docs, KYC checklists, and jurisdiction-specific rules.
    • Lets the agent ground decisions in internal policy instead of free-form generation.
  • Extraction and verification tools

    • Pulls out fields like full name, address, DOB, registration number, and document expiry.
    • Cross-checks extracted values against expected lending requirements.
  • Decision orchestration agent

    • Uses FunctionAgent or a tool-calling LLM workflow to decide whether to approve, reject, or escalate.
    • Produces a structured result that downstream systems can store.
  • Audit trail and evidence store

    • Persists extracted fields, source citations, timestamps, and the exact policy chunks used.
    • This is non-negotiable for lending audits.
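
To make the ingestion layer concrete, here is a minimal sketch of the normalized record it could hand downstream. The NormalizedDoc structure and normalize_upload helper are illustrative assumptions rather than LlamaIndex APIs; the point is that the agent only ever sees cleaned text plus metadata you can audit later.

import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class NormalizedDoc:
    """Normalized output of the ingestion layer: cleaned text plus audit metadata."""
    doc_type: str          # e.g. "passport", "bank_statement", "incorporation_cert"
    text: str              # OCR or parsed text, never the raw binary
    source_filename: str
    sha256: str            # hash of the exact text the agent will see
    received_at_utc: str

def normalize_upload(doc_type: str, filename: str, ocr_text: str) -> NormalizedDoc:
    return NormalizedDoc(
        doc_type=doc_type,
        text=ocr_text,
        source_filename=filename,
        sha256=hashlib.sha256(ocr_text.encode("utf-8")).hexdigest(),
        received_at_utc=datetime.now(timezone.utc).isoformat(),
    )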

Implementation

1) Install dependencies and load your KYC policy corpus

You want the agent grounded in your internal KYC policy docs first. In practice that means ingesting lending policies, AML/KYC rules, and country-specific onboarding checklists into a retrievable index.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.settings import Settings
from llama_index.llms.openai import OpenAI

# Configure your LLM
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)

# Load policy documents from disk
docs = SimpleDirectoryReader(
    input_dir="./kyc_policies",
    recursive=True
).load_data()

# Build the index used by the agent for grounded retrieval
policy_index = VectorStoreIndex.from_documents(docs)
policy_retriever = policy_index.as_retriever(similarity_top_k=3)

2) Define tools for extraction and policy lookup

For lending KYC you usually need two tool types: one that retrieves relevant policy text and another that extracts fields from uploaded documents. Keep extraction deterministic where possible; don’t ask the LLM to invent structure from scratch.

import re
from typing import Dict
from llama_index.core.tools import FunctionTool

def extract_kyc_fields(text: str) -> Dict[str, str]:
    patterns = {
        "full_name": r"Name[:\s]+([A-Za-z ,.'-]+)",
        "dob": r"(?:DOB|Date of Birth)[:\s]+([0-9]{2}/[0-9]{2}/[0-9]{4})",
        "address": r"Address[:\s]+(.+)",
        "id_number": r"(?:ID No|Passport No|Registration No)[:\s]+([A-Z0-9\-]+)",
    }
    results = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            results[key] = match.group(1).strip()
    return results

def retrieve_policy(query: str) -> str:
    nodes = policy_retriever.retrieve(query)
    return "\n\n".join([node.node.get_content() for node in nodes])

extract_tool = FunctionTool.from_defaults(
    fn=extract_kyc_fields,
    name="extract_kyc_fields",
    description="Extract KYC fields from OCR text."
)

policy_tool = FunctionTool.from_defaults(
    fn=retrieve_policy,
    name="retrieve_policy",
    description="Retrieve relevant lending KYC policy text."
)

3) Build the verification agent with structured output

The pattern here is: extract facts from the document text, retrieve applicable policy chunks, then ask the agent to classify the case with citations. Use a strict schema so your backend can consume the result without parsing prose.

import asyncio
from typing import List, Literal

from pydantic import BaseModel
from llama_index.core.agent.workflow import FunctionAgent

class KYCDecision(BaseModel):
    decision: Literal["approve", "reject", "manual_review"]
    reasons: List[str]
    missing_fields: List[str]
    evidence: List[str]

system_prompt = """
You are a KYC verification agent for lending.
Use only the provided tools and retrieved policy text.
Return decisions based on lending KYC requirements.
If data is incomplete or inconsistent, choose manual_review.
Always include evidence from source text or policy references.
"""

agent = FunctionAgent(
    tools=[extract_tool, policy_tool],
    llm=Settings.llm,
    system_prompt=system_prompt,
)

async def verify_applicant(ocr_text: str) -> KYCDecision:
    extracted = extract_kyc_fields(ocr_text)
    policy_context = retrieve_policy("lending KYC requirements for individual borrower onboarding")
    
    prompt = f"""
Applicant OCR text:
{ocr_text}

Extracted fields:
{extracted}

Relevant policy:
{policy_context}

Decide whether this applicant should be approved, rejected, or sent to manual review.
Return only a JSON object matching this schema, with no extra text:
{KYCDecision.model_json_schema()}
"""
    # FunctionAgent runs as an async workflow, so await its final output
    response = await agent.run(prompt)
    # Assumes the model returned only the JSON object, as the prompt instructs
    return KYCDecision.model_validate_json(str(response))

# Example usage
sample_text = """
Name: Jane Doe
DOB: 14/02/1990
Address: 12 Market Street, Nairobi
ID No: A1234567
"""
result = asyncio.run(verify_applicant(sample_text))
print(result.model_dump())

4) Persist audit evidence for compliance review

Lending teams need traceability. Store the raw OCR text hash, extracted fields, retrieved policy snippets, model version, and final decision. That gives you a defensible audit trail when underwriting or compliance asks why a case was routed manually.

import json
import hashlib
from datetime import datetime, timezone

def audit_record(ocr_text: str, decision: KYCDecision) -> dict:
    return {
        "timestamp_utc": datetime.utcnow().isoformat(),
        "ocr_sha256": hashlib.sha256(ocr_text.encode("utf-8")).hexdigest(),
        "decision": decision.model_dump(),
        "model": Settings.llm.metadata.model_name if Settings.llm.metadata else "unknown",
        "workflow": "kyc_verification_agent_v1",
    }

record = audit_record(sample_text, result)
with open("audit_log.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")

Production Considerations

  • Data residency

    • Keep borrower PII in-region if your lending book operates under local residency rules.
    • If you use managed LLM APIs, confirm where prompts and embeddings are processed and retained.
  • Monitoring

    • Track approval rate by geography, document type failure rate, manual review rate, and extraction confidence.
    • Alert on spikes in “manual_review” because they often indicate OCR drift or a bad prompt change.
  • Guardrails

    • Never let the agent make final credit decisions; it should only support KYC eligibility.
    • Enforce schema validation with Pydantic before writing results into underwriting systems.
  • Compliance controls

    • Version your prompts and policy corpus (see the sketch after this list).
    • Keep immutable logs of source documents cited by the agent for every decision path.
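
One way to make "version your prompts and policy corpus" concrete is to pin those versions onto every audit record. The version constants below are illustrative placeholders; in practice they would come from your config or release tooling, and audit_record is the helper defined in step 4.

# Illustrative version pins (assumed values); wire these to your release process
PROMPT_VERSION = "kyc_system_prompt_v1"
POLICY_CORPUS_VERSION = "kyc_policies_2026_04"

def audit_record_versioned(ocr_text: str, decision: KYCDecision) -> dict:
    # Extend the step-4 record so each decision traces to an exact prompt
    # version and policy corpus snapshot
    record = audit_record(ocr_text, decision)
    record["prompt_version"] = PROMPT_VERSION
    record["policy_corpus_version"] = POLICY_CORPUS_VERSION
    return record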

Common Pitfalls

  1. Using the LLM as an OCR engine

    • Don’t feed raw scans directly to the agent and expect reliable field extraction.
    • Run OCR first with a dedicated engine like Tesseract or AWS Textract, then pass cleaned text into LlamaIndex (see the OCR sketch after this list).
  2. Mixing KYC with credit risk logic

    • KYC verifies identity and completeness; it does not decide affordability or default risk.
    • Keep those workflows separate so compliance can review them independently.
  3. Skipping jurisdiction-specific rules

    • A single global checklist will fail in production because lender obligations vary by country and product type.
    • Index local policies separately and route retrieval by jurisdiction before calling the agent (see the routing sketch after this list).
  4. No human escalation path

    • Some cases will always need manual review: mismatched names, expired IDs, low-quality scans.
    • Build an explicit escalation state instead of forcing an approve/reject answer when evidence is weak.
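
For pitfall 1, here is a minimal OCR preprocessing sketch. It assumes pytesseract and Pillow are installed alongside the Tesseract binary; the file name is hypothetical, and a managed OCR service like AWS Textract would slot in the same way.

from PIL import Image
import pytesseract

def ocr_scan(image_path: str) -> str:
    # Run OCR outside the agent so the LLM only ever sees cleaned text
    raw = pytesseract.image_to_string(Image.open(image_path))
    # Drop empty lines so regex-based field extraction behaves predictably
    return "\n".join(line.strip() for line in raw.splitlines() if line.strip())

# ocr_text = ocr_scan("passport_scan.png")  # hypothetical scan
# result = asyncio.run(verify_applicant(ocr_text))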
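
For pitfall 3, one simple routing approach is to build a separate index per jurisdiction and pick the retriever before the agent runs. The directory layout and jurisdiction codes below are assumptions for illustration.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Assumed layout: ./kyc_policies/KE, ./kyc_policies/NG, ./kyc_policies/GB
JURISDICTIONS = ["KE", "NG", "GB"]

jurisdiction_retrievers = {
    code: VectorStoreIndex.from_documents(
        SimpleDirectoryReader(input_dir=f"./kyc_policies/{code}").load_data()
    ).as_retriever(similarity_top_k=3)
    for code in JURISDICTIONS
}

def retrieve_policy_for(jurisdiction: str, query: str) -> str:
    # Route to the right policy corpus before the agent ever sees the case
    nodes = jurisdiction_retrievers[jurisdiction].retrieve(query)
    return "\n\n".join(n.node.get_content() for n in nodes)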

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

