How to Build a KYC Verification Agent Using LlamaIndex in Python for Banking
A KYC verification agent automates the first pass of customer due diligence: it ingests identity documents, extracts structured fields, checks them against policy and external sources, and flags mismatches for human review. For banking, this matters because onboarding speed is tied directly to conversion, but every automated decision still has to survive compliance, audit, and model-risk scrutiny.
Architecture
- **Document ingestion layer**
  - Accepts passports, national IDs, utility bills, bank statements, and corporate registration docs.
  - Normalizes PDFs, scans, and images into text plus metadata.
- **Extraction and indexing layer**
  - Uses LlamaIndex to turn raw documents into searchable nodes.
  - Stores extracted fields such as name, DOB, address, document number, and issue date.
- **Policy retrieval layer**
  - Retrieves bank KYC rules from internal SOPs, jurisdiction-specific policies, and product-specific onboarding rules.
  - Keeps the agent grounded in current compliance requirements.
- **Verification engine**
  - Compares extracted customer data against policy and reference sources.
  - Produces structured outcomes: pass, fail, or manual review.
- **Audit trail layer**
  - Persists inputs, retrieved policy snippets, model outputs, and final decisions.
  - Supports regulatory review and internal QA.
- **Human escalation layer**
  - Routes edge cases to analysts when confidence is low or rules conflict.
  - Prevents over-automation on high-risk cases.
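The layers above can be sketched as a single case object that each stage enriches. The `KYCCase` dataclass and stage names below are illustrative scaffolding, not LlamaIndex APIs:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class KYCCase:
    """Carries one applicant through the pipeline; every layer appends to the audit log."""
    raw_documents: list                                    # document ingestion layer
    extracted_fields: dict = field(default_factory=dict)   # extraction and indexing layer
    policy_snippets: list = field(default_factory=list)    # policy retrieval layer
    verdict: Optional[str] = None                          # verification engine output
    audit_log: list = field(default_factory=list)          # audit trail layer

def record(case: KYCCase, stage: str, detail: str) -> None:
    # Append-only trail so any decision can be reconstructed later.
    case.audit_log.append({"stage": stage, "detail": detail})

case = KYCCase(raw_documents=["passport_scan.pdf"])
record(case, "ingestion", "normalized 1 document to text")
```

Each layer then reads what it needs from the case and records what it did, which keeps the audit trail in lockstep with the pipeline.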
Implementation
1) Load KYC policy documents into a LlamaIndex index
Start with your internal KYC policy docs. In banking, this should be the exact version approved by compliance and legal, not a wiki copy.
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Configure models
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)

# Load internal KYC policies
docs = SimpleDirectoryReader("./kyc_policies").load_data()

# Build index for retrieval over policy text
policy_index = VectorStoreIndex.from_documents(docs)
policy_retriever = policy_index.as_retriever(similarity_top_k=3)
```
This gives you grounded retrieval over the bank’s actual onboarding rules. The key point is that the agent should answer from policy text first, not from model memory.
2) Define a structured extraction schema for customer data
For KYC you want structured output. Use Pydantic models so downstream logic can validate required fields before any decision is made.
```python
from typing import Optional
from pydantic import BaseModel, Field

class KYCProfile(BaseModel):
    full_name: str = Field(..., description="Customer legal name")
    date_of_birth: Optional[str] = Field(None, description="YYYY-MM-DD if available")
    address: Optional[str] = Field(None, description="Residential address")
    document_type: str = Field(..., description="Passport, national ID, etc.")
    document_number: Optional[str] = None
    country_of_issue: Optional[str] = None
```
If you are processing OCR text from a passport or utility bill, this schema becomes the contract between extraction and verification. It also makes it easier to log exactly what the agent believed at decision time.
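As a quick illustration of that contract, an incomplete record fails validation before any decision logic runs. The schema is repeated in trimmed form here so the snippet stands alone:

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class KYCProfile(BaseModel):
    full_name: str = Field(..., description="Customer legal name")
    date_of_birth: Optional[str] = Field(None, description="YYYY-MM-DD if available")
    document_type: str = Field(..., description="Passport, national ID, etc.")
    document_number: Optional[str] = None

# A complete record parses cleanly.
profile = KYCProfile(full_name="Amina Yusuf", document_type="Passport")

# A record missing a required field is rejected before verification ever runs.
try:
    KYCProfile(full_name="Amina Yusuf")  # document_type missing
    outcome = "accepted"
except ValidationError:
    outcome = "rejected"
```

Failing fast here means the verification engine only ever sees records that satisfy the schema.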
3) Build an agent that retrieves policy and produces a verification verdict
Use FunctionAgent with a tool that retrieves relevant policy snippets. Then ask the model to compare extracted customer data against those snippets and return a controlled result.
```python
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.core.tools import FunctionTool

def retrieve_kyc_policy(query: str) -> str:
    """Return the top policy passages relevant to a verification question."""
    nodes = policy_retriever.retrieve(query)
    return "\n\n".join(node.node.get_content() for node in nodes)

policy_tool = FunctionTool.from_defaults(
    fn=retrieve_kyc_policy,
    name="retrieve_kyc_policy",
    description="Retrieve relevant KYC policy sections for a given verification question.",
)

agent = FunctionAgent(
    tools=[policy_tool],
    llm=Settings.llm,
    system_prompt=(
        "You are a banking KYC verification agent. "
        "Use only retrieved policy text and provided customer data. "
        "Return one of: PASS, FAIL, MANUAL_REVIEW. "
        "Explain which rule was applied."
    ),
)
```
`FunctionAgent.run` is asynchronous, so drive it from an event loop:

```python
import asyncio

customer_text = """
Customer legal name: Amina Yusuf
DOB: 1991-04-12
Address: 14 River Road, Nairobi
Document type: Passport
Document number: P12345678
Country of issue: KE
"""

async def verify_customer():
    # FunctionAgent.run is async; await it inside an event loop.
    return await agent.run(
        f"""
Verify this customer against bank KYC rules.
Customer data:
{customer_text}
First retrieve the relevant policy sections using the tool.
Then provide a verdict with rationale.
"""
    )

response = asyncio.run(verify_customer())
print(response)
```
This pattern works because retrieval is explicit. In production you can wrap this with OCR extraction upstream and sanctions/PEP screening downstream.
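The downstream screening step can be wired in as a deterministic override applied after the agent verdict. `screen_sanctions` below is a placeholder for a real sanctions/PEP provider, with a hardcoded watchlist standing in for its API:

```python
WATCHLIST = {"BLOCKED PERSON"}  # stand-in for a real sanctions/PEP screening API

def screen_sanctions(full_name: str) -> bool:
    """Return True when the customer is clear of the watchlist."""
    return full_name.strip().upper() not in WATCHLIST

def final_verdict(full_name: str, agent_verdict: str) -> str:
    # Screening runs regardless of the LLM verdict: a hit always escalates.
    if not screen_sanctions(full_name):
        return "MANUAL_REVIEW"
    return agent_verdict

print(final_verdict("Amina Yusuf", "PASS"))      # clear name keeps the agent verdict
print(final_verdict("Blocked Person", "PASS"))   # watchlist hit forces escalation
```

The point of the design is that the model can never overrule the screen; the override sits in code, after the model.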
4) Add deterministic post-processing before any onboarding decision
Do not let the LLM make the final business decision alone. Parse its output into a controlled structure and enforce hard rules in Python.
```python
import re

def parse_verdict(text: str) -> str:
    """Extract a controlled outcome; anything unparseable escalates to a human."""
    match = re.search(r"\b(PASS|FAIL|MANUAL_REVIEW)\b", str(text))
    if not match:
        return "MANUAL_REVIEW"
    return match.group(1)

verdict = parse_verdict(response)

if verdict == "PASS":
    next_step = "approve_onboarding"
elif verdict == "FAIL":
    next_step = "reject_and_log"
else:
    next_step = "send_to_analyst"

print({"verdict": verdict, "next_step": next_step})
```
That last step matters. Banking workflows need deterministic control points so you can enforce thresholds like document freshness windows or jurisdiction-specific exceptions outside the model.
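A document freshness window is one such control point. The 90-day limit below is an illustrative value, not a regulatory constant, and belongs in policy config rather than code:

```python
from datetime import date, timedelta
from typing import Optional

PROOF_OF_ADDRESS_MAX_AGE = timedelta(days=90)  # illustrative threshold; set per policy

def document_is_fresh(issue_date: date, today: Optional[date] = None) -> bool:
    """Deterministic freshness check enforced in Python, outside the model."""
    today = today or date.today()
    return (today - issue_date) <= PROOF_OF_ADDRESS_MAX_AGE
```

Because the check is plain Python, compliance can review and test it directly, independent of any prompt.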
Production Considerations
- **Keep sensitive data inside your residency boundary**
  - If your bank operates under in-region storage requirements, pin OCR text, embeddings, logs, and vector DBs to approved regions.
  - Avoid sending raw identity documents to unmanaged third-party services.
- **Log everything needed for audit**
  - Persist retrieved policy chunks by `node_id`, plus input hashes, model version, prompt version, verdicts, and analyst overrides.
  - Regulators will ask why a case was passed or escalated; "the model said so" is not acceptable.
- **Add guardrails around high-risk decisions**
  - Force `MANUAL_REVIEW` when fields are missing or OCR confidence is low.
  - Block auto-pass on politically exposed persons (PEPs), sanctioned geographies, expired IDs, or inconsistent addresses.
- **Monitor drift in both documents and policies**
  - Track extraction failure rates by document type.
  - Re-index policies whenever compliance updates onboarding rules; stale retrieval is a real control failure.
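The guardrail bullet can be made concrete as a hard override that runs after the model response. The required-field list and confidence floor below are illustrative values:

```python
REQUIRED_FIELDS = ("full_name", "date_of_birth", "document_number")
OCR_CONFIDENCE_FLOOR = 0.85  # illustrative threshold; tune against your OCR stack

def apply_guardrails(fields: dict, ocr_confidence: float, verdict: str) -> str:
    """Hard overrides enforced after the model response, never inside the prompt."""
    if any(not fields.get(name) for name in REQUIRED_FIELDS):
        return "MANUAL_REVIEW"  # missing data can never auto-pass
    if ocr_confidence < OCR_CONFIDENCE_FLOOR:
        return "MANUAL_REVIEW"  # low-quality scans go to an analyst
    return verdict
```

Note that the function can only tighten an outcome, never loosen one; a `FAIL` stays a `FAIL` no matter how clean the scan is.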
Common Pitfalls
- **Using free-form LLM output as the final decision**
  - Avoid this by parsing into strict outcomes like `PASS`, `FAIL`, or `MANUAL_REVIEW`.
  - Keep business logic in Python after the model response.
- **Indexing outdated compliance documents**
  - If your retriever pulls old KYC standards, your agent will confidently apply the wrong rules.
  - Version policies explicitly and rebuild indexes when approved policies change.
- **Skipping human review on ambiguous cases**
  - OCR errors on names and addresses are common in scanned IDs.
  - Route low-confidence matches to analysts instead of forcing an automated reject or approve.
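One lightweight way to catch stale indexes is to fingerprint the policy directory and rebuild whenever the hash changes. This is a sketch of the versioning idea, not a LlamaIndex feature:

```python
import hashlib
from pathlib import Path

def policy_fingerprint(policy_dir: str) -> str:
    """Hash all policy files so any approved change forces an index rebuild."""
    digest = hashlib.sha256()
    for path in sorted(Path(policy_dir).rglob("*")):
        if path.is_file():
            digest.update(path.name.encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

def index_is_stale(policy_dir: str, stored_fingerprint: str) -> bool:
    # Compare against the fingerprint recorded when the index was last built.
    return policy_fingerprint(policy_dir) != stored_fingerprint
```

Store the fingerprint alongside the index and check it at startup; a mismatch means retrieval would otherwise run against superseded policy text.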
A solid KYC agent is not just an LLM wrapped around PDFs. It is retrieval-grounded policy enforcement plus deterministic controls plus auditability. That combination is what makes it usable in banking without creating an operational or regulatory mess.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.