How to Build a KYC Verification Agent Using LlamaIndex in Python for Payments
A KYC verification agent for payments ingests customer identity documents, extracts structured fields, checks them against policy and external sources, and returns a decision with an audit trail. For payments teams, this matters because onboarding speed, fraud prevention, and regulatory compliance all depend on making the verification step deterministic, traceable, and easy to review.
Architecture
- Document ingestion layer
  - Accepts passports, national IDs, utility bills, bank statements, and business registration docs.
  - Normalizes PDFs, images, and OCR text into a consistent document format.
- Knowledge base
  - Stores KYC policy documents, country-specific rules, acceptable document lists, and escalation procedures.
  - Backed by VectorStoreIndex so the agent can retrieve policy context before making a decision.
- Extraction pipeline
  - Uses LLM-backed structured extraction to pull out name, DOB, address, ID number, expiry date, and entity type.
  - Produces machine-readable outputs for downstream checks.
- Verification logic
  - Compares extracted fields against policy constraints and reference data.
  - Flags mismatches, expired documents, missing pages, or unsupported jurisdictions.
- Decision engine
  - Returns approve, reject, or manual_review.
  - Attaches reasons and evidence snippets for auditability.
- Audit and observability
  - Logs inputs, retrieved policy chunks, model outputs, and final decisions.
  - Required for payments compliance reviews and dispute handling.
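The decision engine's output described above (an outcome plus reasons and evidence snippets) can be sketched as a small record type. This is a minimal sketch using only the standard library; the names Outcome, Evidence, and KYCDecision and their field layout are illustrative assumptions, not part of LlamaIndex.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


class Outcome(str, Enum):
    APPROVE = "approve"
    REJECT = "reject"
    MANUAL_REVIEW = "manual_review"


@dataclass
class Evidence:
    source: str   # hypothetical identifier for a policy file or customer document
    snippet: str  # the text that supports a reason


@dataclass
class KYCDecision:
    outcome: Outcome
    reasons: list[str] = field(default_factory=list)
    evidence: list[Evidence] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses; Outcome is a str subclass,
        # so json.dumps writes it as its plain string value.
        return json.dumps(asdict(self), sort_keys=True)


decision = KYCDecision(
    outcome=Outcome.MANUAL_REVIEW,
    reasons=["Missing expiry date"],
    evidence=[Evidence(source="passport_scan_page_1", snippet="Passport No: X1234567")],
)
print(decision.to_json())
```

Keeping reasons and evidence as separate lists lets a reviewer see which snippet backs which outcome without re-running the model.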
Implementation
1) Install dependencies and load your KYC policy corpus
You want your agent grounded in internal policy before it evaluates any document. In practice that means indexing your KYC handbook, jurisdiction rules, and payment risk playbooks with VectorStoreIndex.
# Install (assumed package set): pip install llama-index pydantic
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load internal KYC/policy docs
docs = SimpleDirectoryReader(
    input_dir="./kyc_policy_docs",
    recursive=True,
).load_data()

# Build an index over the policy corpus.
# With default settings this uses OpenAI embeddings, so OPENAI_API_KEY must be set.
index = VectorStoreIndex.from_documents(docs)

# Create a retriever for policy lookup
retriever = index.as_retriever(similarity_top_k=3)
2) Define the fields you need from a customer submission
For payments KYC you usually need more than just a name. You need identity attributes that support screening, sanctions checks, residency rules, and business verification.
from typing import Optional

from pydantic import BaseModel, Field


class KYCProfile(BaseModel):
    full_name: str = Field(..., description="Legal full name")
    date_of_birth: Optional[str] = Field(None, description="YYYY-MM-DD")
    document_type: str = Field(..., description="passport|national_id|utility_bill|business_registration")
    document_number: Optional[str] = None
    issuing_country: Optional[str] = None
    expiry_date: Optional[str] = None
    address: Optional[str] = None
    entity_type: str = Field(..., description="individual|business")
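Because the schema stores date_of_birth and expiry_date as strings, downstream checks need a way to turn them into real dates and to treat malformed values as missing. A minimal standard-library sketch follows; the helper name parse_iso_date is hypothetical, and it assumes the YYYY-MM-DD format named in the field description.

```python
import re
from datetime import date
from typing import Optional

# Only accept zero-padded YYYY-MM-DD so other date spellings are not silently guessed at
_ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def parse_iso_date(value: Optional[str]) -> Optional[date]:
    """Return a date for a YYYY-MM-DD string, or None if missing or malformed."""
    if value is None:
        return None
    text = value.strip()
    if not _ISO_DATE.fullmatch(text):
        return None
    try:
        return date.fromisoformat(text)
    except ValueError:
        # Right shape but not a real calendar date, e.g. 2030-02-30
        return None


print(parse_iso_date("2030-08-01"))  # 2030-08-01
print(parse_iso_date("01/08/2030"))  # None
```

Returning None instead of raising keeps the caller's logic simple: a value that cannot be parsed can be routed to manual review in the same way as a missing one.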
3) Use LlamaIndex structured prediction to extract KYC data
The pattern here is to combine retrieval with structured extraction: extract the fields into your schema first, then review them against retrieved policy context in the next step. The example below uses LLMTextCompletionProgram from LlamaIndex’s program module, which prompts the LLM and parses its output into the KYCProfile schema defined above.
from llama_index.core import Settings
from llama_index.core.program import LLMTextCompletionProgram
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
policy_query_engine = index.as_query_engine(similarity_top_k=3)


def extract_kyc_profile(doc_text: str) -> KYCProfile:
    prompt_template = """
    You are extracting KYC data for a payments onboarding flow.
    Return only fields present in the schema.
    Document text:
    {doc_text}
    """
    program = LLMTextCompletionProgram.from_defaults(
        output_cls=KYCProfile,
        llm=Settings.llm,
        prompt_template_str=prompt_template,
        verbose=False,
    )
    return program(doc_text=doc_text)


customer_doc_text = """
Passport No: X1234567
Name: Jane Doe
DOB: 1990-05-14
Country of issue: KE
Expiry: 2030-08-01
Address: 12 Riverside Drive, Nairobi
"""

profile = extract_kyc_profile(customer_doc_text)
print(profile.model_dump())
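LLM extraction can return values that do not appear in the submitted document. One cheap safeguard is to check that key extracted values occur verbatim in the source text before trusting them. The sketch below is an assumption about how such a check might look, using only the standard library; ungrounded_fields and its default field list are hypothetical names, and the sample dict stands in for profile.model_dump().

```python
import re


def _normalize(text: str) -> str:
    # Collapse whitespace and lowercase so spacing or case differences do not cause false alarms
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_fields(
    extracted: dict,
    doc_text: str,
    fields: tuple = ("full_name", "document_number", "expiry_date"),
) -> list:
    """Return names of extracted fields whose values do not appear verbatim in doc_text."""
    haystack = _normalize(doc_text)
    missing = []
    for name in fields:
        value = extracted.get(name)
        if value is None:
            continue  # absent values are left to separate mandatory-field checks
        if _normalize(str(value)) not in haystack:
            missing.append(name)
    return missing


doc_text = """
Passport No: X1234567
Name: Jane Doe
Expiry: 2030-08-01
"""
# Simulated extraction with one wrong digit in the document number
extracted = {"full_name": "Jane Doe", "document_number": "X1234568", "expiry_date": "2030-08-01"}
print(ungrounded_fields(extracted, doc_text))  # ['document_number']
```

A verbatim match is deliberately strict: if the model reformats a value (for example, rewriting a date into YYYY-MM-DD), the field is flagged, which is a reasonable trigger for manual review rather than automatic rejection.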
4) Add a decision function with policy retrieval
This is where the agent becomes useful for payments. It should not just extract data; it should explain whether the submission passes policy based on retrieved rules.
def decide_kyc(profile: KYCProfile):
    query = f"""
    Review this KYC profile against our payment onboarding policy:
    {profile.model_dump()}
    """
    response = policy_query_engine.query(query)
    rationale = str(response)

    if profile.expiry_date is None:
        return {"decision": "manual_review", "reason": "Missing expiry date", "policy": rationale}
    if profile.entity_type not in {"individual", "business"}:
        return {"decision": "reject", "reason": "Invalid entity type", "policy": rationale}
    return {"decision": "approve", "reason": "Meets baseline checks", "policy": rationale}


result = decide_kyc(profile)
print(result)
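The baseline checks in decide_kyc only cover a missing expiry date and an invalid entity type. Deterministic rules for expired documents, unsupported jurisdictions, and missing mandatory fields can run before any approval. The sketch below is one possible shape using only the standard library; hard_checks, SUPPORTED_COUNTRIES, and MANDATORY_FIELDS are hypothetical names, and the country list is a placeholder, not real policy.

```python
from datetime import date
from typing import Optional

# Placeholder allow-list (assumption); a real deployment would load this from policy configuration
SUPPORTED_COUNTRIES = {"KE", "GB", "US"}
MANDATORY_FIELDS = ("full_name", "document_type", "document_number", "issuing_country")


def hard_checks(profile: dict, today: Optional[date] = None) -> list:
    """Return failure reasons; an empty list means the hard checks passed."""
    today = today or date.today()
    reasons = []

    for name in MANDATORY_FIELDS:
        if not profile.get(name):
            reasons.append(f"Missing mandatory field: {name}")

    country = profile.get("issuing_country")
    if country and country.upper() not in SUPPORTED_COUNTRIES:
        reasons.append(f"Unsupported issuing country: {country}")

    expiry = profile.get("expiry_date")
    if expiry:
        try:
            if date.fromisoformat(expiry) < today:
                reasons.append(f"Document expired on {expiry}")
        except ValueError:
            reasons.append(f"Unparseable expiry date: {expiry}")

    return reasons


sample_profile = {
    "full_name": "Jane Doe",
    "document_type": "passport",
    "document_number": "X1234567",
    "issuing_country": "KE",
    "expiry_date": "2020-08-01",
}
print(hard_checks(sample_profile, today=date(2025, 1, 1)))  # ['Document expired on 2020-08-01']
```

Passing today as a parameter keeps the function testable. Whether a given failure maps to reject or manual_review is a policy decision that this sketch leaves to the caller.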
Production Considerations
- Keep PII inside your trust boundary
  - Encrypt documents at rest and in transit.
  - If you use hosted LLMs or vector stores, confirm data residency requirements match your payment regions.
- Log every decision path
  - Store the extracted fields, retrieved policy chunks, model version, prompt version, and final outcome.
  - Auditors will ask why a user was approved or rejected; “the model said so” is not enough.
- Add hard guardrails before approval
  - Reject obvious failures deterministically: expired IDs, unsupported countries, missing mandatory fields.
  - Use the LLM for extraction and explanation; use code for final enforcement.
- Monitor false positives by segment
  - Track approval/rejection rates by country, document type, and customer segment.
  - Payments onboarding often breaks when one region gets overblocked because the model underperforms on local ID formats.
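Segment-level monitoring can start as a simple aggregation over stored decision records. The sketch below computes a rejection rate per segment; rejection_rates and the record layout are assumptions for illustration. Note that a rejection rate is only a proxy: measuring true false positives requires reviewer-confirmed outcomes.

```python
from collections import Counter


def rejection_rates(decisions: list, segment_key: str = "issuing_country") -> dict:
    """Compute the share of rejected decisions per segment."""
    totals = Counter()
    rejected = Counter()
    for record in decisions:
        segment = record.get(segment_key, "unknown")
        totals[segment] += 1
        if record["decision"] == "reject":
            rejected[segment] += 1
    return {segment: rejected[segment] / totals[segment] for segment in totals}


# Illustrative records only, not real data
sample = [
    {"issuing_country": "KE", "decision": "approve"},
    {"issuing_country": "KE", "decision": "reject"},
    {"issuing_country": "GB", "decision": "approve"},
    {"issuing_country": "GB", "decision": "approve"},
]
print(rejection_rates(sample))  # {'KE': 0.5, 'GB': 0.0}
```

The same function works for other segments by changing segment_key, for example to "document_type".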
Common Pitfalls
- Using the LLM as the final authority
  - Don’t let free-form model output directly approve accounts.
  - Wrap it in deterministic checks so compliance rules stay enforceable.
- Skipping jurisdiction-specific policies
  - A passport acceptable in one market may be insufficient in another.
  - Index country-level policies separately and retrieve by jurisdiction before deciding.
- Not preserving audit evidence
  - If you only store the final answer, you lose the reasoning chain.
  - Persist retrieved context snippets and extracted fields so reviewers can reconstruct the decision later.
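Preserving audit evidence can be as simple as writing one self-describing record per decision. The following is a minimal standard-library sketch, not a compliance-grade design; build_audit_record and verify_audit_record are hypothetical helpers, and the sample policy snippet is invented for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the same content always hashes the same
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_audit_record(extracted: dict, policy_snippets: list, decision: dict,
                       model_version: str, prompt_version: str) -> dict:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "extracted_fields": extracted,
        "policy_snippets": policy_snippets,
        "decision": decision,
        "model_version": model_version,
        "prompt_version": prompt_version,
    }
    record["sha256"] = _digest(record)
    return record


def verify_audit_record(record: dict) -> bool:
    """Recompute the digest over everything except the stored hash and compare."""
    body = {key: value for key, value in record.items() if key != "sha256"}
    return _digest(body) == record.get("sha256")


record = build_audit_record(
    extracted={"full_name": "Jane Doe", "expiry_date": "2030-08-01"},
    policy_snippets=["Passports must be unexpired at onboarding."],  # invented sample text
    decision={"decision": "approve", "reason": "Meets baseline checks"},
    model_version="gpt-4o-mini",
    prompt_version="v1",
)
print(verify_audit_record(record))  # True
```

A hash stored next to the record only catches accidental corruption or truncation; tamper evidence would need the digest signed or stored in a separate system. In LlamaIndex, the retrieved chunks are typically available on the query response's source_nodes attribute, which is one place the policy_snippets list could come from. The record contains PII, so it belongs inside the same encrypted trust boundary as the documents.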
By Cyprian Aarons, AI Consultant at Topiax.