How to Build a KYC Verification Agent Using LlamaIndex in Python for Insurance
A KYC verification agent for insurance ingests identity documents, extracts customer details, checks them against policy rules and watchlists, and produces a traceable decision: approve, reject, or send to manual review. It matters because insurers need faster onboarding without losing control over compliance, auditability, and data handling.
Architecture
- Document ingestion layer
  - Accepts PDFs, scans, and image-based IDs from broker portals or internal ops tools.
  - Normalizes files before they hit the LLM pipeline.
- Extraction and reasoning layer
  - Uses LlamaIndex to turn unstructured documents into structured KYC fields.
  - Extracts name, DOB, address, ID number, expiry date, and document type.
- Policy rules engine
  - Applies insurer-specific checks like minimum age, country restrictions, expired ID rejection, and mandatory fields.
  - Keeps deterministic logic separate from LLM output.
- Evidence store
  - Persists source documents, extracted fields, model outputs, and decision traces.
  - Needed for audit, disputes, and regulator review.
- Human review queue
  - Routes ambiguous or low-confidence cases to an operations analyst.
  - Prevents bad auto-decisions when the document quality is poor.
- Audit and monitoring layer
  - Logs every prompt, response, retrieval source, and final outcome.
  - Tracks failure rates by document type and region.
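As a rough sketch of how these layers hand off to each other, here is a stdlib-only skeleton. The names (`KYCCase`, `run_pipeline`) and the stubbed extraction step are illustrative, not part of LlamaIndex; the real extraction call is wired in later in this article.

```python
from dataclasses import dataclass, field

@dataclass
class KYCCase:
    case_id: str
    documents: list                                  # normalized files from the ingestion layer
    fields: dict = field(default_factory=dict)       # output of the extraction layer
    decision: str = ""                               # APPROVE / REVIEW / REJECT
    trace: list = field(default_factory=list)        # entries for the evidence store

def run_pipeline(case: KYCCase) -> KYCCase:
    case.trace.append("ingested %d documents" % len(case.documents))
    # The extraction layer would call LlamaIndex here; stubbed for the sketch.
    case.fields = {"full_name": "A. Example", "date_of_birth": "1990-01-01"}
    # Deterministic policy rules decide the outcome, not the LLM.
    missing = [k for k in ("full_name", "date_of_birth") if not case.fields.get(k)]
    case.decision = "REVIEW" if missing else "APPROVE"
    case.trace.append("decision=%s" % case.decision)
    return case
```

The point of the skeleton is the separation: every layer writes to the trace, and the decision is computed from extracted fields, never taken verbatim from model output.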
Implementation
1) Install the right packages
You need LlamaIndex plus a parser for PDFs. For production KYC work, keep extraction deterministic where possible and use the LLM only where it adds value.
pip install llama-index llama-index-llms-openai pydantic pypdf
Set your OpenAI key if you are using OpenAI models through LlamaIndex:
export OPENAI_API_KEY="your-key"
2) Load the document and build a query engine
Use SimpleDirectoryReader to load uploaded files. Then create a VectorStoreIndex so the agent can retrieve relevant chunks from the KYC packet before extracting fields.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.openai import OpenAI
# Load KYC documents from a folder
documents = SimpleDirectoryReader("./kyc_docs").load_data()
# Build an index over the documents
index = VectorStoreIndex.from_documents(documents)
# Create an LLM instance
llm = OpenAI(model="gpt-4o-mini", temperature=0)
# Query engine for focused extraction
query_engine = index.as_query_engine(llm=llm)
This pattern works well when a customer uploads multiple files: passport scan, proof of address, and application form. The retriever brings back only the relevant text instead of dumping the whole packet into the prompt.
3) Extract structured KYC data with PydanticOutputParser
For insurance onboarding you want structured output, not free-form text. Use a Pydantic model plus PydanticOutputParser so your agent returns predictable fields that downstream systems can validate.
from typing import Optional
from pydantic import BaseModel, Field
from llama_index.core.output_parsers import PydanticOutputParser
class KYCResult(BaseModel):
    full_name: str = Field(description="Customer full name")
    date_of_birth: str = Field(description="Date of birth in YYYY-MM-DD")
    id_number: Optional[str] = Field(default=None, description="Government ID number")
    id_type: Optional[str] = Field(default=None, description="Type of ID document")
    address: Optional[str] = Field(default=None, description="Residential address")
    expiry_date: Optional[str] = Field(default=None, description="Document expiry date in YYYY-MM-DD")
    confidence: float = Field(description="Confidence score between 0 and 1")
    risk_flag: str = Field(description="APPROVE, REVIEW, or REJECT")

parser = PydanticOutputParser(output_cls=KYCResult)
prompt = f"""
Extract KYC details from the provided insurance onboarding documents.
Return valid JSON matching this schema:
{parser.format_string}
Rules:
- If any mandatory field is missing or unclear, set risk_flag to REVIEW.
- If the ID is expired or the applicant is underage for this product,
set risk_flag to REJECT.
- Keep confidence between 0 and 1.
"""
response = query_engine.query(prompt)
result = parser.parse(response.response)
print(result.model_dump())
That gives you a typed result you can feed into policy checks. In practice I also keep raw evidence alongside parsed output so compliance can trace every field back to source text.
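One minimal way to keep that evidence is to store each parsed result next to the retrieved chunks that produced it. The sketch below is stdlib-only; `build_evidence_record` is a hypothetical helper, and in a LlamaIndex pipeline the `source_chunks` tuples would be built from `response.source_nodes` (node ID, chunk text, retrieval score).

```python
def build_evidence_record(parsed: dict, source_chunks: list) -> dict:
    """Pair parsed KYC fields with the retrieved text that produced them.

    source_chunks is a list of (node_id, text, score) tuples; in a
    LlamaIndex pipeline you would populate it from response.source_nodes.
    """
    return {
        "fields": parsed,
        "evidence": [
            {"node_id": nid, "source_text": text, "retrieval_score": score}
            for nid, text, score in source_chunks
        ],
    }
```

Persist this record alongside the original file so a compliance reviewer can trace any field back to the exact chunk it came from.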
4) Apply insurance-specific rules before making a decision
Do not let the model make final compliance decisions on its own. Use deterministic rules after extraction so your underwriting or ops team can defend every outcome.
from datetime import datetime
def age_from_dob(dob_str: str) -> int:
    dob = datetime.strptime(dob_str, "%Y-%m-%d").date()
    today = datetime.utcnow().date()
    return (today.year - dob.year) - ((today.month, today.day) < (dob.month, dob.day))

def decide_kyc(kyc: KYCResult) -> str:
    if kyc.risk_flag == "REJECT":
        return "REJECT"
    if not kyc.full_name or not kyc.date_of_birth:
        return "REVIEW"
    if kyc.expiry_date:
        expiry = datetime.strptime(kyc.expiry_date, "%Y-%m-%d").date()
        if expiry < datetime.utcnow().date():
            return "REJECT"
    if age_from_dob(kyc.date_of_birth) < 18:
        return "REJECT"
    if kyc.confidence < 0.85:
        return "REVIEW"
    return "APPROVE"
decision = decide_kyc(result)
print({"decision": decision})
This separation is important in insurance because product eligibility often depends on age bands, residency rules, sanctions exposure, and local regulatory requirements. The LLM extracts; your code decides.
Production Considerations
- Keep data residency explicit
  - Store documents in-region if your insurance business operates under local residency constraints.
  - Make sure your vector store and logs do not replicate personal data across jurisdictions without approval.
- Log for auditability
  - Persist input document hashes, extracted fields, model version, prompt version, retrieval sources (node IDs), and final decision.
  - Regulators will ask why a customer was rejected; "the model said so" is not enough.
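A simple way to capture all of that is one append-only JSON line per decision. The schema and the `audit_entry` helper below are illustrative, not from any library:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_entry(doc_bytes: bytes, fields: dict, decision: str,
                model_version: str, prompt_version: str,
                node_ids: list) -> str:
    """Serialize one audit-log line for a KYC decision (hypothetical schema)."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "doc_sha256": hashlib.sha256(doc_bytes).hexdigest(),  # hash, not raw PII
        "fields": fields,
        "decision": decision,
        "model_version": model_version,
        "prompt_version": prompt_version,
        "retrieval_node_ids": node_ids,
    }
    return json.dumps(entry, sort_keys=True)
```

Hashing the source document instead of embedding it keeps the log itself low-risk while still letting you prove which file produced which decision.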
- Add human-in-the-loop thresholds
  - Route low-confidence cases to manual review instead of forcing an automated outcome.
  - This is especially important for blurry scans, non-Latin scripts your pipeline does not support, or inconsistent addresses across documents.
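For the inconsistent-address case, a fuzzy cross-document check is often enough to trigger review. This is a sketch; the 0.8 similarity threshold and the helper names are illustrative and should be tuned on your own data:

```python
from difflib import SequenceMatcher

def addresses_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy-compare two addresses after basic normalization."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio() >= threshold

def route_on_mismatch(addr_from_id: str, addr_from_proof: str) -> str:
    """Send the case to manual review when the documents disagree."""
    return "AUTO" if addresses_match(addr_from_id, addr_from_proof) else "REVIEW"
```

`SequenceMatcher` is a crude similarity measure; for production you would likely normalize abbreviations (St/Street) or use a dedicated address parser, but the routing logic stays the same.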
- Put guardrails around prompt injection
  - Treat uploaded docs as untrusted input.
  - Strip instructions embedded in PDFs like "ignore previous instructions" before sending content to the model.
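A minimal sanitizer can drop lines that match known injection phrases before they reach the model. The denylist below is illustrative; real deployments maintain these patterns against attacks they actually observe:

```python
import re

# Illustrative denylist of injection phrases; extend based on observed attacks.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"disregard (the )?(above|prior) (rules|instructions)",
    r"you are now",
]

def sanitize_chunk(text: str) -> str:
    """Remove lines matching known injection phrases from untrusted document text."""
    clean_lines = []
    for line in text.splitlines():
        if any(re.search(p, line, flags=re.IGNORECASE) for p in INJECTION_PATTERNS):
            continue
        clean_lines.append(line)
    return "\n".join(clean_lines)
```

Pattern filtering is a first line of defense, not a complete one; keep the deterministic rules engine as the final decision-maker so an injected instruction can never approve a case on its own.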
Common Pitfalls
- Using the LLM as the policy engine
  - Mistake: asking the model to decide approval directly.
  - Fix: extract with LlamaIndex first, then apply deterministic business rules in Python.
- Skipping structured output validation
  - Mistake: parsing raw text responses with string matching.
  - Fix: use PydanticOutputParser and validate required fields before downstream processing.
- Ignoring evidence retention
  - Mistake: storing only the final decision.
  - Fix: save source document references, extracted spans if available, prompt/version metadata, and reviewer overrides for full audit trails.
- Treating all cases as automatable
  - Mistake: auto-approving low-quality scans or incomplete records.
  - Fix: define clear REVIEW thresholds based on confidence, missing fields, expired IDs, mismatch signals, and jurisdiction-specific rules.
By Cyprian Aarons, AI Consultant at Topiax.