How to Build a KYC Verification Agent Using LangChain in Python for Pension Funds
A KYC verification agent for pension funds automates the intake, validation, and escalation of member identity checks. It matters because pension administrators handle regulated customer data, need strong audit trails, and cannot afford inconsistent decisions on onboarding, transfers, or benefit payments.
Architecture
- Document ingestion layer
  - Accepts PDFs, scans, and structured forms from members or administrators.
  - Extracts text and metadata before any reasoning happens.
- KYC policy retrieval
  - Pulls pension-fund-specific rules from a controlled knowledge base.
  - Includes AML thresholds, acceptable ID types, address proof rules, and escalation criteria.
- LangChain agent
  - Uses an LLM to classify documents, identify missing fields, and decide whether a case is complete.
  - Must be constrained to tool calls and policy-grounded outputs.
- Validation tools
  - Verify document completeness, check expiry dates, and compare names across records.
  - Can integrate with external identity providers or internal member systems.
- Audit logging
  - Stores every input, tool call, retrieved policy snippet, and final decision.
  - Required for compliance reviews and dispute resolution.
- Human review queue
  - Escalates uncertain cases to operations or compliance staff.
  - Prevents automated approval when evidence is weak or conflicting.
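Before wiring in LangChain, the layers above can be sketched as a typed pipeline. This is an illustrative stdlib-only skeleton, not a LangChain API; the stage names and the `KYCCase` fields are assumptions chosen to mirror the architecture list:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class KYCCase:
    """Carries one member case through every pipeline stage."""
    member_id: str
    documents_text: str = ""
    policy_context: str = ""
    hard_flags: list[str] = field(default_factory=list)
    decision: str = "pending"
    audit_trail: list[str] = field(default_factory=list)

def run_pipeline(case: KYCCase, stages: list[Callable[[KYCCase], KYCCase]]) -> KYCCase:
    """Apply stages in order, recording each one for the audit trail."""
    for stage in stages:
        case = stage(case)
        case.audit_trail.append(stage.__name__)
    return case

# Stub stages standing in for ingestion, policy retrieval, validation, and decisioning.
def ingest(case: KYCCase) -> KYCCase:
    case.documents_text = "passport.pdf, utility_bill.pdf"
    return case

def retrieve_policy(case: KYCCase) -> KYCCase:
    case.policy_context = "Acceptable ID: passport or national ID."
    return case

def validate(case: KYCCase) -> KYCCase:
    case.hard_flags = []  # deterministic checks would populate this
    return case

def decide(case: KYCCase) -> KYCCase:
    case.decision = "approved" if not case.hard_flags else "needs_review"
    return case

case = run_pipeline(KYCCase(member_id="M-001"), [ingest, retrieve_policy, validate, decide])
```

The audit trail grows with every stage, which is the same property the real implementation below needs: each decision must be reconstructable step by step.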
Implementation
1) Install dependencies and define the policy model
Use LangChain with a chat model, structured output, and a retrieval layer for policy text. For production, keep your pension fund rules in a versioned store so every decision can be traced back to the exact policy revision.
```bash
pip install langchain langchain-openai langchain-community pydantic faiss-cpu
```
```python
from typing import Literal

from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


class KYCResult(BaseModel):
    status: Literal["approved", "needs_review", "rejected"] = Field(...)
    missing_items: list[str] = Field(default_factory=list)
    risk_flags: list[str] = Field(default_factory=list)
    rationale: str = Field(...)


llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm = llm.with_structured_output(KYCResult)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a KYC analyst for a pension fund. Follow policy exactly."),
    ("user", """
Member data:
{name}

Documents:
{documents}

Policy context:
{policy}

Hard-rule flags from deterministic checks:
{hard_flags}

Return only the structured result.
"""),
])
```
2) Build a retrieval chain for pension fund policy
This example uses FAISS with LangChain's current runnable primitives rather than the legacy RetrievalQA chain. The important part is that the agent does not invent policy; it retrieves the exact rule text first.
```python
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

policy_docs = [
    Document(page_content="Acceptable ID: passport or national ID. Must be valid and unexpired."),
    Document(page_content="Proof of address must be issued within the last 3 months."),
    Document(page_content="If name mismatch exists across documents, escalate to manual review."),
]

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(policy_docs, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})


def format_docs(docs):
    return "\n".join(doc.page_content for doc in docs)


# Maps a query string to concatenated policy text.
policy_chain = retriever | format_docs
```
3) Add a validation tool and an agentic decision flow
For pension funds, keep deterministic checks outside the LLM. Use Python for expiry logic, document presence checks, and cross-field comparisons. Let the model interpret ambiguous cases; let code enforce hard rules.
```python
from datetime import datetime, timezone


def validate_documents(documents: dict) -> dict:
    flags = []
    if "id_number" not in documents:
        flags.append("missing_id_number")
    if documents.get("id_expiry"):
        # Treat the ISO timestamp as UTC so the comparison is timezone-aware.
        expiry = datetime.fromisoformat(documents["id_expiry"]).replace(tzinfo=timezone.utc)
        if expiry < datetime.now(timezone.utc):
            flags.append("expired_id")
    if documents.get("address_proof_age_days", 999) > 90:
        flags.append("address_proof_outdated")
    return {"flags": flags}


def build_kyc_input(member: dict) -> dict:
    docs_check = validate_documents(member["documents"])
    return {
        "name": member["name"],
        "documents": member["documents_text"],
        "hard_flags": docs_check["flags"],
    }


member = {
    "name": "Jane Moyo",
    "documents_text": "Passport attached. Utility bill attached.",
    "documents": {
        "id_number": "P1234567",
        "id_expiry": "2027-08-01T00:00:00",
        "address_proof_age_days": 45,
    },
}
```
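The name cross-check mentioned in the policy ("if name mismatch exists across documents, escalate") also belongs in deterministic code rather than the model. A minimal stdlib sketch; the 0.85 threshold is an assumption to tune against your own data:

```python
from difflib import SequenceMatcher

def names_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Normalize case and whitespace, then compare with a similarity ratio.
    A result of False means the names are not clearly the same person and
    the case should be escalated to manual review."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio() >= threshold

print(names_match("Jane Moyo", "JANE  MOYO"))  # formatting differences only
print(names_match("Jane Moyo", "John Moyo"))   # different first name
```

The helper's output can feed the same `flags` list that `validate_documents` produces, so the LLM only ever sees the verdict, never makes it.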
4) Run the chain and persist the decision for audit
This is the core pattern: retrieve policy context, apply deterministic checks first, then ask the LLM to classify using structured output. Store both inputs and outputs so compliance can reconstruct the decision later.
```python
import json
from datetime import datetime, timezone


def kyc_decide(member: dict):
    hard_checks = validate_documents(member["documents"])
    # Query the policy store with the case evidence, not the member's name.
    retrieved_policy = retriever.invoke(member["documents_text"])
    policy_text = format_docs(retrieved_policy)
    input_payload = {
        "name": member["name"],
        "documents": member["documents_text"],
        "policy": policy_text,
        "hard_flags": ", ".join(hard_checks["flags"]) or "none",
    }
    chain = prompt | structured_llm
    result = chain.invoke(input_payload)
    audit_record = {
        "member_name": member["name"],
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "policy_used": policy_text,
        "hard_flags": hard_checks["flags"],
        "decision": result.model_dump(),
    }
    print(json.dumps(audit_record, indent=2))
    return result


decision = kyc_decide(member)
print(decision.status)
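To make the persisted audit trail tamper-evident, one option is an append-only JSONL log where each record carries a hash of the previous one. A minimal stdlib sketch; the file path and field names are illustrative, not part of any LangChain API:

```python
import hashlib
import json

def append_audit_record(path: str, record: dict, prev_hash: str = "0" * 64) -> str:
    """Append a record whose hash covers both its content and the previous
    record's hash, so any later edit to an earlier entry breaks the chain."""
    payload = {"prev_hash": prev_hash, **record}
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    payload["hash"] = digest
    with open(path, "a") as f:
        f.write(json.dumps(payload, sort_keys=True) + "\n")
    return digest

h1 = append_audit_record("audit.jsonl", {"member": "Jane Moyo", "decision": "approved"})
h2 = append_audit_record("audit.jsonl", {"member": "Sam Dube", "decision": "needs_review"}, prev_hash=h1)
```

A compliance reviewer can re-hash each line and confirm the chain is intact before trusting the log.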
Production Considerations
- Data residency
  - Keep member PII in-region if your pension fund operates under local residency requirements.
  - If you use hosted LLMs, verify where prompts and embeddings are processed and stored.
- Auditability
  - Log retrieved policy chunks, model version, prompt version, tool outputs, and final status.
  - Treat each KYC decision as a regulated event with immutable records.
- Guardrails
  - Never let the model approve cases without hard-rule checks passing first.
  - Use `temperature=0`, structured output via `with_structured_output`, and explicit escalation thresholds.
- Monitoring
  - Track approval rate by document type, manual-review rate, false reject rate, and retrieval quality.
  - Alert when the agent starts over-escalating or approving borderline cases too often.
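The monitoring rates above can be computed directly from persisted audit records. A hedged sketch, assuming each record carries the final `status` from the structured output:

```python
from collections import Counter

def kyc_metrics(records: list[dict]) -> dict:
    """Summarize decision outcomes so drift (over-escalation, over-approval)
    shows up as a shifting ratio rather than an anecdote."""
    counts = Counter(r["status"] for r in records)
    total = len(records) or 1  # avoid division by zero on an empty window
    return {
        "approval_rate": counts["approved"] / total,
        "manual_review_rate": counts["needs_review"] / total,
        "rejection_rate": counts["rejected"] / total,
    }

records = [{"status": "approved"}] * 8 + [{"status": "needs_review"}] * 2
metrics = kyc_metrics(records)
```

Computing these per document type (passport vs. national ID, for example) usually surfaces retrieval or prompt regressions faster than a single aggregate rate.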
Common Pitfalls
- Letting the LLM make compliance decisions alone
  - Avoid this by enforcing deterministic checks in Python before any model call.
  - The model should classify evidence quality, not override policy.
- Using stale or generic policies
  - Pension funds have fund-specific onboarding rules that change with regulation.
  - Version your policy documents and retrieve only from approved sources.
- Ignoring audit trail requirements
  - If you cannot reproduce why a case was approved or escalated, the system is not production-ready.
  - Persist prompts, retrieved context, outputs, timestamps, and operator overrides.
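Versioning policy documents can be as simple as stamping the set of rules in force with a content hash that gets logged alongside every decision. A minimal stdlib sketch; the 12-character prefix length is an arbitrary choice:

```python
import hashlib

def policy_revision_id(policy_texts: list[str]) -> str:
    """Derive a stable revision ID from the exact policy text in force.
    Sorting makes the ID independent of document order, so the same rule
    set always maps to the same revision."""
    h = hashlib.sha256()
    for text in sorted(policy_texts):
        h.update(text.encode("utf-8"))
    return h.hexdigest()[:12]

rev = policy_revision_id([
    "Acceptable ID: passport or national ID. Must be valid and unexpired.",
    "Proof of address must be issued within the last 3 months.",
])
```

Store this ID in each audit record and any later dispute can be resolved against the exact policy revision the agent saw.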
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.