How to Build a KYC Verification Agent Using LlamaIndex in Python for Investment Banking

By Cyprian Aarons · Updated 2026-04-21
kyc-verification · llamaindex · python · investment-banking

A KYC verification agent checks customer identity documents, screens them against policy and regulatory rules, and produces an auditable decision trail for onboarding teams. In investment banking, that matters because bad KYC creates regulatory exposure, delayed account opening, and expensive manual reviews.

Architecture

Build this agent as a small workflow with clear boundaries:

  • Document ingestion layer

    • Accept PDFs, scanned IDs, proof of address, corporate registries, and supporting files.
    • Normalize them into text using SimpleDirectoryReader or your own OCR pipeline before indexing.
  • Knowledge index

    • Store KYC policy manuals, jurisdiction rules, escalation playbooks, and product-specific onboarding requirements.
    • Use VectorStoreIndex so the agent can retrieve the exact policy passages behind every decision.
  • Retrieval + reasoning layer

    • Use a query engine to pull relevant policy context for each case.
    • Keep reasoning constrained to retrieved evidence and structured customer data.
  • Decision output layer

    • Return a strict JSON object with fields like status, missing_docs, risk_flags, and rationale.
    • This makes downstream case management and audit logging predictable.
  • Audit and trace layer

    • Persist prompts, retrieved chunks, model outputs, and final decisions.
    • Investment banking teams need this for compliance review and internal audit.
  • Human escalation path

    • Route ambiguous cases to an analyst when confidence is low or when sanctions/PEP signals appear.
    • Do not let the agent auto-approve edge cases.
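The decision output and audit layers can be sketched as a single record per case. The field names below are illustrative assumptions, not a fixed schema:

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class KYCDecisionRecord:
    """Illustrative shape for one auditable KYC decision."""
    case_id: str
    status: str                     # APPROVE, REJECT, or ESCALATE
    missing_docs: list[str]
    risk_flags: list[str]
    rationale: str
    retrieved_chunk_ids: list[str]  # policy passages behind the decision
    model_version: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = KYCDecisionRecord(
    case_id="CASE-001",
    status="ESCALATE",
    missing_docs=["proof_of_address"],
    risk_flags=["possible_pep_match"],
    rationale="PEP screening ambiguous; routing to analyst.",
    retrieved_chunk_ids=["policy-uk-corp-003"],
    model_version="gpt-4o-mini",
)
print(json.dumps(asdict(record), indent=2))
```

Because every field serializes to plain JSON, the same record can feed both the case-management system and the audit log.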

Implementation

1) Load KYC policy documents into a LlamaIndex index

Start with your internal KYC policies, onboarding checklists, and jurisdiction-specific rules. In production you would likely extract text from PDFs first; here we use plain text files to keep the pattern concrete.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

# Configure the LLM used by the query engine
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)

# Load policy docs from a directory
documents = SimpleDirectoryReader("./kyc_policies").load_data()

# Build the index over policy material
index = VectorStoreIndex.from_documents(documents)

# Create a retriever-backed query engine
query_engine = index.as_query_engine(similarity_top_k=3)

This gives you retrieval over the bank’s actual onboarding policies. The important part is that the model answers from policy text, not memory.

2) Define a structured KYC assessment prompt

You want deterministic output that downstream systems can validate. For investment banking, JSON beats free-form prose every time.

from pydantic import BaseModel, Field
from typing import List

class KYCResult(BaseModel):
    status: str = Field(description="APPROVE, REJECT, or ESCALATE")
    missing_docs: List[str]
    risk_flags: List[str]
    rationale: str

def build_kyc_prompt(customer_name: str, country: str, docs_present: list[str]) -> str:
    return f"""
Assess this investment banking KYC case using only the retrieved policy context.

Customer name: {customer_name}
Country: {country}
Documents present: {docs_present}

Return a concise decision with:
- status
- missing_docs
- risk_flags
- rationale

If sanctions/PEP/UBO ambiguity exists or required docs are missing, prefer ESCALATE.
"""

The model should not invent requirements. The prompt forces it to stay inside the retrieved policy context and to escalate uncertain cases.

3) Run retrieval and produce an auditable decision

Use query_engine.query() to pull relevant policy snippets before generating the final assessment. Then validate the response shape before saving it to your case system.

import json

def assess_kyc_case(customer_name: str, country: str, docs_present: list[str]) -> dict:
    prompt = build_kyc_prompt(customer_name, country, docs_present)
    response = query_engine.query(prompt)

    raw_text = str(response)

    # In production you would use structured output parsing here.
    # Keep this as a strict contract with your downstream workflow.
    result = {
        "customer_name": customer_name,
        "country": country,
        "docs_present": docs_present,
        "retrieved_context": [
            {"node_id": sn.node.node_id, "text": sn.node.get_content()}
            for sn in response.source_nodes
        ],
        "decision_text": raw_text,
    }
    return result

case = assess_kyc_case(
    customer_name="Acme Capital Ltd",
    country="GB",
    docs_present=["certificate_of_incorporation", "passport", "utility_bill"]
)

print(json.dumps(case, indent=2))

That pattern gives you retrieval evidence plus a decision artifact. For regulated onboarding flows, storing source_nodes is non-negotiable because it shows which policy passages influenced the outcome.
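One way to make that storage tamper-evident is an append-only JSONL log where each entry chains to the previous entry's hash. The file name and hashing scheme below are assumptions for illustration, not a compliance standard:

```python
import hashlib
import json
from pathlib import Path

def append_audit_record(record: dict, log_path: str = "kyc_audit.jsonl") -> str:
    """Append a KYC decision record to a JSONL log, chaining each entry
    to the previous entry's hash so later edits are detectable."""
    path = Path(log_path)
    prev_hash = "genesis"
    if path.exists():
        lines = path.read_text().strip().splitlines()
        if lines:
            prev_hash = json.loads(lines[-1])["entry_hash"]
    entry = {"record": record, "prev_hash": prev_hash}
    # Hash the entry contents (record + previous hash) before storing.
    entry["entry_hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    with path.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry["entry_hash"]
```

Recomputing the chain during a compliance review reveals any entry that was altered after the fact; a real deployment would likely use write-once storage on top of this.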

4) Add stricter structured parsing for production use

For real deployment, wrap the model output in a schema validator. LlamaIndex offers structured-output helpers (for example, its Pydantic-based program abstractions); if you need tighter control, validate against a Pydantic model after generation and reject malformed results.

from pydantic import ValidationError

def validate_result(payload: dict) -> KYCResult:
    try:
        return KYCResult(**payload)
    except ValidationError as e:
        raise ValueError(f"Invalid KYC result schema: {e}") from e

# Example downstream usage
structured_payload = {
    "status": "ESCALATE",
    "missing_docs": ["proof_of_address"],
    "risk_flags": ["corporate_structure_requires_ubo_review"],
    "rationale": "Policy requires proof of address and UBO verification for corporate accounts."
}

validated = validate_result(structured_payload)
print(validated.model_dump())

This is where most teams get serious about reliability. If the agent cannot emit valid structured output, it should fail closed and send the case to operations.
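A fail-closed wrapper might look like the sketch below. It reuses the KYCResult schema from step 2 and converts any validation failure into an ESCALATE decision; the fallback field values are assumptions, not a prescribed convention:

```python
from typing import List
from pydantic import BaseModel, Field, ValidationError

class KYCResult(BaseModel):  # as defined earlier in the article
    status: str = Field(description="APPROVE, REJECT, or ESCALATE")
    missing_docs: List[str]
    risk_flags: List[str]
    rationale: str

def safe_kyc_result(payload: dict) -> KYCResult:
    """Fail closed: any schema violation becomes an ESCALATE decision
    instead of a silent auto-approval or an unhandled exception."""
    try:
        return KYCResult(**payload)
    except ValidationError:
        return KYCResult(
            status="ESCALATE",
            missing_docs=[],
            risk_flags=["invalid_model_output"],
            rationale="Model output failed schema validation; routed to operations.",
        )

bad = safe_kyc_result({"status": "APPROVE"})  # missing required fields
print(bad.status)  # ESCALATE
```

The key design choice is that a malformed payload can never produce an APPROVE: the worst case is always a human review.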

Production Considerations

  • Auditability

    • Log every prompt, retrieved chunk ID, model version, final decision, and analyst override.
    • Keep immutable records for compliance review and regulator requests.
  • Data residency

    • Host embeddings, vector stores, and LLM endpoints in approved regions only.
    • Investment banking clients often require strict controls around cross-border data transfer.
  • Guardrails

    • Hard-block auto-approval when sanctions screening is incomplete or UBO ownership is unclear.
    • Add rule-based checks outside the LLM for mandatory documents and jurisdiction-specific thresholds.
  • Monitoring

    • Track escalation rate, false approvals caught by analysts, retrieval quality, and latency per case.
    • Alert on drift when document types or geographies change materially.
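The rule-based checks mentioned under Guardrails can run entirely outside the LLM. The per-jurisdiction document lists below are made-up placeholders; real requirements come from your compliance team:

```python
# Illustrative mandatory-document rules per jurisdiction (placeholders only).
MANDATORY_DOCS = {
    "GB": {"certificate_of_incorporation", "proof_of_address", "ubo_declaration"},
    "US": {"certificate_of_incorporation", "w9", "ubo_declaration"},
}

def missing_mandatory_docs(country: str, docs_present: list[str]) -> set[str]:
    """Pure rule check, independent of the LLM: which required documents
    are absent for this jurisdiction? Unknown jurisdictions fail closed."""
    required = MANDATORY_DOCS.get(country)
    if required is None:
        return {"unknown_jurisdiction_manual_review"}
    return required - set(docs_present)

print(missing_mandatory_docs("GB", ["certificate_of_incorporation", "proof_of_address"]))
```

Running this check before the LLM ever sees the case means a missing mandatory document blocks auto-approval deterministically, with no prompt engineering involved.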

Common Pitfalls

  1. Letting the model decide without policy retrieval

    • Fix this by forcing every answer through VectorStoreIndex retrieval from approved KYC documents.
    • No retrieved evidence means no decision.
  2. Using free-form output in downstream systems

    • Free text breaks automation and audit trails.
    • Require a strict schema like KYCResult and reject malformed responses.
  3. Ignoring jurisdiction differences

    • UK private banks do not onboard exactly like US broker-dealers or APAC wealth desks.
    • Split policies by region and product line before indexing them.
  4. Treating escalation as failure

    • Escalation is the correct outcome for ambiguous ownership chains or missing identity proof.
    • In investment banking KYC, conservative routing protects both compliance and client trust.
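Splitting policies by region and product line (pitfall 3) can start with nothing more than a directory convention; the layout below is an assumption for illustration:

```python
from pathlib import Path

# Assumed layout: ./kyc_policies/<region>/<product_line>/
def policy_dir_for(region: str, product_line: str,
                   root: str = "./kyc_policies") -> Path:
    """Resolve the policy directory for a case so each region and
    product line gets its own index rather than one global one."""
    return Path(root) / region.lower() / product_line.lower()

print(policy_dir_for("UK", "private_banking"))
```

Each resolved directory can then be passed to SimpleDirectoryReader to build a separate VectorStoreIndex per region and product line, so a UK private-banking case never retrieves US broker-dealer rules.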

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.
