How to Build a Document Extraction Agent Using LlamaIndex in Python for Retail Banking

By Cyprian Aarons · Updated 2026-04-21
document-extraction · llamaindex · python · retail-banking

A document extraction agent in retail banking takes messy inputs — PDF statements, scanned IDs, KYC forms, and application packs — and turns them into structured fields your downstream systems can trust. It matters because manual extraction is slow, expensive, and error-prone, and in banking those errors become compliance issues, bad decisions, and audit problems.

Architecture

  • Document ingestion layer

    • Pulls PDFs, images, or text files from S3, SharePoint, SFTP, or internal case management systems.
    • Normalizes file types before extraction.
  • Parsing and OCR layer

    • Uses SimpleDirectoryReader for text-based docs.
    • Uses OCR upstream for scanned statements and IDs before handing text to LlamaIndex.
  • Extraction schema

    • Defines the fields you care about: customer name, account number, address, income, employer, document type, issue date.
    • Keeps output stable for downstream validation.
  • LLM-powered extractor

    • Uses LlamaIndex SummaryIndex or direct query patterns with a structured prompt.
    • Produces JSON-like outputs that map to your schema.
  • Validation and policy layer

    • Applies regex checks, checksum rules, date validation, and business rules.
    • Blocks unsafe outputs and flags low-confidence extractions for review.
  • Audit and storage layer

    • Stores source document references, extracted fields, model version, prompt version, and timestamp.
    • Supports audit trails and data residency requirements.
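The audit and storage layer above can be sketched as a small record type. The field names here are illustrative, not a fixed standard — align them with whatever your governance team already logs:

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    """One row per extraction attempt; field names are illustrative."""
    document_id: str
    source_sha256: str       # hash of the raw source file bytes
    model_name: str
    prompt_version: str
    response_sha256: str     # hash of the raw model response
    validation_outcome: str  # "accepted" | "rejected" | "manual_review"
    extracted_at: str        # ISO-8601 UTC timestamp

def build_audit_record(document_id: str, source_bytes: bytes,
                       model_name: str, prompt_version: str,
                       response_text: str, outcome: str) -> AuditRecord:
    return AuditRecord(
        document_id=document_id,
        source_sha256=hashlib.sha256(source_bytes).hexdigest(),
        model_name=model_name,
        prompt_version=prompt_version,
        response_sha256=hashlib.sha256(response_text.encode()).hexdigest(),
        validation_outcome=outcome,
        extracted_at=datetime.now(timezone.utc).isoformat(),
    )

record = build_audit_record("doc-001", b"%PDF-1.7 ...", "gpt-4o-mini",
                            "extract-v3", '{"customer_name": null}', "rejected")
print(json.dumps(asdict(record), indent=2))
```

Hashing both the source file and the response means you can later prove exactly which bytes produced which decision, without storing sensitive payloads in the audit table itself.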

Implementation

1) Install the right packages

You need LlamaIndex plus a local or hosted LLM backend. For banking workflows I prefer explicit control over the model provider and deterministic settings where possible.

pip install llama-index llama-index-llms-openai pydantic

If your bank requires regional deployment or private networking, swap OpenAI for a hosted model inside your approved environment. The LlamaIndex pattern stays the same.

2) Load documents from a controlled directory

For retail banking intake, start with a folder of already-approved documents. In production this folder is usually a staging area fed by an ingestion service that handles malware scanning and OCR.

from llama_index.core import SimpleDirectoryReader

docs = SimpleDirectoryReader(
    input_dir="./bank_docs",
    recursive=True,
).load_data()

print(f"Loaded {len(docs)} documents")

SimpleDirectoryReader gives you Document objects that LlamaIndex can index or query. If you are processing scanned PDFs, run OCR first and store the extracted text as .txt or preprocessed .md files.
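If the staging folder can contain mixed file types, a small pre-filter keeps unsupported formats out of the reader. This is a sketch; the extension whitelist is an assumption you should align with your own ingestion policy:

```python
from pathlib import Path

# Extensions we hand to the reader directly; scanned formats
# (e.g. .tiff, .png) should go through OCR first and land here as .txt.
TEXT_READY = {".pdf", ".txt", ".md"}

def partition_staging(input_dir: str) -> tuple[list[Path], list[Path]]:
    """Split staged files into reader-ready docs and files needing OCR/triage."""
    ready, needs_ocr = [], []
    for path in sorted(Path(input_dir).rglob("*")):
        if not path.is_file():
            continue
        (ready if path.suffix.lower() in TEXT_READY else needs_ocr).append(path)
    return ready, needs_ocr
```

You can then pass the ready list to SimpleDirectoryReader via its `input_files` argument instead of `input_dir`, and route the remainder to your OCR service.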

3) Build an extraction index and query it with a strict schema

For document extraction agents I like to keep the schema explicit in the prompt. You can use SummaryIndex when the task is to extract structured facts from one document at a time.

import json
from pydantic import BaseModel, Field
from llama_index.core import SummaryIndex
from llama_index.llms.openai import OpenAI

class RetailBankDocFields(BaseModel):
    customer_name: str = Field(description="Full legal name of the customer")
    account_number: str = Field(description="Primary account number if present")
    document_type: str = Field(description="Type of document such as bank_statement or payslip")
    issue_date: str = Field(description="Document issue date in ISO format if available")
    address: str | None = Field(default=None, description="Customer address if present")

llm = OpenAI(model="gpt-4o-mini", temperature=0)

index = SummaryIndex.from_documents(docs)
query_engine = index.as_query_engine(llm=llm)

prompt = """
Extract the requested fields from this retail banking document.
Return only valid JSON matching this schema:
{
  "customer_name": "...",
  "account_number": "...",
  "document_type": "...",
  "issue_date": "...",
  "address": "..."
}

Rules:
- If a field is missing, return null.
- Do not guess.
- Use ISO date format when possible.
- Prefer values explicitly stated in the document.
"""

response = query_engine.query(prompt)
raw_text = str(response)

print(raw_text)

This pattern is simple but effective when you need traceable extraction from a single source document. In production you would usually process one file at a time rather than mixing multiple cases into one index.
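In practice, models sometimes wrap the JSON in markdown fences even when told not to. A small pre-parse step makes the validation in step 4 less brittle; this helper is a convenience I'm adding, not part of LlamaIndex:

```python
import json
import re

def extract_json_payload(raw: str) -> dict:
    """Strip optional ```json fences and parse the remaining text as JSON."""
    text = raw.strip()
    fenced = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    return json.loads(text)  # raises json.JSONDecodeError on malformed output

print(extract_json_payload('```json\n{"customer_name": "A. Customer"}\n```'))
```

Anything that still fails to parse after this should be rejected, not patched further — silent repair of model output is exactly what the validation layer exists to prevent.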

4) Validate the output before it reaches core banking systems

Never trust model output directly. Parse it into your Pydantic schema and reject anything that fails validation or business rules.

from pydantic import ValidationError

def validate_extraction(payload: str) -> RetailBankDocFields:
    data = json.loads(payload)
    return RetailBankDocFields.model_validate(data)

try:
    extracted = validate_extraction(raw_text)
    print(extracted.model_dump())
except (json.JSONDecodeError, ValidationError) as e:
    print("Extraction rejected:", e)

Add domain checks on top:

  • Account number length and format
  • Date not in the future
  • Customer name must match application record within tolerance
  • Address must be non-empty for KYC cases

That separation matters. The model extracts; your policy engine decides whether the result is acceptable.
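The domain checks above can be expressed as plain predicates on the validated fields. In this sketch the account-number format and the name-match threshold are placeholders, not real banking rules — substitute your institution's actual formats and tolerances:

```python
import re
from datetime import date
from difflib import SequenceMatcher

ACCOUNT_RE = re.compile(r"^\d{8,12}$")  # placeholder format, not a real bank rule

def policy_errors(customer_name: str, account_number: str,
                  issue_date: str, application_name: str) -> list[str]:
    """Return business-rule violations; an empty list means the extraction passes."""
    errors = []
    if not ACCOUNT_RE.fullmatch(account_number):
        errors.append("account_number format invalid")
    if date.fromisoformat(issue_date) > date.today():
        errors.append("issue_date is in the future")
    similarity = SequenceMatcher(None, customer_name.lower(),
                                 application_name.lower()).ratio()
    if similarity < 0.85:  # tolerance threshold is an assumption
        errors.append("customer_name does not match application record")
    return errors

print(policy_errors("Jane Doe", "12345678", "2024-01-15", "Jane Doe"))  # []
```

Returning a list of violations rather than a boolean makes the review queue more useful: a reviewer sees exactly which rule failed, not just that something did.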

Production Considerations

  • Data residency

    • Keep documents and embeddings inside approved regions.
    • If policy says customer data cannot leave country boundaries, use an on-prem or region-bound model endpoint.
  • Auditability

    • Log document ID, hash of source file, prompt version, model name, response payload hash, and validation outcome.
    • Store enough metadata to reconstruct why a field was accepted or rejected.
  • Monitoring

    • Track extraction accuracy by document type: statements vs payslips vs ID documents.
    • Alert on spikes in null fields, malformed JSON responses, or high manual-review rates.
  • Guardrails

    • Block unsupported doc types early.
    • Redact sensitive values from logs where possible.
    • Use human-in-the-loop review for low-confidence cases like ambiguous addresses or overwritten statement values.
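Redacting sensitive values before anything reaches application logs can be a simple masking pass. The pattern below is illustrative — it assumes account numbers are 8–12 digit runs, which you should replace with your real field formats:

```python
import re

# Illustrative pattern; extend to match your real account/ID formats.
ACCOUNT_PATTERN = re.compile(r"\b\d{8,12}\b")

def redact_for_logging(text: str) -> str:
    """Mask digit runs that look like account numbers, keeping the last 4."""
    def mask(match: re.Match) -> str:
        value = match.group(0)
        return "*" * (len(value) - 4) + value[-4:]
    return ACCOUNT_PATTERN.sub(mask, text)

print(redact_for_logging("Extracted account 12345678 for case 42"))
# Masks the 8-digit run but leaves the short case number intact
```

Run every log line through a function like this at the logging-formatter level, so individual call sites cannot forget to redact.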

Common Pitfalls

  1. Trying to extract from raw scans without OCR

    • LlamaIndex will not fix unreadable inputs.
    • Run OCR first and preserve page-level text so reviewers can trace each field back to its source.
  2. Letting the model free-form its output

    • Free text responses are hard to validate and impossible to automate safely.
    • Force JSON-shaped output with strict parsing using Pydantic or equivalent validators.
  3. Skipping compliance metadata

    • If you do not store source hashes, timestamps, model versions, and prompt versions, your audit trail is weak.
    • In retail banking that becomes a governance problem during investigations or regulatory reviews.
  4. Mixing documents from different customers in one extraction pass

    • That increases cross-document contamination and wrong-field attribution.
    • Process one case bundle per workflow instance unless you have strong segmentation logic keyed by case ID.
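One way to enforce that segmentation is to key the staging layout by case ID and build one index per bundle. The `CASE-<id>/` folder convention below is an assumption, not a LlamaIndex requirement — any stable per-case grouping works:

```python
from collections import defaultdict
from pathlib import Path

def group_by_case(staging_dir: str) -> dict[str, list[Path]]:
    """Group staged files by their immediate CASE-* parent directory."""
    bundles: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(Path(staging_dir).glob("CASE-*/*")):
        if path.is_file():
            bundles[path.parent.name].append(path)
    return dict(bundles)

# Each bundle then gets its own SummaryIndex and extraction pass,
# so fields from one customer cannot leak into another's output.
```

Driving the workflow loop from this mapping — one index, one query, one validation per bundle — is what keeps wrong-field attribution structurally impossible rather than merely unlikely.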

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.
