How to Build a Document Extraction Agent Using LlamaIndex in Python for Fintech
A document extraction agent for fintech takes PDFs, scans, statements, invoices, KYC forms, and loan packages, then turns them into structured data you can validate and route into downstream systems. It matters because the business value is not “chat with your files”; it is reducing manual ops work while keeping compliance, auditability, and data handling under control.
Architecture
- Document ingestion layer
  - Pulls files from S3, SharePoint, SFTP, or an internal upload API.
  - Normalizes inputs into Document objects before indexing.
- Parsing and chunking layer
  - Uses LlamaIndex readers/loaders to extract text from PDFs and text-based docs.
  - Splits content into chunks that preserve page boundaries and metadata.
- Extraction engine
  - Uses an LLM-backed query pipeline to extract specific fields like account number, invoice total, borrower name, or policy ID.
  - Returns structured outputs instead of free-form summaries.
- Validation layer
  - Checks extracted values against regexes, enums, checksum logic, and business rules.
  - Flags low-confidence or conflicting fields for human review.
- Audit and observability layer
  - Stores source document IDs, page numbers, prompt version, model version, and extracted payloads (see the sketch after this list).
  - Supports replay for internal audit and regulator requests.
- Storage/output layer
  - Writes approved results to a database, queue, or case management system.
  - Keeps raw documents in a controlled storage tier with retention policies.
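To make the audit layer concrete, here is a minimal sketch of the per-extraction record you might persist. The names (ExtractionAuditRecord, hash_document, prompt_version) are my own placeholders, not a LlamaIndex API; adapt them to your own audit store.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ExtractionAuditRecord:
    # Everything needed to replay or explain a single extraction run.
    document_id: str
    document_sha256: str        # hash of the raw file, not the parsed text
    page_numbers: list[int]
    prompt_version: str         # e.g. a git tag or version of the prompt template
    model_name: str
    extracted_payload: dict
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def hash_document(raw_bytes: bytes) -> str:
    """Stable fingerprint so an auditor can confirm which file was processed."""
    return hashlib.sha256(raw_bytes).hexdigest()
```

Writing one of these rows per extraction is what makes "replay for internal audit and regulator requests" possible later.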
Implementation
1) Install the right packages
For a production-grade extraction agent, keep the stack simple: LlamaIndex for orchestration and an LLM provider for generation. If you are using OpenAI as the backend during development:
```bash
pip install llama-index llama-index-llms-openai pydantic pypdf
```
Set your API key in the environment:
```bash
export OPENAI_API_KEY="your-key"
```
2) Load documents with metadata that survives the pipeline
In fintech, metadata is not optional. You need source IDs and page references so every extracted field can be traced back to origin.
```python
from pathlib import Path
from llama_index.core import SimpleDirectoryReader

docs = SimpleDirectoryReader(
    input_dir="./documents",
    recursive=True,
    filename_as_id=True,
).load_data()

for doc in docs:
    doc.metadata["source_system"] = "loan_ops"
    doc.metadata["document_type"] = "application_packet"
```
If you need stronger PDF handling or OCR upstream, do that before LlamaIndex. The agent should receive clean text plus metadata; do not make the extractor solve bad ingestion.
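If you do need OCR, a common pattern is to rasterize the PDF and run Tesseract before anything touches LlamaIndex. A minimal sketch, assuming pdf2image and pytesseract are installed (plus the poppler and tesseract system binaries); treat it as one possible preprocessing step, not a prescribed stack:

```python
# pip install pdf2image pytesseract  (also requires poppler and the tesseract binary)
from pdf2image import convert_from_path
import pytesseract

from llama_index.core import Document


def ocr_pdf_to_document(path: str) -> Document:
    """Rasterize each page, OCR it, and keep basic provenance in metadata."""
    pages = convert_from_path(path, dpi=300)
    page_texts = [pytesseract.image_to_string(img) for img in pages]
    return Document(
        text="\n\n".join(page_texts),
        metadata={"file_path": path, "page_count": len(pages)},
    )
```

The output is a clean Document you can tag with source_system and document_type exactly like the ones loaded above.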
3) Build an extraction prompt with structured output
Use PromptTemplate plus the OpenAI LLM wrapper to keep field extraction as close to deterministic as possible. For fintech workflows, prefer a schema-like response that your app can validate immediately.
```python
from typing import Optional

from pydantic import BaseModel
from llama_index.core import PromptTemplate
from llama_index.llms.openai import OpenAI


class LoanApplicationFields(BaseModel):
    applicant_name: str
    application_id: str
    requested_amount: float
    currency: str
    income_monthly: Optional[float] = None


prompt_tmpl = PromptTemplate(
    """You are extracting structured fields from a fintech document.

Return only valid JSON with these keys:
- applicant_name
- application_id
- requested_amount
- currency
- income_monthly

Rules:
- Use null if a field is missing.
- Do not guess.
- Preserve exact identifiers as written in the document.

Document text:
{context}
"""
)

llm = OpenAI(model="gpt-4o-mini", temperature=0)


def extract_fields(text: str) -> dict:
    prompt = prompt_tmpl.format(context=text)
    response = llm.complete(prompt)
    return LoanApplicationFields.model_validate_json(response.text).model_dump()
```
This pattern is boring on purpose. Boring wins in regulated environments because it is easier to test, audit, and explain.
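One practical addition: pydantic will raise if the model returns malformed or incomplete JSON, and in a batch job you usually want that routed to human review rather than crashing the worker. A small wrapper along these lines, reusing prompt_tmpl, llm, and LoanApplicationFields from above; the needs_review convention and return shape are my own, not part of LlamaIndex:

```python
from pydantic import ValidationError


def extract_fields_safe(text: str) -> dict:
    """Return either a validated payload or a review item carrying the raw output."""
    prompt = prompt_tmpl.format(context=text)
    response = llm.complete(prompt)
    try:
        fields = LoanApplicationFields.model_validate_json(response.text)
        return {"needs_review": False, "fields": fields.model_dump()}
    except ValidationError as exc:
        # Keep the raw model output so a reviewer can see exactly what came back.
        return {"needs_review": True, "raw_output": response.text, "errors": exc.errors()}
```

This is the hook where the validation layer from the architecture section plugs in: anything flagged needs_review goes to an ops queue instead of downstream systems.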
4) Run extraction across documents and persist results
Use this loop as the core of your agent. In production you would batch it through a queue worker or task runner.
```python
import json

results = []
for doc in docs:
    extracted = extract_fields(doc.text)
    result = {
        "document_id": doc.doc_id,
        "source_system": doc.metadata.get("source_system"),
        "document_type": doc.metadata.get("document_type"),
        "extracted": extracted,
    }
    results.append(result)

with open("extracted_loan_fields.json", "w") as f:
    json.dump(results, f, indent=2)
```
If you want retrieval over many documents before extraction, add a VectorStoreIndex and query relevant sections first. For single-document extraction tasks like loan packets or KYC forms, direct text-to-JSON is usually enough and easier to control.
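If you do go the retrieval route (for example, a long loan packet where only a few pages carry the fields you need), a minimal sketch looks like this. It uses the default OpenAI embedding model via the same API key, and the retrieval query string is only an illustration:

```python
from llama_index.core import VectorStoreIndex

# Index once, then pull only the sections likely to contain the target fields.
index = VectorStoreIndex.from_documents(docs)
retriever = index.as_retriever(similarity_top_k=4)

nodes = retriever.retrieve("applicant name, application id, requested amount, currency")
relevant_text = "\n\n".join(n.node.get_content() for n in nodes)

extracted = extract_fields(relevant_text)
```

The trade-off is extra moving parts (embeddings, an index, a retrieval query per field group) in exchange for shorter prompts on very large documents.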
Production Considerations
- Data residency
  - Keep EU customer documents in EU-hosted infrastructure if your policy requires it.
  - Do not send raw PII to external services unless your legal/compliance team has signed off on the data processing terms.
- Auditability
  - Store the exact prompt template version, model name, timestamp, document hash, and output JSON.
  - Keep page-level provenance so an auditor can trace every field back to source text.
- Guardrails
  - Reject outputs that fail schema validation or violate business rules (see the sketch after this list).
  - Example: account numbers should match expected length; currency should be from an approved enum; totals should be non-negative.
- Monitoring
  - Track extraction accuracy by document type and field.
  - Alert on spikes in null fields, malformed JSON, or model drift after prompt changes.
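As a concrete example of those guardrails, a post-extraction check might look like the sketch below. The approved currency set and the application-ID pattern are placeholders; substitute your own business rules.

```python
import re

APPROVED_CURRENCIES = {"USD", "EUR", "GBP"}           # placeholder enum
APPLICATION_ID_PATTERN = re.compile(r"^APP-\d{8}$")   # placeholder format


def business_rule_errors(fields: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record can proceed."""
    errors = []
    if fields["currency"] not in APPROVED_CURRENCIES:
        errors.append(f"unapproved currency: {fields['currency']}")
    if not APPLICATION_ID_PATTERN.match(fields["application_id"]):
        errors.append(f"malformed application_id: {fields['application_id']}")
    if fields["requested_amount"] < 0:
        errors.append("requested_amount is negative")
    return errors
```

Records with a non-empty error list get routed to the same human-review queue as schema failures, and the error strings themselves are worth logging for the monitoring layer.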
Common Pitfalls
- Treating OCR as part of the LLM problem
  - Bad scans need OCR preprocessing before LlamaIndex sees them.
  - If you feed garbage text into the extractor, you will get confident garbage out.
- Skipping schema validation
  - Never trust raw model output directly.
  - Use pydantic models like LoanApplicationFields so invalid responses fail fast instead of entering downstream finance systems.
- Losing provenance
  - If you do not store document IDs and metadata per extraction run, audit becomes painful.
  - Always persist source references alongside extracted values.
- Using one generic prompt for every document type
  - Invoice extraction and KYC extraction are different problems.
  - Maintain separate prompts and schemas per document class so your precision stays high enough for operations teams to trust it (see the sketch after this list).
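One way to keep prompts and schemas separate per document class is a small registry keyed by the document_type metadata set at ingestion. A minimal sketch, reusing prompt_tmpl, llm, and LoanApplicationFields from the earlier steps; InvoiceFields and the invoice prompt are hypothetical stand-ins for schemas you would define with the same care:

```python
from pydantic import BaseModel
from llama_index.core import PromptTemplate


class InvoiceFields(BaseModel):
    # Hypothetical second schema for a different document class.
    vendor_name: str
    invoice_number: str
    total_amount: float
    currency: str


invoice_prompt = PromptTemplate(
    """Return only valid JSON with these keys: vendor_name, invoice_number, total_amount, currency.

Document text:
{context}
"""
)

EXTRACTION_REGISTRY = {
    "application_packet": (prompt_tmpl, LoanApplicationFields),
    "invoice": (invoice_prompt, InvoiceFields),
}


def extract_for(doc) -> dict:
    """Pick the prompt and schema based on the document_type set at ingestion."""
    template, schema = EXTRACTION_REGISTRY[doc.metadata["document_type"]]
    response = llm.complete(template.format(context=doc.text))
    return schema.model_validate_json(response.text).model_dump()
```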
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.