How to Build a Document Extraction Agent Using LlamaIndex in Python for Banking
A document extraction agent for banking reads PDFs, scans, statements, loan packs, KYC forms, and correspondence, then turns them into structured fields your downstream systems can use. The point is not just OCR; it is to reliably extract account numbers, customer names, dates, balances, signatures, and risk signals while keeping auditability, compliance, and data residency under control.
Architecture
- Document ingestion layer
  - Pulls files from S3, SharePoint, SFTP, or an internal case-management system.
  - Normalizes PDFs, images, and text documents into a consistent input format.
- Parsing and chunking layer
  - Uses SimpleDirectoryReader or custom loaders to read files.
  - Splits content into manageable chunks with SentenceSplitter so the LLM sees bounded context.
- Extraction engine
  - Uses a LlamaIndex query pipeline or structured output prompt to extract banking fields.
  - Returns JSON-like records instead of freeform text.
- Validation layer
  - Checks extracted values against schema rules: IBAN format, date ranges, currency precision, mandatory fields.
  - Rejects or flags low-confidence results for human review.
- Audit and traceability layer
  - Stores source document IDs, chunk IDs, prompts, model version, and extraction timestamps.
  - Gives compliance teams a full trail for each decision.
- Persistence layer
  - Writes validated results to Postgres, a case system, or a warehouse.
  - Keeps raw documents in-region to satisfy residency requirements.
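The layers above can be sketched as a minimal pipeline skeleton. Everything here is illustrative scaffolding, not LlamaIndex API: the function name, the stubbed extraction step, and the AuditRecord fields are assumptions you would replace with your own components.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative audit record covering the traceability layer above.
@dataclass
class AuditRecord:
    document_id: str
    chunk_ids: list
    model_version: str
    extracted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def run_pipeline(raw_text: str, document_id: str) -> dict:
    """Wire the layers together: ingest -> chunk -> extract -> validate -> audit."""
    # Parsing/chunking layer (naive fixed-width chunks for illustration only).
    chunks = [raw_text[i:i + 512] for i in range(0, len(raw_text), 512)]
    # Extraction engine (stubbed; in practice this calls the LLM).
    fields_out = {"document_type": "statement"}
    # Validation layer (stubbed; in practice this is schema + regex checks).
    assert "document_type" in fields_out
    # Audit layer: record which chunks and which model produced the result.
    audit = AuditRecord(
        document_id,
        [f"{document_id}-{n}" for n in range(len(chunks))],
        "model-v1",
    )
    return {"fields": fields_out, "audit": audit}

result = run_pipeline("Opening balance 1,000.00 EUR ...", "doc-001")
print(result["audit"].document_id)
```

The point of the skeleton is the shape, not the stubs: each layer is a separate, testable step with the audit record produced alongside the extraction rather than bolted on later.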
Implementation
1) Load documents from disk or a controlled staging area
For banking workloads, start with a local or private storage path. SimpleDirectoryReader is the standard entry point when you want predictable ingestion.
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(
    input_dir="./bank_docs",
    recursive=True,
).load_data()
print(f"Loaded {len(documents)} documents")
print(documents[0].metadata)
If you are processing scanned PDFs, run OCR upstream before this step. LlamaIndex is the orchestration layer here; it does not replace your document capture pipeline.
2) Split documents into chunks that preserve context
Banking documents are dense. You want chunks large enough to keep fields together but small enough to stay within model limits. SentenceSplitter is the default workhorse for this.
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
# Document metadata is propagated onto each node automatically.
nodes = splitter.get_nodes_from_documents(documents)
print(f"Created {len(nodes)} nodes")
print(nodes[0].text[:500])
This gives you node-level traceability. Keep the original document metadata attached so every extracted field can be traced back to its source file.
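To make chunk_size and chunk_overlap concrete, here is a simplified, illustrative splitter. It is not LlamaIndex's implementation (SentenceSplitter also respects sentence boundaries); it only shows how the overlap window carries shared context between adjacent chunks.

```python
def split_with_overlap(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> list[str]:
    """Slide a chunk_size window, stepping forward chunk_size - chunk_overlap each time."""
    step = chunk_size - chunk_overlap
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]

chunks = split_with_overlap("A" * 1000, chunk_size=512, chunk_overlap=64)
print(len(chunks))      # 3 chunks for 1000 characters
print(len(chunks[0]))   # 512
```

The overlap matters in banking documents because a field label and its value often straddle a chunk boundary; the shared window keeps them together in at least one chunk.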
3) Build an extraction query engine with a strict prompt
Use VectorStoreIndex plus as_query_engine() when you need retrieval over long documents. For extraction tasks in banking, keep the prompt explicit and force structured output.
from llama_index.core import VectorStoreIndex
from llama_index.core.prompts import PromptTemplate
index = VectorStoreIndex(nodes)
extraction_prompt = PromptTemplate(
"""
You are extracting banking document fields.
Return only valid JSON with these keys:
customer_name, account_number, document_type, statement_date,
currency, opening_balance, closing_balance
Rules:
- If a field is missing, use null
- Do not guess
- Preserve exact numbers as strings when needed for identifiers
- Use evidence only from the provided context
Context:
{context_str}
Question:
{query_str}
"""
)
query_engine = index.as_query_engine(
    similarity_top_k=5,
    text_qa_template=extraction_prompt,
    response_mode="compact",
)
response = query_engine.query("Extract the banking fields from this document.")
print(response.response)
This pattern works because retrieval narrows the context before generation. In practice, that reduces hallucinations on long statements and loan agreements.
4) Validate and persist the output
Do not trust raw model output in a bank. Parse it into a schema first, then write only validated records to storage.
import json
from pydantic import BaseModel
from typing import Optional
class BankExtraction(BaseModel):
    customer_name: Optional[str] = None
    account_number: Optional[str] = None
    document_type: Optional[str] = None
    statement_date: Optional[str] = None
    currency: Optional[str] = None
    opening_balance: Optional[str] = None
    closing_balance: Optional[str] = None
raw_text = response.response.strip()
if raw_text.startswith("```"):  # strip markdown code fences if the model added them
    raw_text = raw_text.strip("`").removeprefix("json").strip()
data = json.loads(raw_text)
extraction = BankExtraction(**data)  # raises pydantic.ValidationError on bad fields
print(extraction.model_dump())
At this point you can persist extraction.model_dump() into your case database together with documents[0].metadata, model name, timestamp, and operator ID. That audit record matters when compliance asks why a field was accepted.
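One way to persist the validated record together with its audit fields, sketched against SQLite so it runs anywhere. The table layout and column names are assumptions; adapt them to your case database, and swap the in-memory connection for your real one.

```python
import hashlib
import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # use your real case database in production
conn.execute("""
    CREATE TABLE IF NOT EXISTS extractions (
        id INTEGER PRIMARY KEY,
        source_file TEXT,
        source_hash TEXT,
        model_name TEXT,
        extracted_at TEXT,
        payload TEXT
    )
""")

def persist(record: dict, source_file: str, raw_bytes: bytes, model_name: str) -> None:
    """Write the validated extraction plus its audit trail in one row."""
    conn.execute(
        "INSERT INTO extractions "
        "(source_file, source_hash, model_name, extracted_at, payload) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            source_file,
            hashlib.sha256(raw_bytes).hexdigest(),  # tamper-evident link to the source document
            model_name,
            datetime.now(timezone.utc).isoformat(),
            json.dumps(record),
        ),
    )
    conn.commit()

persist(
    {"customer_name": "Jane Doe", "currency": "EUR"},
    "statement_2024.pdf",
    b"raw pdf bytes here",
    "model-v1",
)
row = conn.execute("SELECT source_file, payload FROM extractions").fetchone()
print(row)
```

Hashing the raw bytes rather than storing only a filename means a compliance reviewer can later prove the extraction came from exactly this version of the document.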
Production Considerations
- Keep inference inside your residency boundary
  - Use self-hosted models or region-locked endpoints if customer data cannot leave jurisdiction.
  - Store raw documents and embeddings in-region as well.
- Log every extraction decision
  - Persist source document hash, chunk IDs from LlamaIndex nodes, prompt version, response payload, and validation outcome.
  - This is what makes post-trade review or KYC audit possible.
- Add deterministic guardrails
  - Reject outputs that fail regex checks for account numbers or dates.
  - Route low-confidence extractions to human review instead of auto-posting them downstream.
- Monitor drift by document type
  - Statements behave differently from loan applications or insurance claims.
  - Track accuracy per template type so one bad form redesign does not poison the whole pipeline.
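The deterministic guardrails can be plain regex and date checks run before anything is posted downstream. The patterns below are illustrative (a generic IBAN shape and ISO dates); use the exact formats your own document types mandate.

```python
import re
from datetime import datetime

# Generic IBAN shape: country code, two check digits, 11-30 alphanumerics.
IBAN_RE = re.compile(r"^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$")

def check_extraction(fields: dict) -> list[str]:
    """Return guardrail violations; an empty list means the record may auto-post."""
    problems = []
    iban = fields.get("account_number")
    if iban and not IBAN_RE.fullmatch(iban.replace(" ", "")):
        problems.append("account_number fails IBAN format check")
    date = fields.get("statement_date")
    if date:
        try:
            datetime.strptime(date, "%Y-%m-%d")
        except ValueError:
            problems.append("statement_date is not ISO formatted")
    return problems

ok = check_extraction(
    {"account_number": "DE89370400440532013000", "statement_date": "2024-03-31"}
)
bad = check_extraction(
    {"account_number": "12345", "statement_date": "31/03/2024"}
)
print(ok)   # []
print(bad)  # two violations -> route to human review
```

Because these checks are deterministic, a failed record can be rejected or queued for review with a reason string attached, which is exactly the kind of explanation an auditor asks for.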
Common Pitfalls
- Trying to extract directly from full documents
  - Large PDFs will produce noisy outputs and missed fields.
  - Chunk first with SentenceSplitter, then retrieve relevant nodes before extraction.
- Letting the model invent missing values
  - Banks cannot tolerate guessed account numbers or balances.
  - Force nulls for missing data and validate every field against schema rules.
- Skipping metadata and audit trails
  - If you do not store source file IDs and node references, you cannot explain an extraction later.
  - Always persist the original document metadata alongside the parsed result.
A good banking extraction agent is boring in the right way: predictable inputs, strict outputs, strong validation. LlamaIndex gives you the orchestration primitives; your job is to wrap them in controls that satisfy compliance teams without slowing operations down.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.