How to Build a Document Extraction Agent Using LlamaIndex in Python for Payments
A document extraction agent for payments takes messy inbound files like invoices, bank statements, remittance advice, and payment instructions, then turns them into structured fields your downstream systems can trust. That matters because payment operations live or die on accuracy, traceability, and turnaround time; one bad extraction can mean a delayed settlement, a compliance issue, or a manual reconciliation headache.
Architecture
- Document ingestion layer
  - Pull PDFs, scans, emails, and image attachments from S3, blob storage, or an internal queue.
  - Normalize everything into Document objects before extraction.
- OCR and text parsing
  - Use OCR for scanned files and native text extraction for digital PDFs.
  - Keep page-level metadata so you can trace every field back to source evidence.
- LlamaIndex extraction pipeline
  - Use LlamaParse or file loaders to convert documents into structured text.
  - Feed parsed content into an LLM-backed extractor with a strict schema.
- Schema and validation layer
  - Define payment-specific fields like invoice number, amount, currency, beneficiary name, IBAN, SWIFT/BIC, due date.
  - Validate types, required fields, and business rules before writing to core systems.
- Audit and observability
  - Persist raw input hashes, extracted JSON, confidence signals, and source spans.
  - Keep an immutable audit trail for disputes and compliance reviews.
- Human review fallback
  - Route low-confidence or policy-flagged documents to an ops queue.
  - Never auto-post payment instructions without validation gates.
Implementation
1) Install dependencies and load documents
For production payment workflows, start with a parser that handles messy PDFs reliably. LlamaParse is the usual entry point when you need layout-aware parsing instead of plain text extraction.
pip install llama-index llama-parse pydantic
import os

from llama_index.core import SimpleDirectoryReader

# If you use LlamaParse in production:
# export LLAMA_CLOUD_API_KEY="your-key"

docs = SimpleDirectoryReader(
    input_dir="./payment_docs",
    recursive=True,
).load_data()

print(f"Loaded {len(docs)} documents")
If your source is scanned invoices or bank forms, keep the original files alongside the parsed text. Payments teams will ask where each field came from during audit.
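You already get some provenance for free: SimpleDirectoryReader attaches reader metadata to each Document. The exact keys depend on your llama-index version, but file path and name are typically present; a quick inspection sketch:

# Each Document carries reader-attached metadata; exact keys vary by version,
# but "file_path" is typically set by SimpleDirectoryReader
for doc in docs[:3]:
    print(doc.metadata.get("file_path"), "->", len(doc.text), "chars")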
2) Define a strict extraction schema
Use pydantic.BaseModel to force the model into a predictable output shape. For payments work, this is non-negotiable because downstream posting logic should not guess at missing or malformed fields.
from pydantic import BaseModel, Field
from typing import Optional

class PaymentDocument(BaseModel):
    document_type: str = Field(description="Invoice, remittance advice, bank statement, etc.")
    invoice_number: Optional[str] = Field(default=None)
    amount: float = Field(description="Total amount on the document")
    currency: str = Field(description="ISO currency code like USD or EUR")
    beneficiary_name: Optional[str] = Field(default=None)
    iban: Optional[str] = Field(default=None)
    swift_bic: Optional[str] = Field(default=None)
    due_date: Optional[str] = Field(default=None)
Keep the schema narrow. If your business only needs eight fields to route a payment exception correctly, do not ask the model for twenty-five.
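As a quick sanity check on the shape (the values here are illustrative), note that optional fields you do not supply default to None rather than invented values:

# Only fields actually present on the document are supplied
doc = PaymentDocument(
    document_type="Invoice",
    amount=1250.00,
    currency="EUR",
)

print(doc.model_dump())
# invoice_number, beneficiary_name, iban, swift_bic, due_date all come back as None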
3) Build the extractor with LlamaIndex
The cleanest pattern is Document -> SummaryIndex -> query engine with a structured prompt. In LlamaIndex's Python API, this gives you a reusable extraction flow without hand-rolling prompt glue.
from llama_index.core import SummaryIndex
from llama_index.core.prompts import PromptTemplate

EXTRACTION_PROMPT = PromptTemplate(
    """
You are extracting structured data from a payments document.
Return only information explicitly present in the document.
If a field is missing, use null.
Do not infer values.

Fields:
- document_type
- invoice_number
- amount
- currency
- beneficiary_name
- iban
- swift_bic
- due_date

Document:
{context_str}
"""
)

index = SummaryIndex.from_documents(docs)
query_engine = index.as_query_engine(text_qa_template=EXTRACTION_PROMPT)

response = query_engine.query("Extract the payment fields as JSON.")
print(response)
For stricter control in production, wrap this with post-processing that parses the response into PaymentDocument. If your model output drifts from JSON too often, move to function-calling or structured output support in your chosen LLM backend.
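A minimal sketch of that stricter path using LlamaIndex's structured prediction, assuming an OpenAI backend (the model name is illustrative, and llama-index-llms-openai must be installed):

from llama_index.llms.openai import OpenAI

# Model choice is an assumption; use whatever backend your compliance team approves
llm = OpenAI(model="gpt-4o-mini")

# structured_predict fills the Pydantic schema directly, so malformed output
# surfaces as an exception instead of silently drifting JSON
extracted = llm.structured_predict(
    PaymentDocument,
    EXTRACTION_PROMPT,
    context_str=docs[0].text,
)
print(extracted.model_dump())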
4) Validate and route results
Extraction is not done when the model returns text. It is done when the result passes validation and lands in the right operational path.
import json

from pydantic import ValidationError

def parse_payment_document(raw_text: str) -> PaymentDocument:
    payload = json.loads(raw_text)
    return PaymentDocument.model_validate(payload)

try:
    extracted = parse_payment_document(str(response))
    print(extracted.model_dump())
except (json.JSONDecodeError, ValidationError):
    # Send to human review queue
    print("Invalid extraction; route to manual review")
This is where payments-specific controls live (a validator sketch follows the list):
- Reject negative amounts unless your process supports credit notes.
- Require ISO currency codes.
- Check IBAN/SWIFT format before any downstream transfer instruction is created.
- Store the source document ID with every extracted record for auditability.
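Here is a minimal sketch of those checks as Pydantic v2 validators layered on the schema above. The regexes are simplified structural checks, not full ISO 4217 or IBAN checksum validation; treat them as illustrative.

import re

from pydantic import field_validator

class ValidatedPaymentDocument(PaymentDocument):
    @field_validator("amount")
    @classmethod
    def amount_non_negative(cls, v: float) -> float:
        # Reject negative amounts; relax this only if your process supports credit notes
        if v < 0:
            raise ValueError("negative amount; route to manual review")
        return v

    @field_validator("currency")
    @classmethod
    def currency_is_iso_shaped(cls, v: str) -> str:
        # Three uppercase letters is a cheap structural check, not a full ISO 4217 lookup
        if not re.fullmatch(r"[A-Z]{3}", v):
            raise ValueError(f"not an ISO-style currency code: {v!r}")
        return v

    @field_validator("iban")
    @classmethod
    def iban_is_well_formed(cls, v):
        # Structural check only; add a mod-97 checksum before creating transfer instructions
        if v is not None and not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", v):
            raise ValueError(f"malformed IBAN: {v!r}")
        return v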
Production Considerations
- Compliance and audit
  - Persist raw documents in compliant storage with retention policies aligned to finance regulations.
  - Store extracted payloads with timestamps, model version, prompt version, and source file checksum.
  - Make it easy to reconstruct why a field was accepted or rejected during dispute handling.
- Data residency
  - Keep parsing and inference in-region if documents contain bank details or personal data.
  - Avoid sending payment documents across regions unless legal and contractual controls are explicit.
  - If you use managed parsing services like LlamaParse or hosted LLMs, confirm residency guarantees first.
- Monitoring
  - Track field-level accuracy on sampled documents: amount mismatch rate, missing IBAN rate, manual review rate.
  - Alert on spikes in low-confidence extractions or malformed JSON responses.
  - Measure latency separately for OCR/parsing and LLM inference so you know where bottlenecks sit.
- Guardrails
  - Block auto-processing when required fields are absent or inconsistent with business rules.
  - Add deterministic checks for totals against line items when extracting invoices (see the sketch after this list).
  - Never let free-form model output directly trigger payment execution.
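A minimal sketch of that deterministic totals check, assuming line-item amounts are extracted as decimal strings (the helper name and the 0.01 tolerance are illustrative):

from decimal import Decimal

def totals_match(line_amounts: list[str], stated_total: str, tolerance: str = "0.01") -> bool:
    # Use Decimal for money; float arithmetic drifts on exactly the cases auditors care about
    computed = sum(Decimal(a) for a in line_amounts)
    return abs(computed - Decimal(stated_total)) <= Decimal(tolerance)

# An invoice whose line items sum to the stated total passes the gate
assert totals_match(["100.00", "23.50"], "123.50")
assert not totals_match(["100.00", "23.50"], "200.00")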
Common Pitfalls
- Using plain text extraction on scanned PDFs
  - Scans need OCR-aware parsing. If you skip it, you will get empty or broken text and garbage extractions.
  - Use layout-aware parsing first; only fall back to plain text for digital PDFs with clean embedded text.
- Letting the model infer missing values
  - In payments workflows, inference becomes fraud risk fast.
  - Force nulls for absent fields and validate everything against source evidence before posting anything downstream.
- Skipping source traceability
  - If you cannot show which page produced an IBAN or amount field, audits become painful.
  - Keep document IDs, page numbers (where your parser's metadata exposes them), raw text snippets where allowed, and model/version metadata alongside every extracted record (see the sketch after this list).
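One way to keep that metadata together is a single record per extraction. This is an illustrative shape, not a prescribed schema; field names like model_version and prompt_version are assumptions about how you label deployments:

import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone

def sha256_of(path: str) -> str:
    # Checksum of the raw input file, for tamper-evident traceability
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

@dataclass
class ExtractionRecord:
    document_id: str
    source_sha256: str    # ties the record back to the exact input bytes
    extracted: dict       # PaymentDocument.model_dump()
    model_version: str    # hypothetical label for the model you deployed
    prompt_version: str   # version your prompts so changes are auditable
    page_numbers: list = field(default_factory=list)
    extracted_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())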
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.