How to Build a Document Extraction Agent Using LlamaIndex in Python for Lending
A document extraction agent for lending reads borrower files, pulls out the fields your workflow needs, and turns messy PDFs into structured data you can underwrite against. In practice, that means extracting things like borrower name, employer, income, liabilities, bank statements, and missing-document status without forcing an ops team to manually review every file.
Architecture
- Document ingestion layer
  - Accepts PDFs, images, and Office docs from loan origination systems or secure object storage.
  - Normalizes files into `Document` objects before indexing.
- Parsing and OCR layer
  - Uses LlamaIndex readers/loaders to extract text from native PDFs.
  - Routes scanned documents through OCR before extraction.
- Extraction engine
  - Uses an LLM-backed structured extractor to map text into a lending schema.
  - Produces JSON-like outputs for downstream underwriting rules.
- Validation layer
  - Checks required fields, formats, and cross-field consistency.
  - Flags low-confidence or incomplete extractions for human review.
- Audit and traceability layer
  - Stores source document IDs, page references, and extraction timestamps.
  - Keeps a defensible trail for compliance reviews.
- Integration layer
  - Pushes validated results into the LOS, CRM, or decisioning services.
  - Supports retries and dead-letter handling for failed extractions.
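Before wiring in any LLM calls, the layers above can be sketched as a thin, testable pipeline. Everything below is illustrative: the stage names, the `CaseFile` container, and the stubbed extraction step are assumptions for the sketch, not LlamaIndex APIs.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CaseFile:
    """One loan package moving through the pipeline (illustrative container)."""
    case_id: str
    text: str
    fields: dict = field(default_factory=dict)
    flags: List[str] = field(default_factory=list)


def ingest(case: CaseFile) -> CaseFile:
    # Ingestion layer: normalize raw text before anything else touches it.
    case.text = case.text.strip()
    return case


def extract(case: CaseFile) -> CaseFile:
    # Extraction engine: stubbed here; the real version calls the LLM-backed extractor.
    case.fields = {"borrower_name": None}
    return case


def validate(case: CaseFile) -> CaseFile:
    # Validation layer: flag every required field the extractor left empty.
    case.flags = [k for k, v in case.fields.items() if v is None]
    return case


STAGES: List[Callable[[CaseFile], CaseFile]] = [ingest, extract, validate]


def run_pipeline(case: CaseFile) -> CaseFile:
    for stage in STAGES:
        case = stage(case)
    return case
```

Keeping each layer as a separate function makes it easy to swap the stubbed extractor for the real one later without touching ingestion or validation.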
Implementation
1) Install the right packages
You need LlamaIndex core plus a reader for PDFs. For production lending workflows, keep dependencies explicit so your security team can review them.
```shell
pip install llama-index llama-index-readers-file pydantic
```
If you are using OpenAI as the model provider:
```shell
pip install llama-index-llms-openai
export OPENAI_API_KEY="your-key"
```
2) Load documents into LlamaIndex
Use SimpleDirectoryReader for batch ingestion from a controlled folder or mounted bucket. This is enough for a first production pass if your upstream system already handles file validation.
```python
from llama_index.core import SimpleDirectoryReader

docs = SimpleDirectoryReader(
    input_dir="./loan_packages",
    recursive=True,
).load_data()
print(f"Loaded {len(docs)} documents")
```
For lending, do not mix unrelated files in the same directory. Keep one loan package per folder or per case ID so audit trails stay clean.
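One way to enforce that separation is to derive the case ID from the folder layout itself. This is a standard-library sketch that assumes a `loan_packages/<case_id>/...` directory convention; adapt the layout to whatever your upstream system produces.

```python
from collections import defaultdict
from pathlib import Path


def group_by_case(root: str) -> dict:
    """Map each top-level case folder (e.g. loan_packages/LN-10482/) to its files."""
    cases = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # The first path component under the root is treated as the case ID.
            case_id = path.relative_to(root).parts[0]
            cases[case_id].append(path)
    return dict(cases)
```

You can then run `SimpleDirectoryReader` once per case folder, so every loaded document inherits an unambiguous case ID for the audit trail.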
3) Define the extraction schema and build a structured extractor
Use Pydantic models to force consistent output. This is where you turn unstructured loan docs into something underwriting code can trust.
```python
from typing import List, Optional

from pydantic import BaseModel, Field
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI


class LoanApplicationFields(BaseModel):
    borrower_name: Optional[str] = Field(default=None)
    employer_name: Optional[str] = Field(default=None)
    monthly_income: Optional[float] = Field(default=None)
    total_monthly_debt: Optional[float] = Field(default=None)
    property_address: Optional[str] = Field(default=None)
    requested_loan_amount: Optional[float] = Field(default=None)
    missing_documents: List[str] = Field(default_factory=list)


# In lending, temperature=0 matters because you want stable, repeatable outputs.
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
```
LlamaIndex’s actual extraction flow is usually built around structured prediction or query engines over indexed content. If your documents are long, index them first; if they are small packet-level docs, direct extraction is fine.
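A rough routing heuristic for that decision might look like the function below. The words-per-token ratio and the 3,000-token cutoff are approximations I chose for illustration, not LlamaIndex constants; tune them against your model's context window.

```python
def needs_indexing(text: str, max_tokens: int = 3000) -> bool:
    """Route long documents to an index; extract short ones directly."""
    # Rough approximation: English text runs about 0.75 words per token.
    approx_tokens = int(len(text.split()) / 0.75)
    return approx_tokens > max_tokens
```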
4) Extract fields from each document and validate them
This example uses PydanticProgramExtractor, which is the cleanest pattern when you want typed outputs from document text. It gives you a predictable structure that downstream systems can consume.
```python
from llama_index.core.extractors import PydanticProgramExtractor
from llama_index.core.schema import Document
from llama_index.program.openai import OpenAIPydanticProgram

# The program holds the typed output schema; the extractor feeds it document text.
program = OpenAIPydanticProgram.from_defaults(
    output_cls=LoanApplicationFields,
    prompt_template_str="{input}",
    llm=Settings.llm,
)

extractor = PydanticProgramExtractor(
    program=program,
    input_key="input",
    extract_template_str=(
        "Extract lending fields from the following document text.\n"
        "Return only values supported by the text.\n\n"
        "Document:\n{context_str}"
    ),
)

loan_docs = [Document(text=d.text, metadata=d.metadata) for d in docs]
extracted_nodes = extractor.extract(loan_docs)
for item in extracted_nodes:
    print(item)
```
A practical production pattern is to attach metadata before extraction:
```python
for d in loan_docs:
    d.metadata["case_id"] = "LN-10482"
    d.metadata["source_system"] = "broker_portal"
```
Then persist both the raw text reference and extracted output. That gives compliance teams an audit path back to the original file.
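A minimal sketch of that persistence step using only the standard library. The record field names (`source_sha256`, `prompt_version`, and so on) are assumptions for illustration, not a LlamaIndex or LOS schema.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(case_id: str, source_path: str, raw_text: str,
                 extracted: dict, model_version: str, prompt_version: str) -> str:
    """Serialize one extraction with enough provenance to reconstruct it later."""
    record = {
        "case_id": case_id,
        "source_path": source_path,
        # Hash the raw text so reviewers can prove which input produced the output.
        "source_sha256": hashlib.sha256(raw_text.encode("utf-8")).hexdigest(),
        "model_version": model_version,
        "prompt_version": prompt_version,
        "extracted": extracted,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, sort_keys=True)
```

Write these records to append-only storage so the trail stays immutable.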
Production Considerations
- Compliance and auditability
  - Store source document hashes, case IDs, model version, prompt version, and timestamps.
  - Keep extraction logs immutable so reviewers can reconstruct how a field was produced.
- Data residency
  - Keep document processing inside the required region if you handle regulated borrower data.
  - If your lender operates across jurisdictions, route EU/UK/US data to region-specific deployments.
- Guardrails
  - Reject extractions that invent values not present in the source text.
  - Add field-level validation for amounts, dates, SSNs/TINs, and employment history before passing results to underwriting.
- Monitoring
  - Track extraction success rate by doc type: pay stubs, bank statements, W-2s, tax returns.
  - Alert on spikes in missing fields or low-confidence outputs; those usually mean OCR drift or template changes.
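The field-level validation mentioned under Guardrails can start as plain predicate functions. The thresholds and formats below are illustrative defaults, not regulatory rules; replace them with your lender's actual policy.

```python
import re
from datetime import datetime


def validate_amount(value) -> bool:
    """Illustrative sanity bounds for dollar amounts, not underwriting policy."""
    return isinstance(value, (int, float)) and 0 < value < 10_000_000


def validate_ssn(value) -> bool:
    """Format check only (AAA-GG-SSSS); never log or echo the raw value."""
    return bool(re.fullmatch(r"\d{3}-\d{2}-\d{4}", value or ""))


def validate_date(value) -> bool:
    """Accept ISO dates; route anything else to exception handling."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (ValueError, TypeError):
        return False
```

Run these checks between extraction and underwriting so a malformed value never reaches a decisioning rule.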
Common Pitfalls
- Using free-form prompts instead of typed schemas
  - This produces inconsistent JSON and breaks downstream rules engines.
  - Use `PydanticProgramExtractor` or another typed output path every time you can.
- Skipping an OCR strategy for scanned documents
  - Native PDF parsing will fail on image-based bank statements and pay stubs.
  - Detect scanned pages early and run OCR before sending text to LlamaIndex.
- Ignoring provenance
  - If you cannot show where `monthly_income` came from on page 3 of a pay stub, compliance will reject the workflow.
  - Persist source metadata alongside each extracted field and keep the raw document reference intact.
- Letting the model infer missing data
  - Lending workflows need evidence-based extraction, not guesswork.
  - Treat absent values as `null`, then route them to exception handling instead of filling them in automatically.
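Treating absent values as null and routing them onward can be as simple as a small triage function. The status labels and the default required-field tuple here are illustrative choices, not fixed names from any library.

```python
def route_extraction(extracted: dict,
                     required: tuple = ("borrower_name", "monthly_income")) -> dict:
    """Send records with missing required fields to human review instead of guessing."""
    missing = [f for f in required if extracted.get(f) is None]
    status = "needs_review" if missing else "ready_for_underwriting"
    return {"status": status, "missing_fields": missing}
```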
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit