How to Build a Document Extraction Agent Using LangChain in Python for Lending
A document extraction agent for lending reads borrower documents like bank statements, payslips, tax returns, and ID scans, then turns them into structured fields your underwriting system can use. It matters because lending decisions depend on speed, consistency, and auditability, and manual review of PDFs is where most bottlenecks and errors show up.
Architecture
- Document ingestion layer
  - Accepts PDFs, images, and scanned files from application portals or internal queues.
  - Normalizes file metadata like applicant ID, document type, and submission timestamp.
- OCR and text extraction
  - Uses `PyPDFLoader`, `UnstructuredPDFLoader`, or image OCR upstream to get raw text.
  - Handles low-quality scans before the LLM sees anything.
- Field extraction chain
  - Uses a `ChatPromptTemplate` plus a structured output model to extract lending fields.
  - Returns JSON with values like employer name, net income, account balance, and statement period.
- Validation layer
  - Checks extracted values against business rules.
  - Flags missing dates, inconsistent totals, or suspicious values for manual review.
- Audit and traceability store
  - Persists source document references, model outputs, confidence notes, and versioned prompts.
  - Required for compliance reviews and dispute handling.
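The ingestion layer above can be sketched as a small normalized record. This is a minimal illustration, not a LangChain API; the record and field names are hypothetical and would follow your own portal's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical ingestion record; fields are illustrative, not a LangChain type.
@dataclass
class IngestedDocument:
    applicant_id: str
    document_type: str  # e.g. "bank_statement", "payslip"
    file_path: str
    submitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def normalize_upload(applicant_id: str, document_type: str, file_path: str) -> IngestedDocument:
    """Normalize raw upload metadata before it enters the OCR queue."""
    return IngestedDocument(
        applicant_id=applicant_id.strip(),
        document_type=document_type.strip().lower(),
        file_path=file_path,
    )
```

Normalizing early means every downstream layer (OCR, extraction, audit) can key off the same applicant ID and document type without re-parsing filenames.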
Implementation
1) Install the right packages
Use LangChain core plus a provider integration. For this example I’ll use OpenAI through LangChain because the structured output API is straightforward.
```shell
pip install langchain langchain-openai langchain-community pydantic pypdf
```
Set your key in the environment:
```shell
export OPENAI_API_KEY="your-key"
```
2) Define the schema you want from lending documents
Don’t extract free-form text if your downstream system expects fixed fields. Use Pydantic so LangChain can validate the output before you write it anywhere.
```python
from typing import Optional
from pydantic import BaseModel, Field

class LendingDocumentFields(BaseModel):
    document_type: str = Field(description="Type of document such as bank_statement or payslip")
    full_name: str = Field(description="Borrower's full legal name")
    employer_name: Optional[str] = Field(default=None, description="Employer name if present")
    monthly_income: Optional[float] = Field(default=None, description="Monthly gross or net income")
    account_balance: Optional[float] = Field(default=None, description="Ending balance for bank statements")
    statement_period_start: Optional[str] = Field(default=None, description="Statement period start date in ISO format")
    statement_period_end: Optional[str] = Field(default=None, description="Statement period end date in ISO format")
    confidence_notes: str = Field(description="Short notes about ambiguity or missing data")
```
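Pydantic enforces this schema before anything is persisted. A quick standalone check shows how type coercion and rejection behave; the schema is repeated here (without the `Field` descriptions) so the example runs on its own.

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

# Condensed copy of the schema above so this example is self-contained.
class LendingDocumentFields(BaseModel):
    document_type: str
    full_name: str
    employer_name: Optional[str] = None
    monthly_income: Optional[float] = None
    account_balance: Optional[float] = None
    statement_period_start: Optional[str] = None
    statement_period_end: Optional[str] = None
    confidence_notes: str = ""

# A numeric string is coerced to float, so minor formatting noise survives.
ok = LendingDocumentFields(
    document_type="bank_statement",
    full_name="Jane Doe",
    account_balance="1523.75",
    confidence_notes="clear scan",
)
print(ok.account_balance)

# A non-numeric balance is rejected before it can reach underwriting.
try:
    LendingDocumentFields(
        document_type="bank_statement",
        full_name="Jane Doe",
        account_balance="unknown",
        confidence_notes="",
    )
    raised = False
except ValidationError as exc:
    raised = True
    print(f"{len(exc.errors())} validation error(s)")
```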
3) Build the extraction chain with LangChain
This pattern uses `PyPDFLoader`, `ChatOpenAI`, `ChatPromptTemplate`, and `with_structured_output()`. It is a clean production pattern because the model must return data that matches your schema.
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("sample_bank_statement.pdf")
docs = loader.load()
text = "\n\n".join(page.page_content for page in docs)

prompt = ChatPromptTemplate.from_messages([
    ("system", """
You extract fields from lending documents.
Return only data supported by the source text.
If a field is missing, set it to null.
Use ISO dates when possible.
"""),
    ("human", """
Extract structured lending fields from this document:

{document_text}
"""),
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm = llm.with_structured_output(LendingDocumentFields)
chain = prompt | structured_llm

result = chain.invoke({"document_text": text})
print(result.model_dump())
```
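One practical wrinkle: multi-month bank statements can overflow the model's context window once all pages are joined. A minimal sketch of a guard is below; it is not part of the chain above, and the 4-characters-per-token ratio is a rough heuristic, not an exact tokenizer (swap in a real tokenizer such as tiktoken for precise counts).

```python
# Rough context-size guard. The 4-chars-per-token ratio is a heuristic;
# the helper name and default limit are illustrative, not a LangChain API.
def truncate_for_context(text: str, max_tokens: int = 100_000) -> str:
    max_chars = max_tokens * 4
    if len(text) <= max_chars:
        return text
    # Keep the head and tail: statements put identity details at the top
    # and totals near the end, so the middle is the least costly to drop.
    half = max_chars // 2
    return text[:half] + "\n...[truncated]...\n" + text[-half:]
```

If truncation fires often for a document type, that is a signal to split extraction per page range rather than shrink the input further.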
4) Add deterministic validation before passing data downstream
LLM output is not your final truth. For lending workflows you need rule checks that catch obvious failures before underwriting consumes the record.
```python
from datetime import datetime

def validate_lending_fields(data: LendingDocumentFields) -> list[str]:
    errors = []
    if not data.full_name.strip():
        errors.append("full_name is empty")
    if data.monthly_income is not None and data.monthly_income <= 0:
        errors.append("monthly_income must be positive")
    if data.statement_period_start and data.statement_period_end:
        # The model may return non-ISO strings despite the prompt, so
        # parse defensively instead of letting fromisoformat raise.
        try:
            start = datetime.fromisoformat(data.statement_period_start)
            end = datetime.fromisoformat(data.statement_period_end)
        except ValueError:
            errors.append("statement period dates are not valid ISO dates")
        else:
            if start > end:
                errors.append("statement period start is after end")
    return errors

errors = validate_lending_fields(result)
if errors:
    print({"status": "review_required", "errors": errors})
else:
    print({"status": "approved_for_downstream", "data": result.model_dump()})
```
Production Considerations
- Keep document processing inside approved regions
  - Lending data often has residency requirements.
  - Pin model endpoints and storage buckets to the correct region and log where each document was processed.
- Persist full audit trails
  - Store source file hash, extracted text version, prompt version, model name, output payload, and validation results.
  - If a borrower disputes a decision later, you need to reconstruct exactly what happened.
- Add human review thresholds
  - Route low-confidence cases to ops when key fields are missing or contradictory.
  - Examples: income extracted but no pay frequency; bank statement balance present but no statement period; mismatched names across documents.
- Monitor extraction quality by document type
  - Track field-level accuracy separately for bank statements, payslips, tax forms, and IDs.
  - A single aggregate score hides failures that matter in credit policy.
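The audit-trail point can be made concrete with a small record builder. This is a minimal sketch; the function name, field names, and region label are hypothetical, and your compliance team decides the final shape and storage.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical audit record builder; fields mirror the checklist above.
def build_audit_record(pdf_bytes: bytes, prompt_version: str, model_name: str,
                       output: dict, validation_errors: list[str],
                       region: str) -> dict:
    return {
        "source_sha256": hashlib.sha256(pdf_bytes).hexdigest(),
        "prompt_version": prompt_version,
        "model_name": model_name,
        "output": output,
        "validation_errors": validation_errors,
        "processing_region": region,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

record = build_audit_record(b"%PDF-1.7 ...", "extract-v3", "gpt-4o-mini",
                            {"full_name": "Jane Doe"}, [], "eu-west-1")
print(json.dumps(record, indent=2))
```

Hashing the raw bytes (not the extracted text) lets you prove later exactly which file produced a given decision, even if your OCR or prompts have since changed.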
Common Pitfalls
- Using raw LLM output without schema validation
  - This leads to malformed JSON and silent bad data entering underwriting.
  - Fix it with `with_structured_output()` plus Pydantic validation before persistence.
- Skipping OCR quality checks on scanned PDFs
  - Bad scans produce garbage text, and the model will confidently extract nonsense.
  - Run preprocessing first and reject pages with unreadable text density or broken layout.
- Treating all missing fields as equal
  - Missing employer name on a payslip is not the same as missing account balance on a bank statement.
  - Build document-type-specific rules so your review queue focuses on material gaps.
- Ignoring compliance metadata
  - If you don't store prompt versioning, region info, and source hashes, audits become painful fast.
  - Keep those fields alongside every extraction record from day one.
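The OCR quality check can start as a crude heuristic before you invest in a full preprocessing pipeline. This is a sketch under stated assumptions: the function name and thresholds are illustrative and should be tuned against your own document corpus.

```python
# Crude readability heuristic for OCR output: pages with almost no text,
# or dominated by non-alphanumeric noise, are likely bad scans.
# Thresholds are illustrative; tune them on labeled pages from your corpus.
def page_looks_readable(page_text: str, min_chars: int = 200,
                        min_alnum_ratio: float = 0.5) -> bool:
    stripped = page_text.strip()
    if len(stripped) < min_chars:
        return False
    alnum = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return alnum / len(stripped) >= min_alnum_ratio
```

Rejected pages go back through image preprocessing or to manual review instead of feeding garbage text to the extraction chain.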
Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist + starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.