How to Build a Document Extraction Agent Using LangChain in Python for Payments

By Cyprian Aarons. Updated 2026-04-21


A document extraction agent for payments reads invoices, bank statements, remittance advices, and payment requests, then turns them into structured data your payment system can validate and process. It matters because the bottleneck in payments is rarely the transfer itself; it is the manual review, keying errors, and exception handling around unstructured documents.

Architecture

• Document ingestion layer
  - Accept PDFs, scans, email attachments, or OCR output.
  - Normalize file metadata early: source system, tenant, jurisdiction, and retention policy.
• Text extraction layer
  - Use a loader like PyPDFLoader for digital PDFs.
  - For scanned docs, route OCR output into the same downstream pipeline.
• Extraction chain
  - Use ChatPromptTemplate plus a structured output model.
  - Force the LLM to return payment fields like invoice number, amount, currency, beneficiary name, IBAN/account number, due date, and confidence notes.
• Validation layer
  - Validate extracted fields against business rules.
  - Check currency formats, totals, duplicate invoice IDs, sanctioned entities, and country-specific account formats.
• Audit and storage layer
  - Persist raw text, extracted JSON, model version, prompt version, and timestamps.
  - Keep an immutable audit trail for compliance reviews and dispute handling.
• Human review queue
  - Route low-confidence or policy-flagged documents to ops staff.
  - Never auto-release payment instructions from unverified extraction alone.
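
The audit and storage layer described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not a definitive implementation: `build_audit_record`, the record layout, and the version labels are hypothetical, and a real system would write the record to an append-only store and keep the raw text itself in a separate access-controlled location.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(raw_text: str, extracted: dict,
                       model_version: str, prompt_version: str) -> dict:
    """Assemble one audit entry for a processed document (hypothetical layout)."""
    # Canonical JSON (sorted keys) so the same extraction always hashes the same.
    extracted_json = json.dumps(extracted, sort_keys=True)
    return {
        # The hash links this entry to the raw text stored elsewhere.
        "raw_text_sha256": hashlib.sha256(raw_text.encode("utf-8")).hexdigest(),
        "extracted_json": extracted_json,
        "extracted_sha256": hashlib.sha256(extracted_json.encode("utf-8")).hexdigest(),
        "model_version": model_version,
        "prompt_version": prompt_version,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


record = build_audit_record(
    raw_text="INVOICE INV-1001 Total EUR 1250.00",
    extracted={"invoice_number": "INV-1001", "amount": 1250.0, "currency": "EUR"},
    model_version="gpt-4o-mini",  # assumed label; record whatever you deploy
    prompt_version="extract-v1",  # hypothetical prompt identifier
)
print(record["raw_text_sha256"][:12], record["created_at"])
```

Hashing the canonical JSON means a reviewer can later confirm that a stored extraction was not altered after the fact.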

Implementation

1) Install the right packages

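LangChain is distributed as split packages. For OpenAI-backed extraction you need the core chain primitives (langchain-core, installed as a dependency of the packages below), the OpenAI provider package (langchain-openai), and, for the PDF loader used later, langchain-community with pypdf.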

pip install langchain langchain-openai langchain-community pydantic pypdf

Set your API key before running anything:

export OPENAI_API_KEY="your-key"

2) Define the payment schema you want back

For payments work, don’t ask for “summary” or “important details”. Define exact fields that your downstream system can validate.

from typing import Optional
from pydantic import BaseModel, Field

class PaymentDocument(BaseModel):
    document_type: str = Field(description="Invoice, remittance advice, bank statement, or payment request")
    vendor_name: Optional[str] = Field(default=None)
    invoice_number: Optional[str] = Field(default=None)
    invoice_date: Optional[str] = Field(default=None)
    due_date: Optional[str] = Field(default=None)
    amount: Optional[float] = Field(default=None)
    currency: Optional[str] = Field(default=None)
    beneficiary_account: Optional[str] = Field(default=None)
    beneficiary_name: Optional[str] = Field(default=None)
    reference: Optional[str] = Field(default=None)
    confidence_notes: str = Field(description="Short notes on ambiguity or missing fields")
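
Because the schema above carries `amount` as a float and dates as plain strings, a downstream step will usually normalize them before validation. The following is a minimal sketch, assuming dates arrive in ISO 8601 form (such as `2026-04-21`) and using a small illustrative table of currency minor units. `normalize_amount`, `parse_iso_date`, and `MINOR_UNITS` are hypothetical names, not part of LangChain or Pydantic.

```python
from datetime import date
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional

# Illustrative subset of ISO 4217 minor units; extend from an authoritative list.
MINOR_UNITS = {"EUR": 2, "USD": 2, "GBP": 2, "JPY": 0, "KWD": 3}


def normalize_amount(amount: float, currency: str) -> Decimal:
    """Convert a float amount to a Decimal rounded to the currency's minor units."""
    digits = MINOR_UNITS.get(currency.upper(), 2)  # assumption: default to 2 places
    quantum = Decimal(10) ** -digits               # Decimal("0.01") for 2 digits
    # Go through str() so 1250.1 becomes Decimal("1250.1"), not its binary expansion.
    return Decimal(str(amount)).quantize(quantum, rounding=ROUND_HALF_UP)


def parse_iso_date(value: Optional[str]) -> Optional[date]:
    """Return a date for ISO-formatted strings, or None if missing or unparseable."""
    if not value:
        return None
    try:
        return date.fromisoformat(value)
    except ValueError:
        return None


print(normalize_amount(1250.1, "EUR"), parse_iso_date("2026-04-21"))
```

Returning `None` for an unparseable date keeps the "missing means missing" rule intact: the validation step can flag the field instead of guessing.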

3) Build a LangChain extraction chain

This pattern uses ChatOpenAI, ChatPromptTemplate, and with_structured_output(). That gives you typed output instead of brittle free-text parsing.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.document_loaders import PyPDFLoader

# Load a digitally generated PDF; each page becomes one Document.
loader = PyPDFLoader("sample_invoice.pdf")
docs = loader.load()

# Join page texts so the model sees the whole document in one prompt.
text = "\n\n".join(doc.page_content for doc in docs)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You extract payment-relevant fields from financial documents. "
     "Return only fields defined in the schema. "
     "If a field is missing or unclear, set it to null and explain briefly in confidence_notes."),
    ("human", "Extract structured data from this document:\n\n{text}")
])

# temperature=0 keeps extraction as repeatable as the model allows.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# with_structured_output makes the chain return a PaymentDocument instance.
chain = prompt | llm.with_structured_output(PaymentDocument)

result = chain.invoke({"text": text})
print(result.model_dump())

4) Add deterministic validation before routing to payments

LLM extraction alone is not enough. Payments systems need rule-based checks before anything reaches ERP or treasury workflows.

def validate_payment(doc: PaymentDocument) -> list[str]:
    errors = []

    if doc.amount is None or doc.amount <= 0:
        errors.append("amount must be present and greater than zero")

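    # Format check only; compare against an ISO 4217 code list for full validation.
    if not doc.currency or len(doc.currency) != 3 or not doc.currency.isalpha():
        errors.append("currency must be a three-letter ISO 4217 code")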

    if doc.document_type.lower() not in {"invoice", "remittance advice", "bank statement", "payment request"}:
        errors.append("unsupported document_type")

    if doc.beneficiary_account is None and doc.document_type.lower() == "payment request":
        errors.append("payment requests require beneficiary_account")

    return errors

errors = validate_payment(result)
if errors:
    print({"status": "review", "errors": errors})
else:
    print({"status": "ready_for_payment", "data": result.model_dump()})
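
Two of the checks named in the architecture, account format and duplicate invoice IDs, can also be made deterministic. The sketch below assumes the beneficiary account is an IBAN and applies the standard mod-97 checksum; it does not verify per-country lengths or that the account exists. `is_valid_iban` and `is_duplicate_invoice` are hypothetical helpers, and the in-memory set stands in for whatever ledger or database lookup you actually use.

```python
def is_valid_iban(iban: str) -> bool:
    """Check basic IBAN structure and the mod-97 checksum (not account existence)."""
    compact = iban.replace(" ", "").upper()
    if not compact.isascii() or not compact.isalnum():
        return False
    if not 15 <= len(compact) <= 34:
        return False
    if not (compact[:2].isalpha() and compact[2:4].isdigit()):
        return False
    # Move the first four characters to the end, then map A=10 ... Z=35.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def is_duplicate_invoice(seen: set[tuple[str, str]],
                         vendor_name: str, invoice_number: str) -> bool:
    """Return True if this vendor/invoice pair was already seen; otherwise record it."""
    key = (vendor_name.strip().casefold(), invoice_number.strip().upper())
    if key in seen:
        return True
    seen.add(key)
    return False


seen_invoices: set[tuple[str, str]] = set()
print(is_valid_iban("GB82 WEST 1234 5698 7654 32"))                # well-known example IBAN
print(is_duplicate_invoice(seen_invoices, "Acme Ltd", "INV-1001"))  # first sighting
print(is_duplicate_invoice(seen_invoices, "acme ltd", "inv-1001"))  # same pair again
```

Normalizing case and whitespace in the duplicate key matters because the same invoice often arrives twice with slightly different formatting.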

Production Considerations

• Compliance first
  - Store prompt version, model version, raw input text hash, extracted output, and reviewer actions.
  - For PCI DSS-adjacent flows or sensitive banking docs, redact card data and account numbers where possible before logging.
• Data residency
  - Keep document processing inside approved regions.
  - If your bank operates across jurisdictions, route EU documents to EU-hosted infrastructure and keep retention policies region-specific.
• Monitoring
  - Track extraction accuracy by document type.
  - Monitor null rates on critical fields like amount, currency, invoice_number, and beneficiary_account.
  - Alert on spikes in human review rate; that usually means OCR quality dropped or templates changed upstream.
• Guardrails
  - Block auto-processing when confidence notes mention ambiguity around amount or beneficiary details.
  - Add sanctions screening and duplicate-payment checks after extraction but before release.
  - Never let the model infer missing account numbers from context; missing means missing.
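
The monitoring and guardrail points above can be approximated with a few lines of plain Python. This is a rough sketch, not a tuned policy: the keyword lists are hypothetical and deliberately crude, and `blocks_auto_processing` and `null_rates` are illustrative names. In practice you would calibrate the terms against the confidence notes your own prompts produce.

```python
CRITICAL_FIELDS = ("amount", "currency", "invoice_number", "beneficiary_account")

# Hypothetical trigger words; tune against your own confidence_notes.
AMBIGUITY_TERMS = ("ambiguous", "unclear", "illegible", "conflict", "multiple", "missing")
SENSITIVE_TOPICS = ("amount", "total", "beneficiary", "account", "iban")


def blocks_auto_processing(confidence_notes: str) -> bool:
    """True when the notes mention ambiguity AND an amount or beneficiary topic."""
    notes = confidence_notes.casefold()
    mentions_ambiguity = any(term in notes for term in AMBIGUITY_TERMS)
    mentions_sensitive = any(topic in notes for topic in SENSITIVE_TOPICS)
    return mentions_ambiguity and mentions_sensitive


def null_rates(records: list[dict]) -> dict[str, float]:
    """Fraction of records in which each critical field is missing or None."""
    if not records:
        return {field: 0.0 for field in CRITICAL_FIELDS}
    return {
        field: sum(1 for r in records if r.get(field) is None) / len(records)
        for field in CRITICAL_FIELDS
    }


batch = [
    {"amount": 10.0, "currency": "EUR", "invoice_number": None, "beneficiary_account": None},
    {"amount": None, "currency": "EUR", "invoice_number": "INV-2", "beneficiary_account": None},
]
print(null_rates(batch))
print(blocks_auto_processing("Total amount is unclear: two totals on page 2"))
```

Tracking `null_rates` per document type over time gives an early signal that OCR quality or an upstream template has changed.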

Common Pitfalls

• Using free-text output instead of structured output
  - This breaks as soon as templates vary.
  - Fix it with with_structured_output() and a strict Pydantic schema.
• Skipping validation after extraction
  - LLMs can produce plausible but invalid values like malformed currencies or swapped totals.
  - Fix it with deterministic checks for format, range, duplicates, and jurisdiction rules.
• Logging sensitive payloads everywhere
  - Payment docs often contain bank details and personal data.
  - Fix it by redacting logs, hashing raw inputs for traceability, and storing full content only in controlled systems with audit access.
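
One way to apply the log-redaction fix is a filter on the standard `logging` handler, sketched below. The regular expression is an assumption and deliberately broad: it masks unbroken IBAN-like tokens and runs of eight or more digits, but it will not catch numbers printed with spaces between groups, so treat it as a starting point. `RedactingFilter`, `redact_for_logging`, and the logger name are hypothetical.

```python
import logging
import re

# Illustrative pattern: IBAN-like tokens or runs of 8+ digits, without spaces.
ACCOUNT_PATTERN = re.compile(r"\b(?:[A-Z]{2}\d{2}[A-Z0-9]{11,30}|\d{8,})\b")


def mask_account(value: str) -> str:
    """Keep the last four characters and mask the rest."""
    return "*" * (len(value) - 4) + value[-4:]


def redact_for_logging(text: str) -> str:
    """Replace every account-like token in the text with its masked form."""
    return ACCOUNT_PATTERN.sub(lambda m: mask_account(m.group(0)), text)


class RedactingFilter(logging.Filter):
    """Redact account-like tokens from each record before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its arguments first, then redact the result.
        record.msg = redact_for_logging(record.getMessage())
        record.args = ()
        return True


logger = logging.getLogger("payments.extraction")  # hypothetical logger name
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("beneficiary_account=%s", "DE89370400440532013000")
```

Attaching the filter to the handler, not to individual call sites, means a forgotten redaction in application code does not leak the full value into that handler's output.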


By Cyprian Aarons, AI Consultant at Topiax.
