How to Build a Document Extraction Agent Using LangChain in Python for Fintech
A document extraction agent turns messy financial documents into structured data you can validate, store, and route into downstream systems. In fintech, that means pulling fields from bank statements, invoices, KYC forms, loan applications, and trade confirmations without hand-keying every record.
Architecture
- **Document ingestion layer**
  - Accept PDFs, scanned images, and text-based files.
  - Normalize inputs before extraction so OCR and parsing behave consistently.
- **Text extraction layer**
  - Use PyPDFLoader for digital PDFs.
  - Use OCR for scanned documents before passing text into LangChain.
- **Extraction chain**
  - Use a chat model with structured output.
  - Map raw text into a typed schema with PydanticOutputParser or with_structured_output().
- **Validation layer**
  - Check required fields, formats, totals, and business rules.
  - Reject or flag records that fail compliance checks.
- **Audit and storage layer**
  - Persist the original document hash, extracted JSON, model version, and timestamp.
  - Keep an immutable audit trail for regulatory review.
- **Human review queue**
  - Route low-confidence or malformed outputs to ops teams.
  - Never auto-post sensitive financial data without validation.
Implementation
1) Define the extraction schema
Start with a strict schema. Fintech extraction fails when you treat all outputs as free text.
```python
from typing import List

from pydantic import BaseModel, Field


class InvoiceLineItem(BaseModel):
    description: str = Field(..., description="Line item description")
    quantity: float = Field(..., ge=0)
    unit_price: float = Field(..., ge=0)
    amount: float = Field(..., ge=0)


class InvoiceExtraction(BaseModel):
    vendor_name: str
    invoice_number: str
    invoice_date: str
    currency: str
    subtotal: float
    tax: float
    total: float
    line_items: List[InvoiceLineItem]
```
2) Load the document and extract text
For PDFs with embedded text, PyPDFLoader is enough. For scanned docs, add OCR upstream and feed the OCR text into the same chain.
```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("invoice.pdf")
documents = loader.load()
raw_text = "\n".join(doc.page_content for doc in documents)
```
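Whether the text comes from PyPDFLoader or an OCR engine, it often carries hard line wraps and ragged whitespace. A light normalization helper keeps prompts consistent; this is an illustrative stdlib-only addition, not part of LangChain:

```python
import re


def normalize_text(raw: str) -> str:
    """Clean up extracted text before prompting the model."""
    text = raw.replace("\r\n", "\n")
    text = re.sub(r"-\n(?=\w)", "", text)    # rejoin words hyphenated across line breaks
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # cap consecutive blank lines
    return text.strip()
```

Run it on `raw_text` before handing the text to the extraction chain.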
3) Build a structured extraction chain
This pattern uses ChatOpenAI plus with_structured_output(). It is cleaner than hand-parsing model output and gives you typed results you can validate.
```python
import os

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    api_key=os.environ["OPENAI_API_KEY"],
)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You extract invoice fields from financial documents. "
     "Return only data that is explicitly present in the document. "
     "If a field is missing, leave it blank or null."),
    ("user", "{document_text}"),
])

structured_llm = llm.with_structured_output(InvoiceExtraction)
chain = prompt | structured_llm

result = chain.invoke({"document_text": raw_text})
print(result.model_dump())
```
That gives you a validated Python object instead of brittle JSON string parsing. If the model returns malformed values, Pydantic catches it before your pipeline writes to storage or ERP systems.
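To see Pydantic doing that work in isolation, here is a trimmed-down version of the line-item model rejecting a payload with an impossible negative quantity; the bad payload is invented for illustration:

```python
from pydantic import BaseModel, Field, ValidationError


class LineItem(BaseModel):
    description: str
    quantity: float = Field(..., ge=0)  # negative quantities are rejected


# A malformed payload, as a model might emit:
bad_payload = {"description": "Wire transfer fee", "quantity": -1}

try:
    LineItem(**bad_payload)
    validation_failed = False
except ValidationError as exc:
    validation_failed = True
    print(len(exc.errors()), "validation error(s)")
```

The record never becomes a Python object, so it can never reach storage or an ERP write path.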
4) Add business-rule validation before downstream use
Fintech needs more than field extraction. You also need reconciliation checks so bad data does not enter payment or ledger flows.
```python
def validate_invoice(invoice: InvoiceExtraction) -> list[str]:
    errors = []
    computed_total = round(invoice.subtotal + invoice.tax, 2)
    if round(invoice.total, 2) != computed_total:
        errors.append(
            f"Total mismatch: expected {computed_total}, got {invoice.total}"
        )
    if not invoice.invoice_number.strip():
        errors.append("Missing invoice number")
    if invoice.currency not in {"USD", "EUR", "GBP"}:
        errors.append(f"Unsupported currency: {invoice.currency}")
    return errors


issues = validate_invoice(result)
if issues:
    print({"status": "needs_review", "issues": issues})
else:
    print({"status": "approved", "data": result.model_dump()})
```
Production Considerations
- **Compliance and auditability**
  - Store the source file checksum, extracted payload, prompt version, model name, and response timestamp.
  - This gives you traceability for SOC 2, internal audit, and regulator requests.
- **Data residency**
  - Keep documents in-region if your policy requires it.
  - If you process EU banking data or regulated customer records, make sure your model endpoint and storage location match residency requirements.
- **Monitoring**
  - Track extraction accuracy by document type.
  - Monitor rejection rates, missing-field rates, and validation failures by vendor or template.
- **Guardrails**
  - Block processing of unsupported document classes.
  - Redact account numbers and SSNs/NINs where possible before logging prompts or outputs.
  - Require human approval for high-value payments or KYC exceptions.
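The compliance bullet can be made concrete with a small audit-record builder. A stdlib-only sketch; the field names are assumptions to adapt to your own schema:

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(
    source_bytes: bytes,
    extracted: dict,
    model_name: str,
    prompt_version: str,
) -> dict:
    """Audit record: file checksum, payload, model/prompt versions, timestamp."""
    return {
        "source_sha256": hashlib.sha256(source_bytes).hexdigest(),
        "extracted_payload": extracted,
        "model_name": model_name,
        "prompt_version": prompt_version,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }


record = build_audit_record(b"pdf bytes", {"total": 100.0}, "gpt-4o-mini", "v1")
print(json.dumps(record, indent=2))
```

Write each record to append-only storage so the trail stays immutable for regulatory review.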
Common Pitfalls
- **Using free-form output instead of a schema**
  - This leads to parsing bugs and silent failures.
  - Fix it by using with_structured_output() or a strict parser like PydanticOutputParser.
- **Skipping post-extraction validation**
  - LLMs are good at extraction but not at accounting controls.
  - Always compare totals, check required fields, and enforce allowed value sets.
- **Logging sensitive raw documents**
  - Debug logs often become compliance incidents.
  - Redact PII/PCI data before logging and keep audit logs separate from application logs.
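For the redaction pitfall, even a simple regex pass before logging helps. A rough sketch; the patterns are illustrative, since real account formats vary by institution:

```python
import re

ACCOUNT_RE = re.compile(r"\b\d{8,17}\b")        # bare account-like digit runs
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # US SSN format


def redact(text: str) -> str:
    """Mask account numbers and SSNs before text reaches application logs."""
    text = SSN_RE.sub("[REDACTED-SSN]", text)
    text = ACCOUNT_RE.sub("[REDACTED-ACCT]", text)
    return text
```

Call `redact()` on any prompt or model output before it hits a logger, and keep unredacted copies only in the access-controlled audit store.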
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.