How to Build a Document Extraction Agent Using LangChain in Python for Banking

By Cyprian Aarons · Updated 2026-04-21

document-extraction · langchain · python · banking

A document extraction agent for banking reads PDFs, scans, images, and email attachments, then turns them into structured fields your downstream systems can trust. In practice, that means pulling out things like account numbers, transaction dates, beneficiary names, loan terms, and signature presence without forcing ops teams to manually rekey data.

For banking, this matters because document intake is a control point. If extraction is wrong, you get bad onboarding decisions, failed KYC checks, compliance gaps, and audit headaches.

Architecture

  • Document loader

    • Ingests PDFs and image-based files from secure storage.
    • Use PyPDFLoader for text PDFs and OCR for scanned documents.
  • Text normalization layer

    • Cleans page breaks, headers, footers, and noisy OCR output.
    • Keeps the prompt focused on the actual content.
  • Extraction chain

    • Uses ChatOpenAI through LangChain to map unstructured text into a fixed schema.
    • Prefer structured outputs over free-form JSON guessing.
  • Schema validator

    • Enforces field types, required fields, and allowed values.
    • Rejects partial or malformed outputs before they hit core banking systems.
  • Audit logger

    • Stores input document hash, model version, prompt version, extracted fields, and confidence metadata.
    • Required for traceability in regulated environments.
  • Human review queue

    • Routes low-confidence or policy-sensitive cases to operations staff.
    • Critical for KYC/AML workflows where false positives and false negatives both matter.
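
The text normalization layer above can be sketched as a small pre-processing step. The heuristics here (collapsing whitespace, dropping lines that repeat across most pages as likely headers or footers) are illustrative assumptions, not a fixed recipe:

```python
import re
from collections import Counter

def normalize_pages(pages: list[str], repeat_threshold: float = 0.6) -> str:
    """Clean OCR/PDF text: drop lines that repeat on most pages (likely
    headers/footers), collapse runs of whitespace, and join pages."""
    line_counts = Counter(
        line.strip() for page in pages for line in page.splitlines() if line.strip()
    )
    min_repeats = max(2, int(len(pages) * repeat_threshold))
    cleaned_pages = []
    for page in pages:
        kept = [
            re.sub(r"[ \t]+", " ", line.strip())
            for line in page.splitlines()
            if line.strip() and line_counts[line.strip()] < min_repeats
        ]
        cleaned_pages.append("\n".join(kept))
    return "\n\n".join(cleaned_pages)
```

Real statements need tuning (column layouts, tables, multi-language text), but even this much keeps repeated bank letterhead out of the prompt.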

Implementation

1) Install the right packages

Use LangChain’s current split packages. For a simple extraction workflow you need the core library, the OpenAI chat model integration, the community document loaders, plus Pydantic and pypdf for PDF parsing.

pip install langchain langchain-openai langchain-community pydantic pypdf

Set your model key in the environment:

export OPENAI_API_KEY="your-key"

2) Define the banking schema

Do not extract into a loose dict. Define the exact contract you want downstream systems to consume.

from typing import Optional
from pydantic import BaseModel, Field

class BankDocumentExtraction(BaseModel):
    document_type: str = Field(description="Type of banking document")
    customer_name: Optional[str] = Field(default=None)
    account_number: Optional[str] = Field(default=None)
    iban: Optional[str] = Field(default=None)
    swift_code: Optional[str] = Field(default=None)
    amount: Optional[float] = Field(default=None)
    currency: Optional[str] = Field(default=None)
    issue_date: Optional[str] = Field(default=None)
    due_date: Optional[str] = Field(default=None)
    has_signature: Optional[bool] = Field(default=None)

This schema gives you validation before anything is written to a case management system or data warehouse.
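
To see what that validation buys you, here is a reduced, self-contained version of the same idea. Pydantic coerces well-formed values and rejects payloads whose types don’t match the contract, so malformed extractions fail loudly instead of flowing downstream (`MiniExtraction` is just an illustrative stand-in for the schema above):

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

# Reduced stand-in for BankDocumentExtraction, just for illustration.
class MiniExtraction(BaseModel):
    document_type: str
    amount: Optional[float] = None

# Well-typed payload parses cleanly; numeric-looking strings are coerced.
ok = MiniExtraction.model_validate({"document_type": "invoice", "amount": "1250.50"})
print(ok.amount)  # 1250.5

# A non-numeric amount fails validation before reaching downstream systems.
try:
    MiniExtraction.model_validate({"document_type": "invoice", "amount": "twelve hundred"})
except ValidationError as exc:
    print(f"rejected with {exc.error_count()} error(s)")
```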

3) Load the document and build the extraction chain

Here is the actual LangChain pattern using PyPDFLoader, ChatOpenAI, and with_structured_output.

from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.prompts import ChatPromptTemplate

loader = PyPDFLoader("bank_statement.pdf")
pages = loader.load()

document_text = "\n\n".join(page.page_content for page in pages)

prompt = ChatPromptTemplate.from_messages([
    ("system", 
     "You extract structured data from banking documents. "
     "Return only fields supported by the schema. "
     "If a field is missing or unclear, leave it null."),
    ("user", 
     "Extract key fields from this document:\n\n{document_text}")
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

structured_llm = llm.with_structured_output(BankDocumentExtraction)

chain = prompt | structured_llm

result = chain.invoke({"document_text": document_text})

print(result.model_dump())

A few things matter here:

  • temperature=0 keeps output stable.
  • with_structured_output() forces the model into your Pydantic schema.
  • The prompt explicitly tells the model to return null when unsure instead of inventing values.

4) Add validation and routing logic

Banking workflows should not trust every extraction equally. Add a simple review gate for missing critical fields.

def needs_human_review(extraction: BankDocumentExtraction) -> bool:
    # document_type is a required string in the schema, so check for an
    # empty value; the optional fields are missing when they are None.
    critical_missing = [
        not extraction.document_type,
        extraction.customer_name is None,
        extraction.account_number is None,
    ]
    return any(critical_missing)

extraction = chain.invoke({"document_text": document_text})

if needs_human_review(extraction):
    print("Route to manual review")
else:
    print("Accept extraction:", extraction.model_dump())

In production you would replace print() with an event to your workflow engine or case management queue.
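
As a sketch of that hand-off, the routing decision can be emitted as a structured audit event carrying the fields the architecture section calls for. The field names and `build_audit_event` helper are a hypothetical shape, not a real workflow API:

```python
import hashlib
import json
from datetime import datetime, timezone

def build_audit_event(document_bytes: bytes, fields: dict, needs_review: bool,
                      model_name: str, prompt_version: str) -> dict:
    """Assemble an audit record: document hash, model and prompt versions,
    extracted fields, and the routing decision."""
    return {
        "document_sha256": hashlib.sha256(document_bytes).hexdigest(),
        "model_name": model_name,
        "prompt_version": prompt_version,
        "extracted_fields": fields,
        "routed_to_review": needs_review,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

event = build_audit_event(b"%PDF-1.7 ...", {"customer_name": None}, True,
                          "gpt-4o-mini", "extract-v1")
print(json.dumps(event, indent=2))
```

Persisting this record alongside the raw document is what makes a later audit question ("why did the system accept this statement?") answerable.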

Production Considerations

  • Control data residency

    • Banking documents often contain regulated personal and financial data.
    • Keep processing in approved regions and verify your model provider’s regional hosting options before deployment.
  • Log for auditability

    • Store document checksum, source location, prompt version, model name, schema version, and final output.
    • Auditors care about reproducibility more than clever prompting.
  • Add confidence-based routing

    • Route documents with missing critical fields or inconsistent values to human review.
    • This reduces silent failure on statements, invoices, proof-of-income docs, and onboarding forms.
  • Apply redaction before model calls

    • Mask unnecessary sensitive fields like full account numbers where possible.
    • Minimize exposure of PII while still preserving enough context for extraction.
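
A minimal redaction pass for the last point might look like this, assuming account numbers appear as long digit runs. The pattern and mask format are illustrative; tune them to the document types you actually process:

```python
import re

def mask_account_numbers(text: str, keep_last: int = 4) -> str:
    """Replace long digit runs (8+ digits) with a masked form that keeps
    only the last few digits, e.g. '12345678901' -> '*******8901'."""
    def _mask(match: re.Match) -> str:
        digits = match.group(0)
        return "*" * (len(digits) - keep_last) + digits[-keep_last:]
    return re.sub(r"\d{8,}", _mask, text)

print(mask_account_numbers("Account 12345678901, ref 42"))
# Account *******8901, ref 42
```

Note the trade-off: if a field you redact is one you also want to extract, redact it only in copies sent to the model and join it back from the source record afterwards.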

Common Pitfalls

  1. Using free-form text output instead of structured output

    • Problem: The model returns JSON-like text that breaks parsers.
    • Fix: Use with_structured_output() with a Pydantic schema so LangChain validates the response shape.
  2. Trying to extract from noisy OCR without cleanup

    • Problem: Scanned documents produce garbage tokens that confuse the model.
    • Fix: Normalize text first. Remove repeated headers/footers and chunk long documents by page when needed.
  3. Skipping bank-specific guardrails

    • Problem: A valid-looking answer can still be non-compliant if it was generated from unauthorized data or stored in the wrong region.
    • Fix: Add policy checks around residency, retention, access control, and audit logging before any production rollout.
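
Chunking by page, as suggested in the second fix, can be as simple as batching consecutive pages under a size budget so each extraction call sees a bounded amount of text (the character budget here is an arbitrary illustration):

```python
def chunk_pages(pages: list[str], max_chars: int = 8000) -> list[str]:
    """Group consecutive pages into chunks that stay under a character
    budget, so each extraction call sees a bounded amount of text."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for page in pages:
        if current and current_len + len(page) > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(page)
        current_len += len(page) + 2  # account for the joining blank line
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

You would then run the extraction chain per chunk and merge results, routing any conflicting field values to the human review queue.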

A good banking extraction agent is not just an LLM call. It is a controlled pipeline that ingests documents safely, extracts only what you asked for, validates the result against a schema, and sends uncertain cases to humans instead of guessing.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
