How to Build a Document Extraction Agent Using LangChain in Python for Investment Banking

By Cyprian Aarons · Updated 2026-04-21

document-extraction · langchain · python · investment-banking

A document extraction agent for investment banking reads deal docs, extracts structured fields, and returns machine-usable output with traceable evidence. That matters because analysts spend too much time pulling terms from pitch books, CIMs, credit agreements, term sheets, and KYC packs, and every manual pass introduces risk in compliance, valuation, and downstream reporting.

Architecture

  • Document ingestion layer

    • Pull PDFs, DOCX, emails, and scans from approved sources like SharePoint, S3, or an internal DMS.
    • Normalize filenames, metadata, deal IDs, and retention tags before processing.
  • Text extraction layer

    • Use PyPDFLoader, UnstructuredPDFLoader, or OCR-backed preprocessing for scanned files.
    • Preserve page numbers and section boundaries so extracted fields can be traced back.
  • LLM extraction chain

    • Use LangChain’s ChatPromptTemplate, PydanticOutputParser, and a chat model like ChatOpenAI.
    • Force structured output for fields such as issuer name, facility amount, maturity date, covenants, jurisdiction, and key risks.
  • Validation and guardrail layer

    • Validate outputs with Pydantic models.
    • Reject missing critical fields or low-confidence parses before sending results downstream.
  • Audit and persistence layer

    • Store source document hash, extracted JSON, prompt version, model version, timestamp, and page references.
    • Keep this immutable for compliance review and model governance.
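The audit layer above can be sketched as a small record builder. This is a minimal sketch, not a fixed standard: the field names and the choice of SHA-256 are illustrative assumptions you should align with your bank's governance schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def build_audit_record(doc_bytes: bytes, extracted: dict, *,
                       prompt_version: str, model_version: str,
                       page_refs: list[int]) -> dict:
    """Assemble one immutable audit record per extraction run.

    Field names are illustrative; align them with your governance
    schema before relying on this in production.
    """
    return {
        "source_sha256": hashlib.sha256(doc_bytes).hexdigest(),
        "extracted_json": json.dumps(extracted, sort_keys=True),
        "prompt_version": prompt_version,
        "model_version": model_version,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "page_references": page_refs,
    }
```

Write these records to append-only storage so compliance can replay any extraction decision later.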

Implementation

1) Install the right packages

Use LangChain’s current split packages. For a production service you want explicit dependencies instead of a monolith.

pip install langchain langchain-openai langchain-community pydantic pypdf

2) Define the extraction schema

For investment banking, don’t ask the model for “summary.” Ask for exact fields your workflow needs. Pydantic gives you validation before anything lands in a deal system.

from typing import List, Optional
from pydantic import BaseModel, Field

class DealExtraction(BaseModel):
    issuer_name: str = Field(description="Legal name of the issuer or borrower")
    document_type: str = Field(description="Type of document such as term sheet or credit agreement")
    facility_amount_usd: Optional[float] = Field(default=None, description="Principal amount in USD")
    maturity_date: Optional[str] = Field(default=None, description="Maturity date in ISO format if available")
    governing_law: Optional[str] = Field(default=None)
    key_covenants: List[str] = Field(default_factory=list)
    page_references: List[int] = Field(default_factory=list)
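Pydantic enforces types, but the guardrail layer should also fail closed when critical fields come back empty. A minimal sketch operating on the `model_dump()` output; which fields count as critical is an assumption you should set per document type:

```python
class ExtractionRejected(Exception):
    """Raised when a parse fails the critical-field guard."""

# Assumed critical fields; tune this per document type.
CRITICAL_FIELDS = ("issuer_name", "document_type", "facility_amount_usd")

def enforce_critical_fields(extracted: dict) -> dict:
    """Fail closed: raise instead of passing a partial parse downstream."""
    missing = [f for f in CRITICAL_FIELDS
               if extracted.get(f) in (None, "", [])]
    if missing:
        raise ExtractionRejected(f"Missing critical fields: {missing}")
    return extracted
```

Raising instead of returning a flag means a forgotten check upstream cannot silently push a bad parse into a deal system.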

3) Build the LangChain extraction chain

This pattern uses PyPDFLoader to load pages with metadata, ChatPromptTemplate to constrain the task, and PydanticOutputParser to enforce structure. The model is asked to return only what it can support from the text.

from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser

parser = PydanticOutputParser(pydantic_object=DealExtraction)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You extract structured data from investment banking documents. "
     "Only use facts present in the provided text. "
     "If a field is not present, leave it null or empty."),
    ("human",
     "Extract the required fields from this document text.\n\n"
     "{format_instructions}\n\n"
     "Document text:\n{text}")
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

loader = PyPDFLoader("credit_agreement.pdf")
pages = loader.load()

text = "\n\n".join(
    f"[Page {doc.metadata.get('page', 0) + 1}] {doc.page_content}"
    for doc in pages
)

chain = prompt | llm | parser

result: DealExtraction = chain.invoke({
    "text": text,
    "format_instructions": parser.get_format_instructions()
})

print(result.model_dump())
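Long credit agreements can exceed a model's context window, so the single `chain.invoke` above will not always work on the full text. One workaround is to batch pages, extract per batch, and merge the results; the batch size and the first-non-empty merge rule here are assumptions, not the only reasonable policy:

```python
from typing import Iterator

def batch_pages(page_texts: list[str], batch_size: int = 20) -> Iterator[str]:
    """Yield page batches joined into one prompt-sized text block."""
    for i in range(0, len(page_texts), batch_size):
        yield "\n\n".join(page_texts[i:i + batch_size])

def merge_extractions(results: list[dict]) -> dict:
    """Keep the first non-empty value seen for each field across batches."""
    merged: dict = {}
    for result in results:
        for key, value in result.items():
            if key not in merged or merged[key] in (None, "", []):
                merged[key] = value
    return merged
```

You would call `chain.invoke` once per batch, collect each `model_dump()`, and pass the list to `merge_extractions`. For conflicting values across batches (e.g. two different maturity dates), flag for human review instead of merging silently.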

4) Add page-level provenance for auditability

In banking you need traceability. If an analyst asks where “maturity date” came from, you should point to a page number and ideally a text span. A simple first pass is page-level provenance.

def extract_with_provenance(pdf_path: str) -> dict:
    loader = PyPDFLoader(pdf_path)
    docs = loader.load()

    page_texts = []
    for doc in docs:
        page_num = doc.metadata.get("page", 0) + 1
        page_texts.append(f"[Page {page_num}] {doc.page_content}")

    full_text = "\n\n".join(page_texts)
    parsed: DealExtraction = chain.invoke({
        "text": full_text,
        "format_instructions": parser.get_format_instructions()
    })

    return {
        "document": pdf_path,
        "extracted": parsed.model_dump(),
        "source_pages": [d.metadata.get("page", 0) + 1 for d in docs],
        "model": llm.model_name,
    }

Production Considerations

  • Data residency

    • Keep documents in-region if your bank requires it.
    • Use a model endpoint that satisfies your legal entity’s residency rules and vendor approval process.
  • Compliance logging

    • Store prompt template version, model version, input hash, output JSON, and human override events.
    • This gives audit teams something usable when they review extraction decisions.
  • Guardrails

    • Block extraction on sensitive classes of documents unless they are explicitly approved.
    • Validate against a strict schema and fail closed if required fields are missing or inconsistent.
  • Operational monitoring

    • Track parse success rate, null-field rate by document type, latency per page count, and manual correction rate.
    • Spikes usually mean template drift or new document formats from a new counterparty.
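The null-field rate is straightforward to compute from the extraction logs described above. A sketch, assuming each log entry carries the extracted dict and a `document_type` (the log shape is an assumption):

```python
from collections import defaultdict

def null_field_rate_by_type(logs: list[dict]) -> dict[str, float]:
    """Fraction of extracted fields that came back null/empty, per doc type."""
    # doc_type -> [null_count, total_field_count]
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for entry in logs:
        doc_type = entry["document_type"]
        for value in entry["extracted"].values():
            totals[doc_type][1] += 1
            if value in (None, "", []):
                totals[doc_type][0] += 1
    return {t: nulls / count for t, (nulls, count) in totals.items()}
```

Run this per day or per counterparty and alert on deviation from a baseline; a sudden jump for one document type is a cheap early signal of format drift.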

Common Pitfalls

  1. Using free-form summaries instead of schemas

    • This creates output that looks good but breaks downstream systems.
    • Fix it by using PydanticOutputParser or .with_structured_output() on supported chat models.
  2. Ignoring source traceability

    • If you cannot show where a field came from, compliance will reject the workflow.
    • Fix it by preserving page metadata from loaders like PyPDFLoader and storing immutable extraction logs.
  3. Sending raw scans straight to the LLM

    • OCR errors will produce bad fields on signed term sheets and scanned exhibits.
    • Fix it by running OCR first when needed and segmenting documents by section or page before extraction.
  4. Treating every document the same

    • A credit agreement is not a CIM is not a KYC form.
    • Fix it by routing documents through type-specific prompts and schemas so each extractor targets the right fields.
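The routing fix in pitfall 4 can be sketched as a registry mapping document types to type-specific extractors. The types and field lists below are illustrative; in the full pipeline each entry would also carry its own prompt template and Pydantic schema:

```python
# Illustrative registry: each document type gets its own field list.
EXTRACTOR_REGISTRY = {
    "credit_agreement": ["issuer_name", "facility_amount_usd",
                         "maturity_date", "key_covenants"],
    "cim": ["issuer_name", "sector", "ebitda_usd"],
    "kyc_form": ["issuer_name", "jurisdiction", "beneficial_owners"],
}

def route_document(document_type: str) -> list[str]:
    """Pick the type-specific field list; fail closed on unknown types."""
    try:
        return EXTRACTOR_REGISTRY[document_type]
    except KeyError:
        raise ValueError(f"No extractor registered for: {document_type}")
```

Failing on unknown types, rather than falling back to a generic extractor, forces new counterparty formats through explicit review before they enter the pipeline.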

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.
