How to Build a Document Extraction Agent Using LangGraph in Python for Investment Banking
A document extraction agent for investment banking reads deal documents, pulls out structured fields, validates them, and hands the results to downstream systems with an audit trail. That matters because bankers deal with dense, often scanned documents such as term sheets, offering memoranda, credit agreements, and KYC packs, where a missed clause or a wrong number can create compliance risk, delayed closings, or bad data in the deal pipeline.
Architecture
- Document ingestion layer
  - Accepts PDFs, images, and text files from secure storage or a DMS.
  - Normalizes file metadata such as deal ID, client name, jurisdiction, and source system.
- OCR and text extraction node
  - Uses OCR for scans and direct text extraction for digital PDFs.
  - Preserves page boundaries and line references for auditability.
- Extraction node
  - Calls an LLM to extract target fields such as issuer name, facility size, maturity date, covenants, fees, and governing law.
  - Outputs strict JSON that matches a schema.
- Validation and policy node
  - Checks required fields, date formats, currency normalization, and confidence thresholds.
  - Flags sensitive or restricted content based on internal policy.
- Human review routing
  - Sends low-confidence or high-risk documents to an analyst queue.
  - Keeps the machine output and reviewer edits side by side.
- Persistence and audit layer
  - Stores raw input hashes, model outputs, validation errors, and final approved records.
  - Supports replay for model governance and regulatory review.
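The ingestion and audit layers above can be sketched with nothing but the standard library: hash the raw bytes and attach normalized metadata so every later extraction can be traced back to the exact input. The field names (`deal_id`, `source_system`) and the choice of SHA-256 are assumptions for illustration, not a fixed standard.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass
class DocumentRecord:
    """Normalized metadata plus a content hash for the audit trail."""
    deal_id: str
    client_name: str
    jurisdiction: str
    source_system: str
    sha256: str


def ingest(raw_bytes: bytes, deal_id: str, client_name: str,
           jurisdiction: str, source_system: str) -> DocumentRecord:
    # Hash the raw input so every downstream extraction can be tied
    # back to the exact bytes that produced it.
    digest = hashlib.sha256(raw_bytes).hexdigest()
    return DocumentRecord(deal_id, client_name, jurisdiction,
                          source_system, digest)


record = ingest(b"Term sheet ...", "D-1042", "Northstar", "NY", "dms")
print(json.dumps(asdict(record), indent=2))
```

Storing the hash rather than trusting filenames means a re-uploaded or silently edited document produces a visibly different record.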
Implementation
1) Define the extraction schema and graph state
Use TypedDict for state shape and Pydantic for output validation. In investment banking, this is where you lock down the fields you actually want downstream systems to trust.
```python
from __future__ import annotations

from typing import TypedDict, Optional

from pydantic import BaseModel, Field
from langgraph.graph import StateGraph, START, END
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI


class DealTerms(BaseModel):
    """Schema the LLM must fill in; only these fields reach downstream systems."""
    deal_name: str = Field(description="Name of the transaction")
    issuer: str
    currency: str
    facility_size: Optional[float] = None
    maturity_date: Optional[str] = None
    governing_law: Optional[str] = None
    confidence: float = Field(ge=0.0, le=1.0)


class AgentState(TypedDict):
    document_text: str
    extracted: Optional[dict]
    validated: bool
    needs_review: bool
    notes: list[str]


llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
```
2) Build the extraction and validation nodes
The key pattern is simple: one node extracts structured data with `with_structured_output`, and another node validates it against business rules. For banking workflows, keep validation separate from extraction so you can explain failures cleanly.
```python
def extract_terms(state: AgentState) -> dict:
    prompt = (
        "Extract deal terms from this investment banking document. "
        "Return only fields that are present or strongly implied. "
        f"Document:\n{state['document_text']}"
    )
    structured_llm = llm.with_structured_output(DealTerms)
    result = structured_llm.invoke([HumanMessage(content=prompt)])
    return {
        "extracted": result.model_dump(),
        "notes": state["notes"] + ["Extraction completed"],
        "validated": False,
        "needs_review": False,
    }


def validate_terms(state: AgentState) -> dict:
    extracted = state["extracted"] or {}
    notes = list(state["notes"])
    needs_review = False

    required_fields = ["deal_name", "issuer", "currency", "confidence"]
    missing = [f for f in required_fields if not extracted.get(f)]
    if missing:
        notes.append(f"Missing required fields: {missing}")
        needs_review = True

    confidence = extracted.get("confidence", 0.0)
    if confidence < 0.85:
        notes.append(f"Low confidence score: {confidence}")
        needs_review = True

    if extracted.get("currency") not in {"USD", "EUR", "GBP"}:
        notes.append("Unsupported or unnormalized currency")
        needs_review = True

    return {
        "validated": not needs_review,
        "needs_review": needs_review,
        "notes": notes,
    }
```
3) Add routing for human review using LangGraph conditional edges
This is where LangGraph earns its keep. You route documents with weak signals to a reviewer path instead of forcing every case through automation.
```python
def route_decision(state: AgentState) -> str:
    return "human_review" if state["needs_review"] else "done"


def human_review(state: AgentState) -> dict:
    # Replace this with your analyst queue integration.
    # In production this could create a task in ServiceNow or Jira.
    notes = list(state["notes"])
    notes.append("Routed to human review")
    return {"notes": notes}


graph = StateGraph(AgentState)
graph.add_node("extract_terms", extract_terms)
graph.add_node("validate_terms", validate_terms)
graph.add_node("human_review", human_review)

graph.add_edge(START, "extract_terms")
graph.add_edge("extract_terms", "validate_terms")
graph.add_conditional_edges(
    "validate_terms",
    route_decision,
    {
        "human_review": "human_review",
        "done": END,
    },
)
graph.add_edge("human_review", END)

app = graph.compile()
```
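If you want to see the control flow without LangGraph, the conditional edge reduces to an ordinary dispatch decision. The stub below uses a hard-coded fake extraction step in place of the LLM call; it is only an illustration of the routing semantics, not a replacement for the compiled graph.

```python
def fake_extract(state: dict) -> dict:
    # Stand-in for the LLM node: pretend extraction succeeded,
    # but with low confidence so the review branch fires.
    state["extracted"] = {"confidence": 0.4}
    return state


def fake_validate(state: dict) -> dict:
    state["needs_review"] = state["extracted"]["confidence"] < 0.85
    return state


def fake_review(state: dict) -> dict:
    state.setdefault("notes", []).append("Routed to human review")
    return state


def run(state: dict) -> dict:
    state = fake_validate(fake_extract(state))
    # This branch is exactly what add_conditional_edges encodes.
    if state["needs_review"]:
        state = fake_review(state)
    return state


final = run({})
print(final["needs_review"])  # True, because confidence 0.4 < 0.85
```

The value of the graph form over this hand-rolled loop is that LangGraph persists state between nodes and makes the routing inspectable and replayable.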
4) Invoke the agent with document text and inspect the result
In production you would feed OCR output or parsed PDF text here. Keep the raw text immutable so you can reproduce every extraction later.
```python
sample_doc = """
Issuer: Northstar Capital Holdings Ltd.
Deal Name: Northstar Term Loan B Refinancing
Currency: USD
Facility Size: $750 million
Maturity Date: June 30, 2031
Governing Law: New York
"""

result = app.invoke(
    {
        "document_text": sample_doc,
        "extracted": None,
        "validated": False,
        "needs_review": False,
        "notes": [],
    }
)

print(result["extracted"])
print(result["validated"])
print(result["needs_review"])
print(result["notes"])
```
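Note that the sample document carries values like "$750 million" and "June 30, 2031" that downstream systems will want as numbers and ISO dates. A deterministic normalizer keeps that conversion out of the LLM; the unit map below is an assumption you would extend for your own document set.

```python
import re
from datetime import datetime

UNITS = {"thousand": 1e3, "million": 1e6, "billion": 1e9}


def normalize_amount(text: str) -> float:
    """Turn strings like '$750 million' into a plain float."""
    match = re.search(r"([\d.,]+)\s*(thousand|million|billion)?", text.lower())
    if not match:
        raise ValueError(f"Unparseable amount: {text!r}")
    value = float(match.group(1).replace(",", ""))
    return value * UNITS.get(match.group(2), 1)


def normalize_date(text: str) -> str:
    """Turn 'June 30, 2031' into ISO 8601 '2031-06-30'."""
    return datetime.strptime(text.strip(), "%B %d, %Y").date().isoformat()


print(normalize_amount("$750 million"))  # 750000000.0
print(normalize_date("June 30, 2031"))   # 2031-06-30
```

Running this after extraction (and before persistence) means the validation node compares like with like instead of string-matching free text.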
Production Considerations
- Data residency
  - Keep document processing inside approved regions.
  - If your bank requires EU-only or US-only handling, pin model endpoints and storage accordingly.
- Auditability
  - Persist input hashes, prompt versions, model version IDs, extracted JSON, validation results, and reviewer overrides.
  - For regulated workflows you need replayable traces per document version.
- Guardrails
  - Block extraction from documents that contain restricted client data unless the user or session has the right entitlements.
  - Add schema validation plus rule-based checks before any record reaches CRM or deal systems.
- Monitoring
  - Track extraction accuracy by document type: credit agreements behave differently from pitchbooks or offering memoranda.
  - Watch drift in confidence scores and reviewer override rates; those are early signs your prompts or OCR pipeline are degrading.
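The confidence-drift signal can be tracked with nothing more than a rolling window over recent extractions. The window size and alert floor below are illustrative values, not recommendations.

```python
from collections import deque


class DriftMonitor:
    """Rolling mean of confidence scores with a simple alert floor."""

    def __init__(self, window: int = 100, floor: float = 0.8):
        self.scores: deque = deque(maxlen=window)
        self.floor = floor

    def record(self, confidence: float) -> None:
        self.scores.append(confidence)

    def rolling_mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    def alert(self) -> bool:
        # Fire only once the window is full and the mean sags below the floor,
        # so a single bad document does not page anyone.
        return (len(self.scores) == self.scores.maxlen
                and self.rolling_mean() < self.floor)


monitor = DriftMonitor(window=5, floor=0.8)
for score in [0.9, 0.85, 0.7, 0.72, 0.68]:
    monitor.record(score)
print(monitor.alert())  # True: rolling mean 0.77 is below the 0.8 floor
```

The same structure works for reviewer override rates: record a 1 for each override and a 0 for each pass-through, and alert on the rolling mean.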
Common Pitfalls
- Treating OCR as solved
  - Scanned signature pages and low-quality PDFs will break naive extraction.
  - Use OCR confidence scores and page-level fallbacks before calling the LLM.
- Skipping deterministic validation
  - An LLM can infer a plausible maturity date that is wrong.
  - Validate dates, currencies, numeric ranges, and jurisdiction against hard rules before persisting anything.
- No separation between machine output and final record
  - If you overwrite source-of-truth fields directly from the model output, you lose governance.
  - Store raw extraction separately from reviewed canonical data so compliance teams can trace every change.
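One way to enforce that separation is to never mutate the raw extraction at all: build the canonical record by layering reviewer overrides on top of a copy, and keep the raw output, the override delta, and the result side by side. The structure below is a sketch; your schema and storage layer will differ.

```python
from copy import deepcopy


def build_canonical(raw_extraction: dict, reviewer_overrides: dict) -> dict:
    """Combine model output and reviewer edits without touching either."""
    canonical = deepcopy(raw_extraction)
    canonical.update(reviewer_overrides)
    return {
        "raw_extraction": raw_extraction,          # untouched model output
        "reviewer_overrides": reviewer_overrides,  # the exact delta applied
        "canonical": canonical,                    # what downstream systems read
    }


record = build_canonical(
    {"issuer": "Northstar Capital Holdings Ltd", "currency": "US Dollar"},
    {"currency": "USD"},  # analyst normalized the currency
)
print(record["canonical"]["currency"])       # USD
print(record["raw_extraction"]["currency"])  # US Dollar, still intact
```

Because the raw extraction survives unchanged, compliance can diff any canonical field back to either the model or a named reviewer.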
Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist plus starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.