How to Build a Document Extraction Agent Using LangGraph in Python for Fintech
A document extraction agent for fintech takes PDFs, scans, statements, invoices, KYC forms, and loan docs, then turns them into structured data your systems can trust. It matters because manual ops teams are slow, expensive, and inconsistent, while bad extraction creates compliance risk, broken underwriting flows, and audit headaches.
Architecture
A production-grade LangGraph document extraction agent for fintech usually needs these components:
- Document intake node
  - Accepts files from S3, blob storage, or an internal upload service.
  - Normalizes MIME type, page count, and file metadata.
- OCR / text extraction node
  - Handles scanned PDFs and image-based documents.
  - Uses OCR when embedded text is missing or low quality.
- Field extraction node
  - Pulls structured fields like name, account number, invoice total, dates, and entity identifiers.
  - Returns JSON that matches a strict schema.
- Validation and policy node
  - Checks required fields, format rules, checksum-like constraints, and business logic.
  - Flags PII leakage, suspicious values, and incomplete documents.
- Human review escalation node
  - Routes low-confidence or policy-violating cases to an analyst queue.
  - Preserves the original document plus model output for audit.
- Persistence / audit node
  - Writes final results to a database or event stream.
  - Stores model version, prompt version, timestamps, and decision trace.
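The intake node's job can be sketched with the standard library's mimetypes module. This is a minimal illustration, not the real intake service: `normalize_intake`, `DocumentMeta`, and the `SUPPORTED` set are hypothetical names, and production intake should also sniff magic bytes rather than trust file extensions.

```python
import mimetypes
from dataclasses import dataclass

# Hypothetical allow-list; tune to the document types your pipeline handles.
SUPPORTED = {"application/pdf", "image/png", "image/jpeg", "image/tiff"}

@dataclass
class DocumentMeta:
    path: str
    mime_type: str
    supported: bool

def normalize_intake(path: str) -> DocumentMeta:
    # Guess the MIME type from the file name; a real service would also
    # inspect the file's magic bytes before trusting the extension.
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "application/octet-stream"
    return DocumentMeta(path=path, mime_type=mime, supported=mime in SUPPORTED)
```

Rejecting unsupported types at intake keeps every downstream node simpler, because they can assume a known document class.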
Implementation
1) Define the state and graph nodes
Use a TypedDict for the graph state so every step has a clear contract. In fintech workflows, the state should carry the raw text, extracted fields, validation results, and an escalation flag.
```python
from typing import TypedDict
from langgraph.graph import StateGraph, START
import re

class ExtractionState(TypedDict):
    document_path: str
    raw_text: str
    extracted_json: dict
    validation_errors: list[str]
    needs_review: bool
    final_status: str

def load_document(state: ExtractionState) -> ExtractionState:
    # Replace with a real PDF/OCR pipeline.
    with open(state["document_path"], "r", encoding="utf-8") as f:
        text = f.read()
    return {"raw_text": text}

def extract_fields(state: ExtractionState) -> ExtractionState:
    text = state["raw_text"]
    invoice_id = re.search(r"Invoice\s+#?(\w+)", text)
    total = re.search(r"Total[:\s]+\$?([0-9,.]+)", text)
    extracted = {
        "invoice_id": invoice_id.group(1) if invoice_id else None,
        "total": total.group(1) if total else None,
    }
    return {"extracted_json": extracted}

def validate_fields(state: ExtractionState) -> ExtractionState:
    errors = []
    data = state["extracted_json"]
    if not data.get("invoice_id"):
        errors.append("missing_invoice_id")
    if not data.get("total"):
        errors.append("missing_total")
    return {
        "validation_errors": errors,
        "needs_review": len(errors) > 0,
        "final_status": "review" if errors else "approved",
    }

graph = StateGraph(ExtractionState)
graph.add_node("load_document", load_document)
graph.add_node("extract_fields", extract_fields)
graph.add_node("validate_fields", validate_fields)
graph.add_edge(START, "load_document")
graph.add_edge("load_document", "extract_fields")
graph.add_edge("extract_fields", "validate_fields")
app = graph.compile()
```
2) Add routing for review vs auto-approve
LangGraph’s add_conditional_edges() is the right pattern when you need deterministic branching. For fintech documents you should never auto-approve low-confidence output just because the model produced something plausible.
```python
from langgraph.graph import END

def route_after_validation(state: ExtractionState):
    return "review" if state["needs_review"] else "done"

def human_review(state: ExtractionState) -> ExtractionState:
    # Replace with a real queue/task system.
    print("Send to analyst:", state["validation_errors"])
    return {"final_status": "manual_review"}

workflow = StateGraph(ExtractionState)
workflow.add_node("load_document", load_document)
workflow.add_node("extract_fields", extract_fields)
workflow.add_node("validate_fields", validate_fields)
workflow.add_node("human_review", human_review)
workflow.add_edge(START, "load_document")
workflow.add_edge("load_document", "extract_fields")
workflow.add_edge("extract_fields", "validate_fields")
workflow.add_conditional_edges(
    "validate_fields",
    route_after_validation,
    {
        "review": "human_review",
        "done": END,
    },
)
workflow.add_edge("human_review", END)
app = workflow.compile()
```
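When the extractor can report per-field confidence scores (most OCR engines and some LLM setups can), the routing decision should enforce a hard floor. This is a sketch under assumptions: the `confidences` dict is produced by a hypothetical upstream step, and the 0.85 threshold is illustrative, not a recommendation.

```python
# Illustrative threshold; calibrate against your own review data.
CONFIDENCE_FLOOR = 0.85

def route_by_confidence(confidences: dict[str, float]) -> str:
    # Escalate when any field falls below the floor, and when there are
    # no scores at all; never auto-approve just because the model
    # produced something plausible.
    if not confidences:
        return "review"
    if min(confidences.values()) < CONFIDENCE_FLOOR:
        return "review"
    return "done"
```

A function like this would slot into add_conditional_edges() exactly as route_after_validation does above.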
3) Run the graph with a real input payload
The compiled graph exposes invoke(). In production you’d pass a storage reference instead of local paths.
```python
result = app.invoke(
    {
        "document_path": "./sample_invoice.txt",
        "raw_text": "",
        "extracted_json": {},
        "validation_errors": [],
        "needs_review": False,
        "final_status": "",
    }
)
print(result["final_status"])
print(result["extracted_json"])
print(result["validation_errors"])
```
4) Extend with LLM-backed extraction when regex is not enough
Regex works for clean invoices. For KYC forms or bank statements you’ll want an LLM node that returns strict JSON. The key is to keep the model inside a bounded step and validate everything after it.
```python
from typing import Optional

from langchain_openai import ChatOpenAI
from pydantic import BaseModel

class InvoiceSchema(BaseModel):
    invoice_id: Optional[str] = None
    total: Optional[str] = None

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def llm_extract(state: ExtractionState) -> ExtractionState:
    prompt = (
        "Extract invoice_id and total from this document.\n\n"
        f"{state['raw_text']}\n\n"
        "Return only JSON."
    )
    response = llm.invoke(prompt)
    # Pydantic rejects any output that does not match the schema exactly.
    data = InvoiceSchema.model_validate_json(response.content).model_dump()
    return {"extracted_json": data}
```
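One defensive refinement: models sometimes wrap their output in markdown fences even when told to return only JSON, which would make model_validate_json raise. A small pre-parse step that strips fences and fails closed keeps the node bounded. `parse_model_json` is a hypothetical helper, shown as a sketch:

```python
import json
import re
from typing import Optional

def parse_model_json(content: str) -> Optional[dict]:
    # Strip leading/trailing markdown code fences the model may have added,
    # then parse; return None (fail closed) on anything unparseable.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", content.strip())
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None
```

Returning None routes the document to review instead of letting a malformed response crash the node or, worse, slip into the database.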
Production Considerations
- Auditability
  - Store every run with document hash, model version, prompt version, validator version, and reviewer identity.
  - Fintech auditors will ask how a field was derived. Keep the full trace.
- Data residency
  - Route documents by region before they hit any external model endpoint.
  - If customer data must stay in-country, deploy OCR and inference inside your VPC or approved region.
- Guardrails
  - Validate schema completeness before downstream write operations.
  - Reject unsupported document types early.
  - Mask PII in logs; never print raw account numbers or national IDs.
- Monitoring
  - Track extraction accuracy by field type, review rate, OCR failure rate, and latency per doc class.
  - Alert on drift when a vendor changes statement layouts or scan quality drops.
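The PII-masking guardrail can start as simply as redacting long digit runs before anything reaches a log line. `mask_account_number` is an illustrative helper under that assumption, not a complete PII policy; real masking should also cover names, national IDs, and structured identifiers.

```python
import re

def mask_account_number(value: str) -> str:
    # Redact any run of 8+ digits, keeping only the last 4 so logs stay
    # useful for debugging without exposing full account numbers.
    return re.sub(
        r"\d{8,}",
        lambda m: "*" * (len(m.group(0)) - 4) + m.group(0)[-4:],
        value,
    )
```

Apply a helper like this in a single logging wrapper rather than at every call site, so no code path can forget it.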
Common Pitfalls
- Using the LLM as the source of truth
  - Don’t trust raw model output directly.
  - Always validate against a schema and business rules before persisting anything.
- Skipping conditional routing
  - If every doc follows the same path, bad extractions flow straight into finance systems.
  - Use add_conditional_edges() so uncertain cases go to review.
- Ignoring document provenance
  - If you don’t store source path, hash, timestamp, and processing version, audits become painful.
  - Persist lineage with every extracted record.
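The first pitfall suggests a concrete gate: the persistence step should refuse any record that failed validation or has an incomplete schema. A minimal sketch, where `persist_if_valid` and `REQUIRED_FIELDS` are hypothetical names and `sink` stands in for your database or event-stream write:

```python
# Illustrative required-field list for the invoice example above.
REQUIRED_FIELDS = ("invoice_id", "total")

def persist_if_valid(record: dict, validation_errors: list[str], sink: list) -> bool:
    # Refuse the write if validation failed or the schema is incomplete;
    # the model's raw output is never the source of truth.
    if validation_errors:
        return False
    if any(record.get(field) is None for field in REQUIRED_FIELDS):
        return False
    sink.append(record)
    return True
```

Making the gate a separate, trivially testable function also gives auditors one place to see exactly what could and could not reach the database.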
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.