How to Build a Document Extraction Agent Using LangGraph in Python for Wealth Management
A document extraction agent for wealth management takes client PDFs, statements, KYC packs, trust deeds, and onboarding forms, then turns them into structured data you can validate, route, and store. It matters because the business cost is not just manual review time; it is compliance risk, slow account opening, and inconsistent downstream data that breaks suitability checks and reporting.
Architecture
A production-grade agent for this use case needs these components:
- Document ingestion layer
  - Accepts PDFs, scans, images, and email attachments.
  - Normalizes file types before extraction.
- OCR / text extraction stage
  - Pulls text from scanned statements and forms.
  - Preserves page boundaries and source offsets for auditability.
- LLM extraction node
  - Converts raw text into a strict schema such as ClientProfile, BeneficialOwner, or AccountDetails.
  - Uses deterministic prompts and structured outputs.
- Validation and policy layer
  - Checks required fields, date formats, jurisdiction rules, and completeness.
  - Flags missing tax IDs, expired IDs, or suspicious mismatches.
- Human review handoff
  - Routes low-confidence or policy-failing documents to an operations queue.
  - Keeps an audit trail of what the model extracted versus what was approved.
- Persistence and audit store
  - Stores extracted JSON, source document references, model version, timestamps, and reviewer decisions.
  - Supports retention and data residency requirements.
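The strict schemas the extraction node targets can be sketched with plain dataclasses. This is a minimal sketch; the field names and the BeneficialOwner structure are illustrative assumptions, not a fixed standard.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BeneficialOwner:
    # Illustrative fields; real KYC schemas carry far more detail.
    full_name: str
    ownership_pct: float
    tax_id: str


@dataclass(frozen=True)
class ClientProfile:
    client_name: str
    account_number: str
    tax_residency: str
    id_expiry: str  # ISO 8601 date string, validated downstream
    beneficial_owners: tuple[BeneficialOwner, ...] = ()


profile = ClientProfile(
    client_name="Jane Doe",
    account_number="ACCT-12345",
    tax_residency="US",
    id_expiry="2027-08-31",
    beneficial_owners=(BeneficialOwner("Jane Doe", 100.0, "123-45-6789"),),
)
print(asdict(profile))
```

Frozen dataclasses keep extracted records immutable once built, which makes the audit trail easier to reason about; asdict gives you the JSON-ready payload for persistence.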
Implementation
1. Define the state your graph will move through
For wealth management, your state should carry both the raw document content and the structured output. Keep the original text around so compliance teams can trace every extracted field back to source evidence.
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages


class ExtractionState(TypedDict):
    document_id: str
    file_name: str
    raw_text: str
    extracted: dict
    validation_errors: list[str]
    needs_review: bool
    messages: Annotated[list, add_messages]
2. Build the extraction and validation nodes
Use a strict parser pattern. In production you would call an LLM with structured output; here the node shape is what matters. The important part is that each node returns only the fields it owns.
def extract_fields(state: ExtractionState) -> dict:
    text = state["raw_text"]
    # Replace this with an actual LLM call using structured output.
    extracted = {
        "client_name": "Jane Doe",
        "account_number": "ACCT-12345",
        "tax_residency": "US",
        "id_expiry": "2027-08-31",
    }
    return {"extracted": extracted}


def validate_fields(state: ExtractionState) -> dict:
    errors = []
    data = state.get("extracted", {})
    required = ["client_name", "account_number", "tax_residency", "id_expiry"]
    for field in required:
        if not data.get(field):
            errors.append(f"missing:{field}")
    needs_review = len(errors) > 0
    return {
        "validation_errors": errors,
        "needs_review": needs_review,
    }
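Presence checks alone are not enough for this domain; you also want deterministic format checks before anything is persisted. A minimal sketch, assuming an ACCT-NNNNN account-number pattern and a small ISO 3166 whitelist (both illustrative):

```python
import re
from datetime import date

# Illustrative subset; a real system would load the full ISO 3166 list.
ISO2_COUNTRIES = {"US", "GB", "CH", "SG", "DE"}


def check_formats(data: dict) -> list[str]:
    errors = []
    # Assumed pattern for this sketch: "ACCT-" followed by five digits.
    if not re.fullmatch(r"ACCT-\d{5}", data.get("account_number", "")):
        errors.append("format:account_number")
    if data.get("tax_residency") not in ISO2_COUNTRIES:
        errors.append("format:tax_residency")
    # id_expiry must be a parseable ISO date and not already in the past.
    try:
        expiry = date.fromisoformat(data.get("id_expiry", ""))
        if expiry < date.today():
            errors.append("expired:id_expiry")
    except ValueError:
        errors.append("format:id_expiry")
    return errors


print(check_formats({
    "account_number": "ACCT-12345",
    "tax_residency": "US",
    "id_expiry": "2027-08-31",
}))
```

These checks can run inside validate_fields or as a separate node; either way the errors feed the same needs_review flag.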
3. Add routing for human review versus automatic completion
LangGraph’s StateGraph is a good fit because you can branch based on validation results. That gives you a clean separation between straight-through processing and exception handling.
def route_document(state: ExtractionState) -> str:
    if state["needs_review"]:
        return "review"
    return "complete"


def human_review_node(state: ExtractionState) -> dict:
    # In production this would create a task in your ops queue.
    return {
        "messages": [
            ("assistant", f"Document {state['document_id']} routed for manual review.")
        ]
    }


graph = StateGraph(ExtractionState)
graph.add_node("extract_fields", extract_fields)
graph.add_node("validate_fields", validate_fields)
graph.add_node("human_review", human_review_node)

graph.add_edge(START, "extract_fields")
graph.add_edge("extract_fields", "validate_fields")
graph.add_conditional_edges(
    "validate_fields",
    route_document,
    {
        "review": "human_review",
        "complete": END,
    },
)
graph.add_edge("human_review", END)

app = graph.compile()
4. Run the graph with real document input
This is the execution pattern you want in your service layer. The graph receives raw document text from OCR or PDF parsing upstream, then emits validated extraction results that your application can persist.
input_state = {
    "document_id": "doc_001",
    "file_name": "client_onboarding.pdf",
    "raw_text": """
Client Name: Jane Doe
Account Number: ACCT-12345
Tax Residency: US
ID Expiry: 2027-08-31
""",
    "extracted": {},
    "validation_errors": [],
    "needs_review": False,
    "messages": [],
}

result = app.invoke(input_state)
print(result["extracted"])
print(result["validation_errors"])
print(result["needs_review"])
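Once the graph returns, persist an audit record alongside the extracted payload, not just the fields. A minimal sketch; the record shape and the version strings are assumptions for illustration:

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(document_id: str, raw_text: str, result: dict,
                       model_version: str, prompt_version: str) -> dict:
    """Bundle extraction output with the evidence trail compliance needs."""
    return {
        "document_id": document_id,
        # Hash of the source text lets reviewers prove which document
        # produced which fields.
        "source_sha256": hashlib.sha256(raw_text.encode("utf-8")).hexdigest(),
        "extracted": result["extracted"],
        "validation_errors": result["validation_errors"],
        "needs_review": result["needs_review"],
        "model_version": model_version,
        "prompt_version": prompt_version,
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }


record = build_audit_record(
    "doc_001",
    "Client Name: Jane Doe",
    {"extracted": {"client_name": "Jane Doe"},
     "validation_errors": [], "needs_review": False},
    model_version="model-2025-01",   # placeholder version labels
    prompt_version="extract-v3",
)
print(json.dumps(record, indent=2))
```

Storing the hash, versions, and timestamp with every run is what makes reviewer decisions defensible later.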
Production Considerations
- Auditability
  - Persist every run with document_id, model version, prompt version, extracted JSON, validation errors, and reviewer action.
  - Wealth management teams need defensible records for onboarding decisions and regulatory reviews.
- Data residency
  - Keep client documents in-region if your jurisdiction requires it.
  - If you use external model APIs, ensure contract terms cover storage location, retention windows, and subprocessor disclosure.
- Guardrails
  - Reject extraction output that does not match schema constraints.
  - Add deterministic validation for dates, country codes, tax IDs, account numbers, and beneficial ownership thresholds.
- Monitoring
  - Track extraction accuracy by document type: statements vs. KYC forms vs. trust documents.
  - Alert on spikes in manual review rates, because that usually means OCR degradation or prompt drift.
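A manual-review-rate alert can be as simple as a rolling-window check. The window size and threshold below are illustrative defaults, not recommendations:

```python
from collections import deque


class ReviewRateMonitor:
    """Tracks the fraction of recent documents routed to manual review."""

    def __init__(self, window: int = 100, threshold: float = 0.25):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, needs_review: bool) -> None:
        self.outcomes.append(needs_review)

    def review_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def should_alert(self) -> bool:
        # Require a minimally filled window so one bad scan can't page anyone.
        return len(self.outcomes) >= 20 and self.review_rate() > self.threshold


monitor = ReviewRateMonitor(window=50, threshold=0.25)
for _ in range(15):
    monitor.record(False)
for _ in range(10):
    monitor.record(True)
print(monitor.review_rate(), monitor.should_alert())  # 0.4 True
```

Segment one monitor per document type so a spike in trust-deed reviews is not averaged away by clean statements.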
Common Pitfalls
- Treating OCR text as truth
  - Scans are messy. Page headers get duplicated, tables collapse badly, and footers pollute field values.
  - Fix it by preserving source offsets and validating against document-type-specific rules before persistence.
- Skipping structured validation
  - If you let the model emit free-form JSON without checks, bad data will land in CRM or portfolio systems.
  - Fix it with strict schema validation plus routing to human review when required fields are missing or inconsistent.
- Ignoring compliance metadata
  - Many teams store only the extracted fields and lose the evidence trail.
  - Fix it by storing source document hashes, timestamps, reviewer identity, model version, and region of processing alongside the payload.
- Overloading one graph with every document type
  - A trust deed has different rules from a monthly statement or onboarding form.
  - Fix it by using separate graphs or subgraphs per document class so prompts and validators stay tight.
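Keeping one graph per document class can be as simple as a registry keyed by type. The class names and stub handlers below are illustrative; in practice each entry would be a compiled LangGraph app:

```python
from typing import Callable

# Each document class gets its own pipeline (stubbed here as plain
# callables) so prompts and validators stay specific to that class.
def process_statement(text: str) -> dict:
    return {"doc_class": "statement", "fields": {}}


def process_kyc_form(text: str) -> dict:
    return {"doc_class": "kyc_form", "fields": {}}


def process_trust_deed(text: str) -> dict:
    return {"doc_class": "trust_deed", "fields": {}}


GRAPH_REGISTRY: dict[str, Callable[[str], dict]] = {
    "statement": process_statement,
    "kyc_form": process_kyc_form,
    "trust_deed": process_trust_deed,
}


def dispatch(doc_class: str, text: str) -> dict:
    try:
        return GRAPH_REGISTRY[doc_class](text)
    except KeyError:
        # Unknown classes go to manual triage, not a best-guess graph.
        return {"doc_class": doc_class, "fields": {}, "needs_review": True}


print(dispatch("kyc_form", "...")["doc_class"])  # kyc_form
```

The registry also gives you one place to version prompts and validators per document class.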
By Cyprian Aarons, AI Consultant at Topiax.