How to Build a Document Extraction Agent Using CrewAI in Python for Fintech

By Cyprian Aarons · Updated 2026-04-21
Tags: document-extraction, crewai, python, fintech

A document extraction agent for fintech takes messy inputs like bank statements, invoices, KYC forms, trade confirmations, and loan applications, then turns them into structured, validated data your downstream systems can trust. That matters because in fintech, extraction is not just about speed; it’s about compliance, auditability, and reducing manual review without introducing bad data into underwriting, onboarding, reconciliation, or fraud workflows.

Architecture

  • Input ingestion layer

    • Accept PDFs, scanned images, and email attachments.
    • Normalize files before they hit the agent pipeline.
  • Document classification agent

    • Detect document type first: bank statement, payslip, invoice, ID document, etc.
    • Route to the right extraction schema.
  • Extraction agent

    • Pull structured fields from text and OCR output.
    • Return JSON aligned to a strict schema.
  • Validation agent

    • Check field completeness, date formats, totals, currency consistency, and confidence thresholds.
    • Flag anomalies for human review.
  • Audit logger

    • Store raw input hashes, extracted output, model version, timestamps, and reviewer actions.
    • Needed for compliance and traceability.
  • Human-in-the-loop review queue

    • Catch low-confidence or high-risk documents.
    • Keep an override path for regulated workflows.
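The classify-then-route step in this architecture can be sketched as a plain lookup table. This is a minimal sketch rather than anything CrewAI provides: the handler functions, the ROUTES table, and the label strings are hypothetical names chosen for illustration, and the handlers return placeholder dictionaries where a real pipeline would run an extraction crew.

```python
from typing import Callable, Dict


# Hypothetical handlers: a real pipeline would call a type-specific extraction crew here.
def extract_bank_statement(text: str) -> dict:
    return {"document_type": "bank_statement", "status": "extracted"}


def extract_invoice(text: str) -> dict:
    return {"document_type": "invoice", "status": "extracted"}


# Maps the classifier's label to the handler that owns that document type.
ROUTES: Dict[str, Callable[[str], dict]] = {
    "bank_statement": extract_bank_statement,
    "invoice": extract_invoice,
}


def route_document(label: str, text: str) -> dict:
    """Dispatch on the classifier label; unknown labels go to human review."""
    handler = ROUTES.get(label.strip().lower())
    if handler is None:
        return {"document_type": "unknown", "status": "needs_human_review"}
    return handler(text)
```

Sending unrecognized labels to review instead of guessing a schema keeps the override path explicit for regulated workflows.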

Implementation

1) Install dependencies and define your schemas

Use CrewAI for orchestration and Pydantic for strict output validation; both install from PyPI with pip install crewai pydantic. In fintech, do not let the model free-form its way through extraction; force structure.

from pydantic import BaseModel, Field
from typing import Optional
from crewai import Agent, Task, Crew

class BankStatementExtraction(BaseModel):
    account_holder: str = Field(..., description="Name of account holder")
    account_number_last4: str = Field(..., description="Last 4 digits only")
    statement_period_start: str = Field(..., description="YYYY-MM-DD")
    statement_period_end: str = Field(..., description="YYYY-MM-DD")
    opening_balance: float
    closing_balance: float
    currency: str = Field(..., description="ISO currency code")
    total_debits: Optional[float] = None
    total_credits: Optional[float] = None
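If you want the schema itself to reject bad values, one option is a stricter variant that uses typed dates, Decimal amounts, and pattern constraints. The sketch below assumes Pydantic v2; StrictBankStatement and its validator are hypothetical names for illustration, not a replacement the rest of this guide depends on.

```python
from datetime import date
from decimal import Decimal
from typing import Optional

from pydantic import BaseModel, Field, model_validator


class StrictBankStatement(BaseModel):
    """Hypothetical stricter variant of BankStatementExtraction (assumes Pydantic v2)."""

    account_holder: str = Field(..., min_length=1)
    # Exactly four digits, so masked values like '****1234' are rejected.
    account_number_last4: str = Field(..., pattern=r"^\d{4}$")
    # date fields parse ISO 'YYYY-MM-DD' strings and reject impossible dates.
    statement_period_start: date
    statement_period_end: date
    # Decimal avoids binary floating-point rounding on money amounts.
    opening_balance: Decimal
    closing_balance: Decimal
    # Shape check only: three uppercase letters, not a lookup against ISO 4217.
    currency: str = Field(..., pattern=r"^[A-Z]{3}$")
    total_debits: Optional[Decimal] = None
    total_credits: Optional[Decimal] = None

    @model_validator(mode="after")
    def period_is_ordered(self) -> "StrictBankStatement":
        if self.statement_period_end < self.statement_period_start:
            raise ValueError("statement_period_end precedes statement_period_start")
        return self
```

Passing amounts as strings (for example "1200.50") keeps them exact when they are converted to Decimal.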

2) Create specialized agents for classification, extraction, and validation

CrewAI works best when each agent has one job. For fintech documents that means separate responsibilities instead of one giant prompt doing everything.

classify_agent = Agent(
    role="Document Classifier",
    goal="Identify the document type and route it to the correct extraction logic",
    backstory="You classify financial documents with high precision.",
    verbose=True,
)

extract_agent = Agent(
    role="Financial Extraction Specialist",
    goal="Extract structured fields from financial documents into valid JSON",
    backstory="You extract bank statement fields accurately and conservatively.",
    verbose=True,
)

validate_agent = Agent(
    role="Compliance Validator",
    goal="Validate extracted fields against business rules and flag anomalies",
    backstory="You check extracted financial data for consistency and regulatory risk.",
    verbose=True,
)

3) Build tasks with explicit outputs

For production use you want clear task boundaries. The classifier decides the document family; the extractor fills the schema; the validator checks it.

classification_task = Task(
    description=(
        "Read the document text and identify whether it is a bank statement. "
        "Return only the document type.\n\n"
        # {document_text} is a CrewAI placeholder filled from kickoff(inputs=...).
        "Document text:\n{document_text}"
    ),
    expected_output="A short label such as 'bank_statement' or 'unknown'.",
    agent=classify_agent,
)

extraction_task = Task(
    description=(
        "Extract the required bank statement fields from the provided document text. "
        "Return a JSON object matching the BankStatementExtraction schema.\n\n"
        "Document text:\n{document_text}"
    ),
    expected_output="Valid JSON with all required fields populated.",
    # Ask CrewAI to parse this task's output into the Pydantic model.
    output_pydantic=BankStatementExtraction,
    agent=extract_agent,
)

validation_task = Task(
    description=(
        "Check whether opening_balance + credits - debits equals closing_balance within tolerance. "
        "Flag missing values or suspicious inconsistencies."
    ),
    expected_output="A validation report with pass/fail status and reasons.",
    agent=validate_agent,
)
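The balance rule described in the validation task is simple arithmetic, so it is worth running in ordinary code as well as asking an agent to check it. The following is a minimal sketch: check_balance and its default tolerance of 0.01 are hypothetical choices for illustration, and the tolerance should come from your own product policy.

```python
from decimal import Decimal
from typing import List, Optional


def check_balance(
    opening: Decimal,
    closing: Decimal,
    total_credits: Optional[Decimal],
    total_debits: Optional[Decimal],
    tolerance: Decimal = Decimal("0.01"),  # hypothetical default
) -> List[str]:
    """Return a list of issues; an empty list means the statement reconciles."""
    if total_credits is None or total_debits is None:
        return ["missing totals: cannot reconcile balances"]

    # opening_balance + credits - debits should equal closing_balance.
    expected = opening + total_credits - total_debits
    if abs(expected - closing) > tolerance:
        return [f"closing balance mismatch: expected {expected}, got {closing}"]
    return []
```

With the sample statement used later in this guide, 1200.50 + 230.00 - 450.25 equals 980.25, so the check returns no issues.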

4) Run the crew and enforce post-processing

This is where you keep control. CrewAI orchestrates the work; your code enforces schema validity before anything lands in a core system.

def run_extraction(document_text: str):
    crew = Crew(
        agents=[classify_agent, extract_agent, validate_agent],
        tasks=[classification_task, extraction_task, validation_task],
        verbose=True,
    )

    result = crew.kickoff(inputs={"document_text": document_text})

    # In production you'd parse structured outputs from each task separately.
    # Here we treat kickoff output as the final artifact to persist after validation.
    return result


if __name__ == "__main__":
    sample_text = """
    ACME BANK STATEMENT
    Account Holder: Jane Doe
    Account Number: ****1234
    Period: 2024-01-01 to 2024-01-31
    Opening Balance: 1200.50 USD
    Closing Balance: 980.25 USD
    Total Debits: 450.25 USD
    Total Credits: 230.00 USD
    """

    print(run_extraction(sample_text))
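One way to enforce schema validity in your own code is to re-validate the extractor's output before persisting it. The sketch below assumes the extraction output is available to you as a JSON string and that you are on Pydantic v2; MinimalStatement is a trimmed, hypothetical stand-in for BankStatementExtraction, and parse_or_reject is an illustrative helper name.

```python
from typing import Optional, Tuple

from pydantic import BaseModel, ValidationError


class MinimalStatement(BaseModel):
    # Trimmed, hypothetical stand-in for BankStatementExtraction.
    account_holder: str
    opening_balance: float
    closing_balance: float
    currency: str


def parse_or_reject(raw: str) -> Tuple[Optional[MinimalStatement], Optional[str]]:
    """Return (model, None) on success or (None, error_text) on failure."""
    try:
        # model_validate_json rejects both malformed JSON and schema violations.
        return MinimalStatement.model_validate_json(raw), None
    except ValidationError as exc:
        return None, str(exc)
```

A rejected payload can then go to the review queue with the error text attached, instead of being written downstream.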

Production Considerations

  • Keep data residency explicit

    • If you process customer financial documents, use in-region-only stores and model endpoints.
    • Do not send regulated PII across jurisdictions without legal review.
  • Log for auditability

    • Persist source document hash, extracted fields, confidence score proxy if available, timestamp, agent/task versions, and reviewer decisions.
    • This is what compliance teams will ask for after a dispute or model incident.
  • Add guardrails before downstream writes

    • Reject impossible dates, negative balances where not allowed by product policy, mismatched currencies, and truncated identifiers that violate policy.
    • Route low-confidence cases to manual review instead of auto-booking them.
  • Monitor extraction drift

    • Track field-level failure rates by issuer template or document source.
    • New bank statement layouts will break assumptions long before a generic accuracy metric shows it.
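As an illustration of the audit logging point above, the record for each document can be built with the standard library alone. This is a minimal sketch: build_audit_record and its field names are hypothetical, and where the record is stored (database, append-only log, object store) is left to your own infrastructure.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def build_audit_record(
    document_bytes: bytes,
    extracted: dict,
    pipeline_version: str,
    reviewer_decision: Optional[str] = None,
) -> dict:
    """Assemble a JSON-serializable audit entry for one processed document."""
    return {
        # Hash of the raw input, so the source can be verified without storing PII here.
        "source_sha256": hashlib.sha256(document_bytes).hexdigest(),
        "extracted": extracted,
        "pipeline_version": pipeline_version,
        # UTC timestamp in ISO 8601 format.
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "reviewer_decision": reviewer_decision,
    }


if __name__ == "__main__":
    record = build_audit_record(b"%PDF-sample", {"currency": "USD"}, "extractor-v1")
    print(json.dumps(record, indent=2))
```

Hashing the raw bytes rather than the OCR text means the record still matches the original file if the OCR step changes.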

Common Pitfalls

  1. Using one agent for everything

    • A single prompt for classification plus extraction plus validation usually produces sloppy outputs.
    • Split responsibilities into separate agents and tasks so failures are easier to isolate.
  2. Skipping schema enforcement

    • Free-text outputs are not acceptable in fintech workflows.
    • Use Pydantic models or equivalent validation after generation so malformed JSON never reaches production systems.
  3. Ignoring human review thresholds

    • Auto-accepting every extraction is how bad data gets posted into onboarding or reconciliation systems.
    • Define clear escalation rules for missing fields, low-confidence documents, inconsistent totals, or unsupported formats.
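The escalation rules from the third pitfall can be written down as one small function, so the decision to auto-accept is explicit and testable. This is a minimal sketch: review_reasons and the 0.90 threshold are hypothetical, and it assumes your pipeline can supply some confidence value, which not every extraction setup exposes.

```python
from typing import List, Optional

# Hypothetical threshold; tune it per document type and risk appetite.
MIN_CONFIDENCE = 0.90


def review_reasons(
    missing_fields: List[str],
    validation_issues: List[str],
    confidence: Optional[float],
    supported_format: bool = True,
) -> List[str]:
    """Return the reasons a document needs human review; empty means auto-accept."""
    reasons = []
    if not supported_format:
        reasons.append("unsupported format")
    if missing_fields:
        reasons.append("missing fields: " + ", ".join(sorted(missing_fields)))
    if validation_issues:
        reasons.append("inconsistent totals or failed checks")
    # A missing confidence value is treated as low confidence, not as a pass.
    if confidence is None or confidence < MIN_CONFIDENCE:
        reasons.append("low or unknown confidence")
    return reasons
```

Returning the reasons rather than a bare boolean gives reviewers and audit logs the cause of each escalation.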

By Cyprian Aarons, AI Consultant at Topiax.