How to Build a Document Extraction Agent Using CrewAI in Python for Banking

By Cyprian Aarons · Updated 2026-04-21

Tags: document-extraction, crewai, python, banking

A document extraction agent for banking reads PDFs, scans, emails, and uploaded forms, then turns them into structured data your downstream systems can trust. In practice, that means extracting fields like customer name, account number, loan amount, income, and dates while preserving evidence for audit, compliance checks, and human review when confidence is low.

Architecture

Build this agent with a narrow scope. For banking, the goal is not “understand everything,” it is “extract the right fields with traceability.”

  • Document intake layer

    • Accepts PDFs, images, and text attachments from secure storage or an internal queue.
    • Enforces file type checks, size limits, and virus scanning before processing.
  • OCR / text normalization layer

    • Converts scanned documents into clean text.
    • Handles page ordering, rotation correction, and removal of boilerplate noise.
  • Extraction agent

    • Uses CrewAI Agent with a tight role: extract specified fields only.
    • Returns structured JSON aligned to a schema.
  • Validation agent

    • Checks extracted values against banking rules.
    • Flags invalid dates, mismatched totals, missing mandatory fields, or suspicious patterns.
  • Audit trail store

    • Persists raw input references, extracted output, validation results, timestamps, and model version.
    • Needed for compliance review and post-incident reconstruction.
  • Human-in-the-loop fallback

    • Routes low-confidence or high-risk documents to an operations queue.
    • Required for KYC, lending decisions, and exceptions handling.
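
The human-in-the-loop fallback above can be sketched as a simple deterministic routing rule. The threshold, risk-type set, and function names below are illustrative assumptions, not part of CrewAI or any banking framework:

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85                                  # assumption: tune per document type
HIGH_RISK_TYPES = {"loan_application", "identity_document"}  # assumption: per compliance policy

@dataclass
class ExtractionResult:
    document_type: str
    confidence: float

def route(result: ExtractionResult) -> str:
    """Send low-confidence or high-risk documents to a human review queue."""
    if result.confidence < CONFIDENCE_THRESHOLD or result.document_type in HIGH_RISK_TYPES:
        return "human_review_queue"
    return "straight_through_processing"

print(route(ExtractionResult("statement", 0.92)))         # straight_through_processing
print(route(ExtractionResult("loan_application", 0.99)))  # human_review_queue
```

The point of keeping this rule outside the agent is that routing decisions stay auditable and testable, independent of model behavior.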

Implementation

1) Install CrewAI and prepare your schema

Use a simple schema first. Banking extraction gets messy when you try to infer too much in one pass.

pip install crewai crewai-tools pydantic

Define the output shape you want from the agent. Keep it explicit so downstream systems do not guess.

from pydantic import BaseModel, Field
from typing import Optional

class BankDocumentFields(BaseModel):
    customer_name: str = Field(..., description="Full legal name")
    account_number: Optional[str] = Field(None, description="Bank account number")
    document_type: str = Field(..., description="Type of document")
    issue_date: Optional[str] = Field(None, description="ISO date if present")
    total_amount: Optional[float] = Field(None, description="Monetary amount if present")
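
Before wiring the schema into an agent, it is worth confirming how it behaves on good and bad input. A minimal sketch reusing the schema above: in lax mode pydantic coerces numeric strings to floats, and a record missing a mandatory field raises a ValidationError instead of passing silently downstream.

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class BankDocumentFields(BaseModel):
    customer_name: str = Field(..., description="Full legal name")
    account_number: Optional[str] = Field(None, description="Bank account number")
    document_type: str = Field(..., description="Type of document")
    issue_date: Optional[str] = Field(None, description="ISO date if present")
    total_amount: Optional[float] = Field(None, description="Monetary amount if present")

# A well-formed record parses; pydantic coerces the numeric string to a float.
record = BankDocumentFields(
    customer_name="Jane Doe",
    document_type="Statement",
    total_amount="4820.55",
)
print(record.total_amount)  # 4820.55

# A record missing a mandatory field is rejected, not guessed at.
try:
    BankDocumentFields(document_type="Statement")
except ValidationError as exc:
    print("rejected:", len(exc.errors()), "error(s)")
```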

2) Create the CrewAI agent and task

Use one agent for extraction and one for validation. That separation matters because extraction and policy checks are different jobs.

import os
from crewai import Agent, Task, Crew, Process
from crewai.llm import LLM

llm = LLM(
    model="gpt-4o-mini",
    api_key=os.environ["OPENAI_API_KEY"]
)

extractor = Agent(
    role="Bank Document Extractor",
    goal="Extract only the requested banking fields from documents accurately",
    backstory=(
        "You process bank statements, loan forms, and identity documents. "
        "You return strict structured data and never invent missing values."
    ),
    llm=llm,
    verbose=True,
)

validator = Agent(
    role="Banking Validation Analyst",
    goal="Validate extracted fields against banking rules and flag issues",
    backstory=(
        "You review extracted records for completeness, format errors, "
        "and compliance risks."
    ),
    llm=llm,
    verbose=True,
)

extract_task = Task(
    description=(
        "Extract customer_name, account_number, document_type, issue_date, "
        "and total_amount from the following bank document text. "
        "Return only valid JSON matching the schema.\n\n"
        "Document text:\n{document_text}"
    ),
    expected_output="A JSON object with extracted banking fields.",
    agent=extractor,
)

validate_task = Task(
    description=(
        "Review the following extracted JSON for missing mandatory fields, "
        "date formatting issues, and suspicious values. "
        "Return a short validation report.\n\n"
        "Extracted JSON:\n{extracted_json}"
    ),
    expected_output="A validation report with pass/fail status and reasons.",
    agent=validator,
)

3) Run the crew on normalized text

In production you would feed OCR output here. For a first pass you can test with plain text extracted from a PDF pipeline.

document_text = """
ACME BANK STATEMENT
Customer Name: Jane Doe
Account Number: 1234567890
Document Type: Statement
Issue Date: 2025-03-18
Total Amount: 4820.55
"""

crew = Crew(
    agents=[extractor],
    tasks=[extract_task],
    process=Process.sequential,
    verbose=True,
)

result = crew.kickoff(inputs={"document_text": document_text})
print(result)

That pattern gives you an extraction result first. If you want validation to consume the extracted JSON in the same flow, run a second crew call using the first output as input to the validator task.

validation_crew = Crew(
    agents=[validator],
    tasks=[validate_task],
    process=Process.sequential,
)

validation_result = validation_crew.kickoff(inputs={"extracted_json": str(result)})
print(validation_result)

4) Add deterministic post-processing before storage

Do not write raw LLM output directly into core banking workflows. Parse it into your schema first and reject anything malformed.

import json

def normalize_result(raw_output: str) -> dict:
    """Parse raw LLM output and validate it against the schema before storage."""
    data = json.loads(raw_output)        # raises json.JSONDecodeError on malformed output
    parsed = BankDocumentFields(**data)  # raises pydantic ValidationError on schema mismatch
    return parsed.model_dump()

# Example:
# cleaned = normalize_result(str(result))
# store_to_audit_log(cleaned)
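
One practical wrinkle: models sometimes wrap their JSON in markdown code fences, which makes json.loads fail on otherwise valid output. A small hypothetical helper (extract_json_block is our own name, not a library function) that strips fences before parsing:

```python
import json
import re

def extract_json_block(raw_output: str) -> dict:
    """Strip optional markdown code fences the model may add, then parse the JSON."""
    text = raw_output.strip()
    text = re.sub(r"^```(?:json)?\s*", "", text)  # remove leading ``` or ```json fence
    text = re.sub(r"\s*```$", "", text)           # remove trailing ``` fence
    return json.loads(text)

raw = '```json\n{"customer_name": "Jane Doe", "document_type": "Statement"}\n```'
print(extract_json_block(raw)["customer_name"])  # Jane Doe
```

Run this before normalize_result so that a cosmetic fence does not get misclassified as malformed data.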

Production Considerations

  • Data residency

    • Keep OCR text and extracted outputs in-region if your bank has jurisdictional constraints.
    • If your policy forbids external processing of PII outside a region, route requests through approved infrastructure only.
  • Auditability

    • Store document hash, source location, prompt version, model version, timestamps, and final JSON.
    • You need this for regulator questions and internal incident reviews.
  • Guardrails

    • Reject outputs that do not match schema or contain fabricated values.
    • Add hard rules for mandatory fields like customer name or account number when required by workflow.
  • Monitoring

    • Track extraction accuracy by document type.
    • Alert on drift when OCR quality drops or when a new template starts producing more human escalations.
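
The guardrail rules above can live as plain deterministic code in front of storage, so no fabricated or incomplete record reaches a banking workflow. A minimal sketch; the mandatory-field set and the guardrail_check name are assumptions you would adapt per workflow:

```python
MANDATORY_FIELDS = {"customer_name", "document_type"}  # assumption: varies by workflow

def guardrail_check(record: dict) -> list[str]:
    """Return a list of guardrail violations; an empty list means the record may proceed."""
    issues = [
        f"missing mandatory field: {field}"
        for field in sorted(MANDATORY_FIELDS)
        if not record.get(field)
    ]
    amount = record.get("total_amount")
    if amount is not None and amount < 0:
        issues.append("negative total_amount")
    return issues

print(guardrail_check({"customer_name": "Jane Doe"}))
# ['missing mandatory field: document_type']
```

Records with a non-empty issue list go to the human review queue rather than downstream storage.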

Common Pitfalls

  • Trying to extract everything in one prompt

    • This causes noisy outputs and hallucinated fields.
    • Fix it by defining a narrow schema per document type: statements separate from loan applications separate from IDs.
  • Skipping OCR quality checks

    • Bad OCR produces bad extraction even with a strong model.
    • Fix it by pre-validating page count, resolution, skew correction status, and confidence scores before calling CrewAI.
  • Not separating extraction from validation

    • One agent doing both jobs tends to miss policy issues while focusing on field recovery.
    • Fix it by using distinct Agent roles and passing structured output between them.
  • Writing outputs directly into downstream systems

    • That is how bad data gets into KYC queues or lending decisions.
    • Fix it by parsing into Pydantic models first and routing failures to manual review.
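
The OCR quality gate from the second pitfall can also be a few lines of deterministic code run before any CrewAI call. The thresholds below are illustrative assumptions, not recommendations; tune them against your OCR engine and document mix:

```python
MIN_DPI = 200             # assumption: below this, OCR accuracy usually degrades
MIN_OCR_CONFIDENCE = 0.6  # assumption: mean per-word confidence from the OCR engine

def ocr_quality_gate(page_count: int, dpi: int, mean_confidence: float) -> bool:
    """Reject documents whose scans are too poor to extract from reliably."""
    return page_count > 0 and dpi >= MIN_DPI and mean_confidence >= MIN_OCR_CONFIDENCE

print(ocr_quality_gate(3, 300, 0.91))  # True
print(ocr_quality_gate(3, 96, 0.91))   # False
```

Documents that fail the gate should be re-scanned or escalated, never passed to the extraction agent in the hope the model compensates.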

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.
