How to Build a Document Extraction Agent Using CrewAI in Python for Retail Banking

By Cyprian Aarons · Updated 2026-04-21
document-extraction · crewai · python · retail-banking

A document extraction agent for retail banking reads customer documents, pulls out the fields your downstream systems need, and returns structured output you can validate and store. In practice, that means extracting names, account numbers, income, addresses, IDs, and dates from PDFs or images without forcing ops teams to key everything manually.

For retail banking, this matters because document intake is a bottleneck in onboarding, loan applications, disputes, and KYC refreshes. If you do this wrong, you create compliance risk, bad data in core systems, and a slow customer experience.

Architecture

A production-grade CrewAI document extraction agent for retail banking usually needs these components:

  • Document ingestion layer

    • Accept PDFs, scans, and image files from secure storage or an internal upload service.
    • Normalize file paths and metadata before handing them to the agent.
  • Extraction agent

    • Uses an LLM to identify document type and extract structured fields.
    • Should be constrained to a schema so output is predictable.
  • Validation layer

    • Checks required fields, formats, confidence thresholds, and business rules.
    • Flags mismatches like invalid account numbers or expired IDs.
  • Audit logging

    • Stores prompt version, model response, extraction timestamp, and document hash.
    • Critical for bank auditability and dispute resolution.
  • Human review fallback

    • Routes low-confidence or high-risk cases to operations staff.
    • Prevents silent failures on KYC or lending documents.
  • Secure storage and policy controls

    • Encrypt documents at rest and in transit.
    • Enforce residency rules if documents cannot leave a specific region.

Implementation

1. Install CrewAI and define the extraction schema

Keep the schema strict. In banking, “best effort” JSON is not enough because downstream systems need stable keys.

pip install crewai pydantic
from pydantic import BaseModel, Field
from typing import Optional

class BankDocumentExtraction(BaseModel):
    document_type: str = Field(..., description="Type of banking document")
    full_name: Optional[str] = Field(None, description="Customer legal name")
    date_of_birth: Optional[str] = Field(None, description="DOB in YYYY-MM-DD")
    account_number: Optional[str] = Field(None, description="Masked or extracted account number")
    address: Optional[str] = Field(None, description="Customer address")
    issue_date: Optional[str] = Field(None, description="Document issue date")
    expiry_date: Optional[str] = Field(None, description="Document expiry date")
    confidence_notes: Optional[str] = Field(None, description="Any ambiguity or missing data")
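Before anything downstream touches the model's output, parse it against the schema. A minimal sketch using Pydantic v2's `model_validate_json` — the `raw` string stands in for an LLM response, and a trimmed copy of the schema keeps the example self-contained:

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

# Trimmed copy of BankDocumentExtraction to keep the example self-contained.
class BankDocumentExtraction(BaseModel):
    document_type: str
    full_name: Optional[str] = None
    account_number: Optional[str] = None

# Stand-in for a raw LLM response.
raw = '{"document_type": "proof_of_address", "full_name": "Amina Patel", "account_number": null}'

try:
    extraction = BankDocumentExtraction.model_validate_json(raw)
except ValidationError:
    # A schema violation means the model drifted: route to review, never ingest.
    extraction = None
```

Because the schema is strict, a missing optional stays `None` instead of being invented, and a malformed response fails loudly rather than landing in a core system.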

2. Create a CrewAI agent focused on extraction only

Do not let the agent “reason” beyond the document. Its job is extraction and minimal normalization.

from crewai import Agent

document_extractor = Agent(
    role="Retail Banking Document Extractor",
    goal=(
        "Extract structured banking fields from customer documents with high accuracy "
        "and return only validated JSON-like content."
    ),
    backstory=(
        "You work in a retail banking operations team handling onboarding "
        "and KYC documents. You must be precise, conservative, and flag ambiguity."
    ),
    allow_delegation=False,  # extraction only; never hand work to other agents
    verbose=True,
)

3. Define a task that binds the output format

Use Task with a clear expected output. In regulated workflows, explicit instructions reduce drift.

from crewai import Task

extract_task = Task(
    description=(
        "Extract the relevant fields from the following retail banking document. "
        "Identify the document type first. Return only fields that are visible in the document. "
        "If a field is missing or unclear, set it to null and explain why in confidence_notes.\n\n"
        "Document text:\n{document_text}"
    ),
    expected_output=(
        "A structured extraction matching the BankDocumentExtraction schema "
        "with no extra commentary."
    ),
    output_pydantic=BankDocumentExtraction,  # bind the output to the schema
    agent=document_extractor,
)

Note the `{document_text}` placeholder: without it, the document never reaches the prompt. Binding `output_pydantic` to the schema is what actually enforces the output format, rather than relying on prose instructions alone.

4. Run the crew against a document input

CrewAI’s Crew orchestrates the agent-task pair. In a real system you would pass text from OCR or a parsed PDF pipeline into the task context.

from crewai import Crew
import json

def extract_document(document_text: str):
    crew = Crew(
        agents=[document_extractor],
        tasks=[extract_task],
        verbose=True,
    )

    # kickoff interpolates the inputs dict into the task description template
    result = crew.kickoff(inputs={"document_text": document_text})

    return result

if __name__ == "__main__":
    sample_text = """
    First National Bank
    Customer Name: Amina Patel
    Date of Birth: 1991-08-14
    Account Number: 1234567890
    Address: 14 Cedar Road, Johannesburg
    Issue Date: 2024-02-10
    Expiry Date: 2029-02-10
    """

    output = extract_document(sample_text)
    print(output)

A practical production flow is:

  1. OCR or text extraction service converts PDF/image to text.
  2. CrewAI agent extracts candidate fields.
  3. Validation code checks required fields and formats.
  4. Low-confidence cases go to manual review.
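The four steps above can be sketched as plain functions. This is a wiring sketch, not a real implementation: `run_ocr` and `run_extraction` are hypothetical stubs standing in for your OCR service and the CrewAI kickoff call.

```python
# Sketch of the OCR -> extract -> validate -> route flow.
REQUIRED_FIELDS = ("document_type", "full_name", "account_number")

def run_ocr(document_bytes: bytes) -> str:
    # Stub: a real system calls an OCR engine here.
    return document_bytes.decode("utf-8")

def run_extraction(text: str) -> dict:
    # Stub: in production this is crew.kickoff(...) plus schema parsing.
    return {"document_type": "statement", "full_name": "Amina Patel",
            "account_number": None}

def validate(fields: dict) -> list:
    # Return the names of required fields that came back empty.
    return [f for f in REQUIRED_FIELDS if not fields.get(f)]

def process(document_bytes: bytes) -> dict:
    fields = run_extraction(run_ocr(document_bytes))
    missing = validate(fields)
    # Any missing required field sends the case to a human queue.
    fields["status"] = "manual_review" if missing else "auto_accepted"
    return fields

result = process(b"First National Bank ...")
```

The routing decision lives in ordinary code, not in the prompt, so the review threshold stays auditable and testable.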

Production Considerations

  • Compliance controls

    • Never send unredacted sensitive data to a model unless your policy allows it.
    • Mask account numbers where possible and log only what audit requires.
  • Data residency

    • Keep OCR text and LLM processing inside approved regions.
    • If your bank requires local processing, use regional infrastructure and approved model endpoints only.
  • Auditability

    • Persist input document hash, prompt version, model name, task ID, extracted output, and reviewer actions.
    • This gives you traceability for KYC disputes and internal audits.
  • Monitoring

    • Track extraction accuracy by document type.
    • Alert on spikes in null values for critical fields like name or ID number; that usually means OCR quality dropped or prompts regressed.
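One way to combine the masking and auditability points: persist a masked account number plus a SHA-256 hash of the raw document alongside the prompt and model metadata. The field names here are illustrative, not a fixed schema:

```python
import hashlib
from datetime import datetime, timezone

def mask_account_number(value: str, visible: int = 4) -> str:
    # Keep the last few digits for reconciliation; replace the rest.
    return "*" * (len(value) - visible) + value[-visible:]

def audit_record(document_bytes: bytes, prompt_version: str, model_name: str) -> dict:
    # Hash the raw bytes so the record can be tied to the exact document
    # without storing its contents in the audit log.
    return {
        "document_hash": hashlib.sha256(document_bytes).hexdigest(),
        "prompt_version": prompt_version,
        "model_name": model_name,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }

masked = mask_account_number("1234567890")
record = audit_record(b"...", "kyc-extract-v3", "example-model")
```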

Common Pitfalls

  • Letting the agent infer missing data

    • Bad pattern: filling gaps from context that is not actually on the page.
    • Fix: require null for missing values and send uncertain cases to review.
  • Skipping validation after extraction

    • Bad pattern: trusting LLM output directly.
    • Fix: validate dates, ID formats, account number length, and required fields before writing to core systems.
  • Ignoring OCR quality

    • Bad pattern: blaming the model when scans are blurry or skewed.
    • Fix: preprocess images with deskewing/denoising and reject unreadable inputs early.
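The "skipping validation" pitfall is cheap to avoid. A sketch of format checks to run before anything reaches core systems — the rules here (10-digit account numbers, ISO dates) are examples; substitute your institution's actual formats:

```python
import re
from datetime import date

ACCOUNT_RE = re.compile(r"^\d{10}$")  # example rule: exactly 10 digits

def validate_extraction(fields: dict) -> list:
    """Return a list of human-readable validation errors."""
    errors = []
    acct = fields.get("account_number")
    if acct and not ACCOUNT_RE.fullmatch(acct):
        errors.append("account_number: expected 10 digits")
    for key in ("date_of_birth", "issue_date", "expiry_date"):
        value = fields.get(key)
        if value:
            try:
                date.fromisoformat(value)
            except ValueError:
                errors.append(f"{key}: not a valid YYYY-MM-DD date")
    exp = fields.get("expiry_date")
    if exp:
        try:
            if date.fromisoformat(exp) < date.today():
                errors.append("expiry_date: document has expired")
        except ValueError:
            pass  # already reported above
    return errors

errors = validate_extraction({"account_number": "12345",
                              "date_of_birth": "1991-08-14",
                              "expiry_date": "2020-01-01"})
```

An empty list means the record can proceed; anything else joins the manual review queue.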

If you build this as an extraction-and-validation pipeline instead of a single prompt call, you get an LLM toolchain that can survive retail banking requirements without becoming an operational liability.


By Cyprian Aarons, AI Consultant at Topiax.