How to Build a Document Extraction Agent Using CrewAI in Python for Lending

By Cyprian Aarons. Updated 2026-04-21

Tags: document-extraction, crewai, python, lending

A document extraction agent for lending reads borrower files, pulls out the fields your credit workflow needs, and turns messy PDFs into structured data you can score, verify, and route. That matters because lending teams live or die on turnaround time, auditability, and consistency across income statements, bank statements, IDs, tax returns, and supporting docs.

Architecture

For lending, this agent should be built as a small pipeline, not a single prompt.

• Document intake layer
  - Accept PDFs, scans, and image files from an upload service or object store.
  - Normalize file paths and metadata like applicant ID, loan ID, and document type.
• OCR / text extraction tool
  - Extract text from scanned documents before the LLM sees anything.
  - In production, this is usually a separate service or API-backed tool.
• Extraction agent
  - Uses CrewAI Agent to read extracted text and return structured lending fields.
  - Should focus on one job: parse documents into a strict schema.
• Validation step
  - Checks required fields like borrower name, employer name, income amount, statement date range, and account balance.
  - Flags missing or inconsistent values before downstream underwriting.
• Audit trail store
  - Persist raw text, extracted JSON, model version, prompt version, and timestamps.
  - This is non-negotiable for compliance reviews and dispute handling.
• Human review handoff
  - Route low-confidence or high-risk cases to an analyst.
  - Keep the agent out of final decisioning unless your policy explicitly allows it.
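The layers above can be sketched as a chain of plain Python functions. This is a minimal sketch rather than a production design: IntakeRecord, PipelineResult, run_pipeline, and the stub fake_ocr and fake_extract callables are hypothetical names that stand in for a real OCR service and for the CrewAI agent built in the Implementation section.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class IntakeRecord:
    """Metadata captured at intake (hypothetical field names)."""
    applicant_id: str
    loan_id: str
    document_type: str
    file_path: str


@dataclass
class PipelineResult:
    record: IntakeRecord
    raw_text: str
    extracted: Optional[dict]
    issues: list = field(default_factory=list)
    needs_review: bool = False
    processed_at: str = ""


def run_pipeline(
    record: IntakeRecord,
    ocr: Callable[[str], str],
    extract: Callable[[str], dict],
    required_fields: tuple = ("borrower_name", "document_type"),
) -> PipelineResult:
    """Run intake -> OCR -> extraction -> validation, failing closed to review."""
    raw_text = ocr(record.file_path)
    result = PipelineResult(
        record=record,
        raw_text=raw_text,
        extracted=None,
        processed_at=datetime.now(timezone.utc).isoformat(),
    )
    # No usable text means the LLM never runs; the case goes straight to review.
    if not raw_text.strip():
        result.issues.append("empty OCR output")
        result.needs_review = True
        return result

    result.extracted = extract(raw_text)
    for name in required_fields:
        if result.extracted.get(name) in (None, ""):
            result.issues.append(f"missing required field: {name}")
    result.needs_review = bool(result.issues)
    return result


# Stubs standing in for a real OCR service and the extraction agent.
def fake_ocr(path: str) -> str:
    return "PAY STUB\nEmployee: Jordan Smith"


def fake_extract(text: str) -> dict:
    return {"document_type": "pay_stub", "borrower_name": "Jordan Smith"}


outcome = run_pipeline(
    IntakeRecord("app-001", "loan-001", "pay_stub", "/tmp/paystub.pdf"),
    ocr=fake_ocr,
    extract=fake_extract,
)
print(outcome.needs_review, outcome.issues)
```

Passing the OCR and extraction steps in as callables keeps each layer replaceable and testable on its own, which matches the idea of a small pipeline rather than a single prompt.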

Implementation

1) Install CrewAI and define the extraction schema

Install CrewAI with pip install crewai, then start with a strict output shape. Lending workflows break when the model improvises field names.

from pydantic import BaseModel, Field
from typing import Optional

class LendingDocumentExtraction(BaseModel):
    document_type: str = Field(..., description="Type of document such as pay_stub or bank_statement")
    borrower_name: Optional[str] = Field(None, description="Full legal name of borrower")
    employer_name: Optional[str] = Field(None, description="Employer name if present")
    gross_monthly_income: Optional[float] = Field(None, description="Monthly gross income in USD")
    statement_start_date: Optional[str] = Field(None, description="Statement start date in ISO format")
    statement_end_date: Optional[str] = Field(None, description="Statement end date in ISO format")
    ending_balance: Optional[float] = Field(None, description="Ending balance in USD")
    confidence_notes: Optional[str] = Field(None, description="Short explanation of uncertainty or missing data")
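Because the schema keeps dates as plain strings, nothing in it guarantees that the model actually returned ISO dates or a sensible range. A small deterministic check after extraction can cover that. The sketch below uses only the standard library; parse_iso_date and check_statement_dates are hypothetical helper names, not part of CrewAI or Pydantic.

```python
from datetime import date
from typing import Optional


def parse_iso_date(value: Optional[str]) -> Optional[date]:
    """Return a date for an ISO string such as 2026-03-15, or None if missing or malformed."""
    if not value:
        return None
    try:
        return date.fromisoformat(value)
    except ValueError:
        return None


def check_statement_dates(start: Optional[str], end: Optional[str]) -> list:
    """Collect human-readable problems with the extracted statement date range."""
    problems = []
    start_date = parse_iso_date(start)
    end_date = parse_iso_date(end)
    if start and start_date is None:
        problems.append(f"statement_start_date is not ISO formatted: {start!r}")
    if end and end_date is None:
        problems.append(f"statement_end_date is not ISO formatted: {end!r}")
    if start_date and end_date and start_date > end_date:
        problems.append("statement_start_date is after statement_end_date")
    return problems


print(check_statement_dates("2026-02-01", "2026-02-28"))
print(check_statement_dates("03/15/2026", None))
```

An empty list means the dates passed; anything else is a reason to hold the document for review.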

2) Create an extraction agent with a focused role

Use one agent for extraction. Do not mix extraction with underwriting logic.

from crewai import Agent, LLM

# Pin the model and set temperature to 0 so repeated runs on the same document are more consistent.
llm = LLM(model="gpt-4o-mini", temperature=0)

document_extractor = Agent(
    role="Lending Document Extraction Specialist",
    goal="Extract structured lending fields from OCR text with high accuracy",
    backstory=(
        "You process borrower documents for loan operations. "
        "You only extract facts present in the document and never invent values."
    ),
    llm=llm,  # pass the LLM explicitly; otherwise the agent uses CrewAI's default model
    verbose=True,
)

3) Define the task and run it through a Crew

CrewAI’s Task supports structured outputs via output_pydantic. That is the cleanest pattern here.

from crewai import Task, Crew

ocr_text = """
PAY STUB
Employee: Jordan Smith
Employer: Northwind Logistics LLC
Gross Pay This Period: $3,250.00
Gross YTD: $19,500.00
Pay Date: 2026-03-15
"""

task = Task(
    description=(
        "Extract lending-relevant fields from the OCR text below. "
        "Return only facts explicitly present in the text.\n\n"
        f"OCR TEXT:\n{ocr_text}"
    ),
    expected_output="A structured extraction of borrower identity and income details.",
    agent=document_extractor,
    output_pydantic=LendingDocumentExtraction,
)

crew = Crew(
    agents=[document_extractor],
    tasks=[task],
    verbose=True,
)

result = crew.kickoff()
print(result.pydantic.model_dump())

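This gives you a typed payload you can validate before sending it to underwriting rules or LOS integration. Note that the sample pay stub lists gross pay for the period but not the pay frequency, so an agent that follows the instructions should leave gross_monthly_income empty rather than convert the figure itself.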
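Since the task tells the agent to return only facts present in the text, one inexpensive follow-up is to confirm that each extracted value really appears in the OCR output. The sketch below is one possible way to do that with the standard library. find_ungrounded_fields, contains_amount, and CHECKED_FIELDS are hypothetical names, and the matching is deliberately simple (case-insensitive substring for names, two common renderings for dollar amounts), so treat it as a starting point.

```python
import re

# Fields whose values are expected to appear verbatim in the document text.
CHECKED_FIELDS = ("borrower_name", "employer_name", "gross_monthly_income", "ending_balance")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore layout noise."""
    return re.sub(r"\s+", " ", text).strip().lower()


def contains_amount(haystack: str, value: float) -> bool:
    """Look for the amount as 3250.00 or 3,250.00, not as part of a longer number."""
    for rendered in (f"{value:.2f}", f"{value:,.2f}"):
        pattern = r"(?<![\d,.])" + re.escape(rendered) + r"(?!\d)"
        if re.search(pattern, haystack):
            return True
    return False


def find_ungrounded_fields(extracted: dict, ocr_text: str, fields=CHECKED_FIELDS) -> list:
    """Return the names of extracted fields whose values cannot be found in the OCR text."""
    haystack = normalize(ocr_text)
    ungrounded = []
    for name in fields:
        value = extracted.get(name)
        if value is None:
            continue
        if isinstance(value, (int, float)):
            grounded = contains_amount(haystack, float(value))
        else:
            grounded = normalize(str(value)) in haystack
        if not grounded:
            ungrounded.append(name)
    return ungrounded


sample_text = """PAY STUB
Employee: Jordan Smith
Employer: Northwind Logistics LLC
Gross Pay This Period: $3,250.00
Gross YTD: $19,500.00
"""
sample = {
    "borrower_name": "Jordan Smith",
    "employer_name": "Northwind Logistics LLC",
    "gross_monthly_income": 6500.0,  # a value the model computed rather than read
}
print(find_ungrounded_fields(sample, sample_text))  # → ['gross_monthly_income']
```

Any field this returns is a candidate for manual review, because the value was inferred or hallucinated rather than copied from the document.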

4) Add validation and routing for bad extractions

In lending, incomplete extractions should fail closed. If required fields are missing or confidence is low, send the case to review.

def needs_manual_review(extraction: LendingDocumentExtraction) -> bool:
    """Fail closed: True when a required field is missing or the model flagged uncertainty."""
    required_missing = [
        not extraction.document_type.strip(),  # required str, so check for empty rather than None
        extraction.borrower_name is None,
        extraction.gross_monthly_income is None,
    ]
    notes = (extraction.confidence_notes or "").lower()
    return any(required_missing) or "uncertain" in notes

data = result.pydantic

if needs_manual_review(data):
    print("Route to manual review queue")
else:
    print("Send to underwriting workflow")

Production Considerations

• Keep PII inside controlled boundaries
  - Borrower names, SSNs, account numbers, and income data are sensitive.
  - Encrypt at rest and in transit.
  - Mask fields in logs; never dump raw OCR text into application logs.
• Build an audit trail
  - Store input document hash, OCR output hash, prompt version, model name, timestamp, and extracted JSON.
  - You need this for compliance review and post-close disputes.
• Respect data residency
  - If your lending stack is regionalized, pin processing to approved cloud regions.
  - Do not send EU borrower data or regulated customer data across borders without policy approval.
• Add guardrails before decisioning
  - The agent should extract facts only.
  - Credit policy checks belong in deterministic rules or separate scoring services.
  - Use human review for low-confidence extractions and exception cases.
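Two of the points above, log masking and the audit record, are easy to prototype with the standard library. The following is a minimal sketch under stated assumptions: mask_pii, build_audit_record, and the two regular expressions are hypothetical, the patterns only catch SSN-shaped strings and long digit runs, and a real deployment would need a broader PII policy plus an encrypted store for the record itself.

```python
import hashlib
import json
import re
from datetime import datetime, timezone

# Illustrative patterns only: US SSN shape and 8 to 17 digit account-like runs.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
ACCOUNT_PATTERN = re.compile(r"\b\d{8,17}\b")


def mask_pii(text: str) -> str:
    """Mask SSN-shaped strings and long digit runs before text reaches a log line."""
    text = SSN_PATTERN.sub("***-**-****", text)
    return ACCOUNT_PATTERN.sub(
        lambda m: "*" * (len(m.group()) - 4) + m.group()[-4:], text
    )


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_audit_record(
    document_bytes: bytes,
    ocr_text: str,
    extracted: dict,
    model_name: str,
    prompt_version: str,
) -> dict:
    """Assemble the fields listed under 'Build an audit trail' into one record."""
    return {
        "document_sha256": sha256_hex(document_bytes),
        "ocr_sha256": sha256_hex(ocr_text.encode("utf-8")),
        "model_name": model_name,
        "prompt_version": prompt_version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Contains PII: keep this record in an encrypted store, not in logs.
        "extracted_json": json.dumps(extracted, sort_keys=True),
    }


print(mask_pii("SSN 123-45-6789 acct 000123456789"))
record = build_audit_record(
    b"%PDF-sample", "PAY STUB", {"borrower_name": "Jordan Smith"}, "gpt-4o-mini", "v1"
)
print(sorted(record))
```

Hashing the document and the OCR output lets a reviewer prove later which inputs produced a given extraction without copying the raw files into the audit table.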

Common Pitfalls

• Letting the model infer missing values
  - Bad pattern: asking it to “fill in likely income” when the pay stub is unclear.
  - Fix: instruct it to extract only explicit facts and mark uncertainty.
• Skipping schema validation
  - If you accept free-form JSON from the model, downstream systems will break on field drift.
  - Fix: use output_pydantic plus your own validation layer before persistence.
• Mixing extraction with underwriting decisions
  - A document parser should not decide eligibility.
  - Fix: keep extraction isolated from policy engines so compliance teams can audit each step independently.
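Field drift is the easiest of these pitfalls to catch mechanically. Pydantic ignores unknown keys by default, so a renamed field can silently turn into a missing value. The sketch below shows one stdlib-only way to reject unexpected keys before persistence. parse_strict and EXPECTED_FIELDS are hypothetical names; if you stay inside Pydantic, setting extra="forbid" in the model config is another option.

```python
import json

# Field names taken from the LendingDocumentExtraction schema in step 1.
EXPECTED_FIELDS = {
    "document_type",
    "borrower_name",
    "employer_name",
    "gross_monthly_income",
    "statement_start_date",
    "statement_end_date",
    "ending_balance",
    "confidence_notes",
}


def parse_strict(raw_json: str) -> dict:
    """Parse model output and fail loudly on unknown keys instead of dropping them."""
    payload = json.loads(raw_json)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    unexpected = set(payload) - EXPECTED_FIELDS
    if unexpected:
        raise ValueError(f"unexpected fields: {sorted(unexpected)}")
    # Fill absent fields with None so downstream code sees a stable shape.
    return {name: payload.get(name) for name in sorted(EXPECTED_FIELDS)}


clean = parse_strict('{"document_type": "pay_stub", "borrower_name": "Jordan Smith"}')
print(clean["document_type"], clean["gross_monthly_income"])

try:
    parse_strict('{"document_type": "pay_stub", "monthly_income": 3250}')
except ValueError as exc:
    print(exc)
```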

By Cyprian Aarons, AI Consultant at Topiax.
