How to Build a Document Extraction Agent Using CrewAI in Python for Healthcare
A document extraction agent for healthcare reads unstructured clinical documents — referrals, discharge summaries, lab reports, prior auth forms — and turns them into structured fields you can route into downstream systems. The point is not just automation; it’s reducing manual chart review, improving turnaround time, and keeping PHI handling controlled enough to satisfy compliance and audit requirements.
Architecture
A production healthcare extraction agent usually needs these components:
- **Document intake**
  - Receives PDFs, scans, or text from an internal upload service or secure storage bucket.
  - Enforces file type checks and size limits before any LLM call.
- **Text extraction layer**
  - Uses OCR for scanned documents and plain text parsing for digital PDFs.
  - Normalizes page order, headers, footers, and line breaks.
- **Extraction agent**
  - A `crewai.Agent` with a narrow role: extract specific clinical fields into a fixed schema.
  - Should not summarize or “interpret” beyond the requested fields.
- **Validation step**
  - A second agent or deterministic validator checks missing fields, malformed dates, inconsistent units, and confidence issues.
  - This is where you catch hallucinated diagnoses or impossible values.
- **Output writer**
  - Persists structured JSON to a database or FHIR-facing service.
  - Stores provenance: source document ID, page number, extracted value, and timestamp.
- **Audit and policy layer**
  - Logs every prompt, model response, and final payload.
  - Enforces data residency, retention policy, and access controls for PHI.
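The intake checks in the first component can be implemented deterministically before anything touches a model. A minimal sketch — the allowed extensions and size cap are illustrative assumptions, not requirements from this architecture:

```python
import os

# Illustrative intake guard. Adjust the allowlist and cap to your org's policy.
ALLOWED_EXTENSIONS = {".pdf", ".txt", ".tiff"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB cap before any LLM call


def validate_intake(path: str) -> None:
    """Reject unsupported file types and oversized uploads before extraction."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("File exceeds size limit")
```

Running this at the upload boundary keeps malformed or hostile files out of the OCR and agent layers entirely.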
Implementation
1) Install CrewAI and prepare your environment
Use a model provider that your organization has approved for PHI handling. In healthcare, the technical pattern matters less than the deployment boundary: private networking, encryption at rest, and no uncontrolled prompt logging.
```bash
pip install crewai crewai-tools python-dotenv pydantic
```
Set your API key in environment variables:
```bash
export OPENAI_API_KEY="your_key"
```
2) Define the extraction schema and tools
Keep the output schema strict. For healthcare workflows, free-form JSON is a liability unless you validate it immediately.
```python
from pydantic import BaseModel, Field
from crewai import Agent, Task, Crew, Process
from crewai_tools import tool


class ClinicalExtraction(BaseModel):
    patient_name: str = Field(..., description="Full patient name")
    dob: str = Field(..., description="Date of birth in YYYY-MM-DD")
    encounter_date: str = Field(..., description="Encounter date in YYYY-MM-DD")
    diagnosis: str = Field(..., description="Primary diagnosis")
    medications: list[str] = Field(default_factory=list)
    allergies: list[str] = Field(default_factory=list)


@tool("extract_text_from_document")
def extract_text_from_document(document_text: str) -> str:
    """Return normalized text from an already-OCR'd clinical document."""
    return " ".join(document_text.split())
```
3) Build the agent and task with CrewAI
The key pattern is one focused agent plus one task with explicit output constraints. Don’t ask the model to do classification, summarization, coding, and extraction in one pass.
```python
extractor = Agent(
    role="Healthcare Document Extraction Specialist",
    goal="Extract clinically relevant fields from medical documents into a strict schema",
    backstory=(
        "You process healthcare documents for downstream clinical operations. "
        "You must preserve factual accuracy and avoid inventing missing data."
    ),
    tools=[extract_text_from_document],
    verbose=True,
)

extraction_task = Task(
    description=(
        "Extract the required fields from the provided clinical document text. "
        "Use only information explicitly present in the source. "
        "If a field is missing, return an empty string or empty list.\n\n"
        # The {document_text} placeholder is filled from crew.kickoff(inputs=...).
        "Document text:\n{document_text}"
    ),
    # expected_output must be a string; embed the schema so the model sees it.
    expected_output=(
        "A single JSON object conforming to this schema: "
        + str(ClinicalExtraction.model_json_schema())
    ),
    agent=extractor,
)
```
4) Run the crew and validate output before persistence
In production you should validate the response before storing it anywhere. That means type checks plus domain checks like valid date formats and non-empty patient identifiers.
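Those domain checks can live in a plain function that runs before anything is written to storage. A sketch — the rules here (ISO dates plus a non-empty patient name) follow the schema above, but the exact rule set is an assumption you should extend:

```python
from datetime import datetime


def validate_domain_rules(record: dict) -> list[str]:
    """Return a list of domain-rule violations; an empty list means the record passes."""
    errors = []
    # Date fields must be real calendar dates in YYYY-MM-DD form.
    for field in ("dob", "encounter_date"):
        try:
            datetime.strptime(record.get(field, ""), "%Y-%m-%d")
        except ValueError:
            errors.append(f"{field}: not a valid YYYY-MM-DD date")
    # Required identifiers must be non-empty.
    if not record.get("patient_name", "").strip():
        errors.append("patient_name: required identifier is empty")
    return errors
```

Anything with a non-empty error list goes to a retry queue or human review, never to the database.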
```python
def run_extraction(document_text: str):
    crew = Crew(
        agents=[extractor],
        tasks=[extraction_task],
        process=Process.sequential,
        verbose=True,
    )
    result = crew.kickoff(inputs={"document_text": document_text})
    # CrewAI returns task output; normalize to string, then validate the
    # upstream contract before persisting.
    raw_output = str(result)
    return raw_output
```
```python
sample_doc = """
Patient Name: Jane Doe
DOB: 1982-04-11
Encounter Date: 2025-02-18
Diagnosis: Type 2 diabetes mellitus
Medications: Metformin 500 mg BID; Lisinopril 10 mg daily
Allergies: Penicillin
"""

print(run_extraction(sample_doc))
```
If you want stronger control over structured output, wrap the crew response with your own parser using `ClinicalExtraction.model_validate_json(...)` after instructing the agent to emit JSON only. That gives you deterministic failure modes instead of silent drift.
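A minimal sketch of that parser. The model is redefined locally so the snippet stands alone, and the raise-to-review error policy is an assumption:

```python
from pydantic import BaseModel, Field, ValidationError


class ClinicalExtraction(BaseModel):
    patient_name: str
    dob: str
    encounter_date: str
    diagnosis: str
    medications: list[str] = Field(default_factory=list)
    allergies: list[str] = Field(default_factory=list)


def parse_extraction(raw_output: str) -> ClinicalExtraction:
    """Fail loudly on malformed model output instead of silently persisting drift."""
    try:
        return ClinicalExtraction.model_validate_json(raw_output)
    except ValidationError as exc:
        # Route to a retry queue or human review rather than the database.
        raise ValueError(f"Extraction failed schema validation: {exc}") from exc
```

The `ValueError` becomes your deterministic failure mode: callers either get a validated object or an exception they must handle explicitly.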
Production Considerations
- **Compliance boundaries**
  - Treat every document as PHI by default.
  - Use approved model endpoints only if they meet HIPAA obligations under your org’s legal review.
  - Log metadata separately from content when possible.
- **Data residency**
  - Keep inference inside your required region.
  - If documents must stay in-country or on-premises, deploy CrewAI behind internal services that call region-locked models or local inference endpoints.
- **Auditability**
  - Persist document ID, prompt version, model name, response hash, and extraction timestamp.
  - For regulated workflows like prior auth or claims support, you need traceability back to source text.
- **Guardrails**
  - Reject outputs with invalid dates, unsupported codes, or invented medications.
  - Add human review for low-confidence extractions or high-risk fields like diagnosis codes and allergy lists.
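The auditability point can be made concrete with a small helper. The record fields mirror the list above; the function name and exact shape are illustrative:

```python
import hashlib
from datetime import datetime, timezone


def build_audit_record(document_id: str, prompt_version: str,
                       model_name: str, response_text: str) -> dict:
    """Hash the model response so content can be verified later without logging PHI."""
    return {
        "document_id": document_id,
        "prompt_version": prompt_version,
        "model_name": model_name,
        "response_sha256": hashlib.sha256(response_text.encode("utf-8")).hexdigest(),
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }
```

Storing the hash instead of the response body keeps the audit log useful for tamper checks while the PHI itself stays in controlled storage.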
Common Pitfalls
- •
Using one prompt for everything
- •Don’t ask the agent to extract fields while also summarizing the chart.
- •Split tasks by function so failures are easier to detect and test.
- •
Skipping validation
- •Never trust raw LLM output in a healthcare pipeline.
- •Validate against a Pydantic model and add domain rules for dates, units, ICD-style values, and required identifiers.
- •
Ignoring PHI handling
- •Don’t send documents to unmanaged services or log full payloads into general application logs.
- •Redact where possible, encrypt everywhere else، and keep an audit trail that security teams can inspect later.
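A crude illustration of log-side redaction — these regex patterns are assumptions for demonstration only; real PHI de-identification needs a vetted tool or service, not regexes alone:

```python
import re

# Illustrative patterns only: real de-identification requires a vetted tool.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact_for_logs(text: str) -> str:
    """Mask obvious identifiers before text reaches general application logs."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Use something like this only at the logging boundary; the full, unredacted payload should exist solely in encrypted, access-controlled storage.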
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.