How to Build a Document Extraction Agent Using AutoGen in Python for Healthcare
A document extraction agent for healthcare reads clinical PDFs, faxes, scanned forms, and discharge summaries, then turns them into structured data you can route into EHR workflows, claims systems, or prior auth pipelines. It matters because healthcare teams still waste time rekeying high-volume documents, and every manual handoff adds delay, cost, and error risk.
Architecture
- Document intake layer
  - Accepts PDFs, images, or text payloads from secure storage or an internal queue.
  - Normalizes file paths, MIME types, and metadata like patient ID and source system.
- OCR / text extraction tool
  - Converts scanned pages into text before the LLM sees anything.
  - For production, this is usually a separate service like Azure Document Intelligence, AWS Textract, or Tesseract behind an internal API.
- AutoGen assistant agent
  - Uses `AssistantAgent` to extract fields into a strict schema.
  - The prompt should define the target document type and output contract.
- Validation / review agent
  - Uses a second `AssistantAgent` or a `UserProxyAgent` to check for missing fields, malformed JSON, and unsafe assumptions.
  - This is where you enforce “extract only what is present in the document.”
- Secure execution wrapper
  - Uses `UserProxyAgent` with controlled tool execution.
  - Keeps file access, OCR calls, and schema validation inside your trust boundary.
- Audit and persistence layer
  - Stores extracted JSON plus source document hash, timestamps, model version, and reviewer outcome.
  - Required for traceability in regulated workflows.
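The intake layer can be sketched as a small normalized record; the field names here are illustrative, not an AutoGen API, and the hash computed at the boundary feeds the audit layer:

```python
import hashlib
from dataclasses import dataclass


# Hypothetical intake payload: one document plus the routing metadata
# the rest of the pipeline needs. Field names are assumptions.
@dataclass(frozen=True)
class IntakeRecord:
    doc_bytes: bytes
    mime_type: str       # e.g. "application/pdf" or "image/tiff"
    patient_id: str      # source-system patient identifier
    source_system: str   # e.g. "fax-gateway" or "ehr-export"

    @property
    def doc_hash(self) -> str:
        # Hash the raw bytes at intake so every downstream record can be
        # traced back to the exact source document.
        return hashlib.sha256(self.doc_bytes).hexdigest()
```

Hashing before any transformation matters: OCR output changes between engine versions, but the source bytes do not.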
Implementation
1) Install AutoGen and define the extraction contract
Use AutoGen’s Python package and keep the output format tight. For healthcare documents, do not ask for free-form summaries when you need structured fields.
```shell
pip install pyautogen pydantic
```

```python
from typing import Optional

from pydantic import BaseModel, Field


class DischargeSummary(BaseModel):
    patient_name: Optional[str] = Field(default=None)
    dob: Optional[str] = Field(default=None)
    mrn: Optional[str] = Field(default=None)
    admission_date: Optional[str] = Field(default=None)
    discharge_date: Optional[str] = Field(default=None)
    diagnosis: Optional[str] = Field(default=None)
    medications: list[str] = Field(default_factory=list)
```
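Pydantic gives you the rejection behavior for free. A standalone sketch, assuming the pydantic v2 API (`model_validate_json`); the same pattern applies unchanged to `DischargeSummary`:

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


# Cut-down stand-in for the full schema, just to show the behavior.
class PatientStub(BaseModel):
    mrn: Optional[str] = None
    medications: list[str] = Field(default_factory=list)


record = PatientStub.model_validate_json('{"mrn": "123456"}')
# Missing fields fall back to their declared defaults instead of failing.

try:
    PatientStub.model_validate_json('{"medications": "not a list"}')
except ValidationError:
    # Malformed model output is rejected here, before persistence.
    pass
```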
2) Build the AutoGen agents
The extraction agent should produce JSON only. The reviewer agent checks completeness and flags uncertainty instead of inventing values.
```python
import os

from autogen import AssistantAgent, UserProxyAgent

llm_config = {
    "model": "gpt-4o-mini",
    "api_key": os.environ["OPENAI_API_KEY"],
}

extractor = AssistantAgent(
    name="extractor",
    llm_config=llm_config,
    system_message=(
        "You extract data from healthcare documents. "
        "Return only valid JSON matching the requested schema. "
        "If a field is missing or unclear, use null. "
        "Do not infer values."
    ),
)

reviewer = AssistantAgent(
    name="reviewer",
    llm_config=llm_config,
    system_message=(
        "You validate extracted healthcare data. "
        "Check for missing required fields, invalid dates, and unsupported claims. "
        "Return a concise JSON object with issues_found and approved."
    ),
)

user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    # This proxy only relays messages; disable local code execution.
    code_execution_config=False,
)
```
3) Run extraction on a document text payload
In production you would pass OCR output here. This example uses plain text so the AutoGen pattern is clear.
```python
document_text = """
DISCHARGE SUMMARY
Patient Name: Jane Doe
DOB: 1984-07-12
MRN: 123456
Admission Date: 2025-01-10
Discharge Date: 2025-01-14
Diagnosis: Community acquired pneumonia
Medications:
- Amoxicillin 500mg TID x7 days
"""

extraction_prompt = f"""
Extract this discharge summary into JSON with keys:
patient_name, dob, mrn, admission_date, discharge_date,
diagnosis, medications

Document:
{document_text}
"""

result = user.initiate_chat(
    extractor,
    message=extraction_prompt,
)

# The last message in the chat history holds the extractor's JSON.
raw_output = result.chat_history[-1]["content"]
print(raw_output)
```
That initiate_chat() call is the core AutoGen pattern. The UserProxyAgent drives the conversation while the AssistantAgent performs the extraction.
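Models sometimes wrap JSON in markdown fences even when told not to, so it helps to parse defensively before schema validation. This helper is an assumption about model behavior, not part of AutoGen:

```python
import json
import re


def parse_model_json(raw: str) -> dict:
    """Strip an optional markdown code fence, then parse strictly."""
    cleaned = raw.strip()
    # Tolerate a ```json ... ``` wrapper, but nothing else.
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", cleaned, re.DOTALL)
    if fence:
        cleaned = fence.group(1)
    return json.loads(cleaned)  # raises ValueError on non-JSON output


fields = parse_model_json('```json\n{"mrn": "123456"}\n```')
```

Anything `json.loads` rejects should go to the review queue, not to retries with a looser parser.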
4) Add a validation pass before persistence
Do not write model output directly to your database. Validate structure first and reject anything that does not match policy.
```python
review_prompt = f"""
Review this extracted JSON for accuracy and completeness:

{raw_output}

Return JSON with:
approved: true/false
issues_found: [list of strings]
"""

review_result = user.initiate_chat(
    reviewer,
    message=review_prompt,
)

review_output = review_result.chat_history[-1]["content"]
print(review_output)
```
A practical pattern is:
| Step | Responsibility | Output |
|---|---|---|
| OCR | Convert image/PDF to text | Plain text |
| Extraction | Map text to schema | JSON |
| Review | Check validity and policy | Approval flag + issues |
| Persist | Store approved record | Audit-ready payload |
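The persist step in that table can be sketched as a small gate; the field names are illustrative, and the reviewer check assumes the `approved`/`issues_found` shape defined earlier:

```python
import hashlib
from datetime import datetime, timezone


def build_audit_record(document_text: str, extracted: dict,
                       review: dict, model_version: str) -> dict:
    """Assemble an audit-ready payload; only approved extractions pass."""
    if not review.get("approved"):
        # Rejected extractions never reach the database.
        raise ValueError(f"Rejected by reviewer: {review.get('issues_found')}")
    return {
        "source_hash": hashlib.sha256(document_text.encode()).hexdigest(),
        "model_version": model_version,
        "extracted": extracted,
        "review": review,
        "stored_at": datetime.now(timezone.utc).isoformat(),
    }
```

Storing the model version alongside the hash lets you re-run only the documents extracted by a model you later found to be faulty.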
Production Considerations
- Compliance controls
  - Treat every document as PHI until proven otherwise.
  - Encrypt at rest and in transit, restrict access by role, and log every read/write event for audit trails.
- Data residency
  - Keep OCR outputs and LLM calls inside approved regions if your contracts require it.
  - If you use a hosted model endpoint, verify region pinning and subprocessor terms before sending PHI.
- Monitoring
  - Track extraction accuracy by document type: discharge summaries behave differently from lab reports or referral letters.
  - Log schema failures, null-field rates, reviewer rejections, latency per page, and token usage.
- Guardrails
  - Force JSON-only outputs with strict schemas.
  - Reject inferred values for allergies, diagnoses, medication doses, or dates unless explicitly present in the source document.
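One way to enforce that last guardrail is a verbatim-grounding check before approval. This is a deliberately simple substring check, a sketch rather than a production matcher; real systems need normalization for dates, casing, and OCR artifacts:

```python
def grounded_in_source(extracted: dict, source_text: str,
                       critical_fields: set[str]) -> list[str]:
    """Flag critical fields whose value does not appear verbatim in the source."""
    issues = []
    lowered = source_text.lower()
    for field in sorted(critical_fields):  # sorted for stable output
        value = extracted.get(field)
        if value is not None and str(value).lower() not in lowered:
            issues.append(f"{field} not found verbatim in source")
    return issues
```

Any flagged field goes to human review; the agent never gets to argue its way past this check.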
Common Pitfalls
- Letting the model “fill in” missing clinical facts
  - This creates silent errors that are hard to detect downstream.
  - Avoid it by instructing the extractor to return `null` for unknown fields and validating against source text.
- Skipping OCR quality checks
  - Bad scans produce garbage extractions even with a strong prompt.
  - Avoid it by measuring OCR confidence and routing low-quality pages to manual review before LLM processing.
- Storing raw PHI without an audit trail
  - In healthcare this becomes a compliance problem fast.
  - Avoid it by persisting document hash, model version, timestamps, reviewer status, and access logs alongside the extracted record.
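For the OCR pitfall, the routing step might look like this sketch. The `confidence` field and the 0.85 threshold are assumptions for illustration; Textract and Document Intelligence both expose per-page confidence scores you can plug in:

```python
# Illustrative starting point, not a standard; tune per document type.
OCR_CONFIDENCE_THRESHOLD = 0.85


def route_pages(pages: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split OCR pages into LLM-ready and manual-review queues."""
    auto, manual = [], []
    for page in pages:
        if page["confidence"] >= OCR_CONFIDENCE_THRESHOLD:
            auto.append(page)
        else:
            manual.append(page)
    return auto, manual
```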
If you keep the agent narrow—extract first, validate second—you get something usable in real healthcare workflows instead of a demo that looks good on clean PDFs only.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.