How to Build a Document Extraction Agent Using LlamaIndex in Python for Healthcare
A document extraction agent for healthcare takes unstructured clinical PDFs, referrals, discharge summaries, and prior authorization forms, and turns them into structured data you can route into downstream systems. That matters because healthcare workflows still depend on humans reading documents line by line, which creates delays, missed fields, and inconsistent data entry.
Architecture
- Document ingestion layer
  - Pulls PDFs, scans, and text files from S3, SharePoint, EMR exports, or an internal document store.
  - Normalizes file paths and metadata like patient ID, encounter ID, source system, and retention policy (see the metadata sketch after this list).
- Text extraction and chunking
  - Uses LlamaIndex readers to load documents.
  - Splits long documents into chunks that preserve section boundaries like “Assessment,” “Plan,” or “Medication List.”
- Extraction schema
  - Defines the fields you need: patient name, DOB, diagnosis codes, medications, allergies, provider names, dates.
  - Keeps output constrained so the model returns structured JSON instead of free text.
- LLM extraction pipeline
  - Sends chunks to an LLM through LlamaIndex.
  - Uses a query engine or extractor pattern to map raw text into your schema.
- Validation and audit layer
  - Validates required fields, formats dates, checks ICD-10 or medication normalization rules.
  - Stores source chunk references so every extracted field is traceable back to the original document.
- Secure persistence
  - Writes results to a database or queue with encryption at rest.
  - Enforces residency rules by keeping PHI in the approved region.
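To make the ingestion metadata concrete, here is a minimal sketch of the envelope each document might carry. The field names are illustrative assumptions, not a fixed LlamaIndex schema; adapt them to your document store and compliance policy.

```python
from dataclasses import dataclass, asdict

@dataclass
class DocumentEnvelope:
    """Illustrative metadata attached to each ingested document.

    Field names here are assumptions for this example, not a
    required schema.
    """
    file_path: str
    patient_id: str        # pseudonymized identifier where possible
    encounter_id: str
    source_system: str     # e.g. "EMR-export", "SharePoint"
    retention_policy: str  # e.g. "7y-clinical"

envelope = DocumentEnvelope(
    file_path="./healthcare_docs/referral_001.pdf",
    patient_id="PAT-12345",
    encounter_id="ENC-98765",
    source_system="EMR-export",
    retention_policy="7y-clinical",
)

# LlamaIndex Document.metadata is a plain dict, so the envelope
# can be attached via asdict(envelope) when building Documents.
print(asdict(envelope))
```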
Implementation
- Install dependencies and load documents

Use SimpleDirectoryReader for local files during development (for these examples, `pip install llama-index` pulls in the OpenAI integrations used below). In production, swap this out for a custom loader that pulls from your document store and attaches healthcare metadata, as sketched after the code below.

```python
from llama_index.core import SimpleDirectoryReader

# Load every supported file under ./healthcare_docs, including subfolders.
# filename_as_id=True gives each document a stable ID for traceability.
docs = SimpleDirectoryReader(
    input_dir="./healthcare_docs",
    recursive=True,
    filename_as_id=True,
).load_data()

print(f"Loaded {len(docs)} documents")
print(docs[0].metadata)
```
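A minimal sketch of what that production swap could look like. The `records` input and its keys are hypothetical stand-ins for whatever your document store returns; only the `Document` constructor is standard LlamaIndex API.

```python
from llama_index.core import Document

def load_from_store(records) -> list[Document]:
    """Build LlamaIndex Documents from a hypothetical document store.

    `records` is assumed to be an iterable of dicts with `text` plus
    the metadata fields your compliance policy requires.
    """
    return [
        Document(
            text=rec["text"],
            metadata={
                "patient_id": rec["patient_id"],
                "encounter_id": rec["encounter_id"],
                "source_system": rec["source_system"],
                "retention_policy": rec["retention_policy"],
            },
        )
        for rec in records
    ]
```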
- Build an index over the documents

For extraction workflows, VectorStoreIndex is a practical default because it lets you retrieve the most relevant chunks before asking the model to extract fields. That keeps prompts smaller and reduces noise from long clinical notes.

```python
from llama_index.core import VectorStoreIndex

# Embeds and indexes the documents. By default LlamaIndex uses an
# OpenAI embedding model, so OPENAI_API_KEY must be set unless you
# configure a different embedding model.
index = VectorStoreIndex.from_documents(docs)

# Retrieve the 3 most relevant chunks per query before extraction.
query_engine = index.as_query_engine(similarity_top_k=3)
```
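Before wiring retrieval into extraction, it can help to inspect which chunks the index actually returns for a field-specific query. This uses the standard retriever interface; the query string is just an example.

```python
# Pull back the raw chunks (with scores and metadata) that would
# ground an extraction of medication fields.
retriever = index.as_retriever(similarity_top_k=3)
results = retriever.retrieve("current medication list and dosages")

for result in results:
    # result.node.metadata carries the source info needed for audit trails.
    print(result.score, result.node.metadata.get("file_name"))
```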
- Define the extraction target and query it

In healthcare you want explicit field names. Ask for structured output as JSON so your downstream validation code can reject incomplete or malformed records; a validation sketch follows the code below.

```python
from pydantic import BaseModel, Field
from typing import List

class ClinicalExtraction(BaseModel):
    patient_name: str = Field(description="Full patient name")
    date_of_birth: str = Field(description="Date of birth in YYYY-MM-DD")
    diagnosis: List[str] = Field(description="List of diagnoses mentioned")
    medications: List[str] = Field(description="List of medications mentioned")
    allergies: List[str] = Field(description="List of allergies mentioned")

prompt = """
Extract the following fields from the document context:
- patient_name
- date_of_birth
- diagnosis
- medications
- allergies
Return only valid JSON matching the schema.
"""

response = query_engine.query(prompt)
print(response.response)
```
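A minimal sketch of that validation step, assuming the model returned a bare JSON object in `response.response` (in practice you may need to strip code fences or retry on parse failures):

```python
from pydantic import ValidationError

raw = response.response.strip()

try:
    # Pydantic v2: parses the JSON string and validates it against the schema.
    record = ClinicalExtraction.model_validate_json(raw)
    print("Valid record:", record.patient_name, record.date_of_birth)
except ValidationError as exc:
    # Malformed or incomplete output: do not persist; route to review.
    print("Rejected extraction:", exc)
```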
- Use a structured extractor when you need tighter control

If you need stronger schema enforcement across many documents, use LlamaIndex’s structured extraction flow: wrap ClinicalExtraction in a Pydantic program and run it over your nodes with PydanticProgramExtractor (the program class ships in the llama-index-program-openai package, included in the default llama-index install). This is the pattern I’d use for production document pipelines where downstream systems expect clean records.

```python
from llama_index.core.schema import TextNode
from llama_index.core.extractors import PydanticProgramExtractor
from llama_index.llms.openai import OpenAI
from llama_index.program.openai import OpenAIPydanticProgram

llm = OpenAI(model="gpt-4o-mini", temperature=0)

# The program binds the LLM to the ClinicalExtraction schema.
program = OpenAIPydanticProgram.from_defaults(
    output_cls=ClinicalExtraction,
    llm=llm,
    prompt_template_str="{input}",
)

# The extractor formats each node's text into the template below
# and feeds it to the program via the "input" key.
extractor = PydanticProgramExtractor(
    program=program,
    input_key="input",
    extract_template_str=(
        "Extract clinical fields from the text below.\n"
        "Return only data supported by the source text.\n"
        "{context_str}"
    ),
)

nodes = [TextNode(text=doc.text, metadata=doc.metadata) for doc in docs]

# extract() returns one metadata dict per node, parsed via the schema.
metadata_list = extractor.extract(nodes)
print(metadata_list[0])
```
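`extract()` hands back raw dicts. To attach the extracted fields onto the nodes themselves as metadata, the extractor can also run as a transformation inside an ingestion pipeline; a brief sketch:

```python
from llama_index.core.ingestion import IngestionPipeline

# Runs the extractor over the nodes and merges each result dict
# into the corresponding node's metadata.
pipeline = IngestionPipeline(transformations=[extractor])
enriched_nodes = pipeline.run(nodes=nodes)

print(enriched_nodes[0].metadata)
```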
Production Considerations
- Compliance first
  - Treat all extracted content as PHI.
  - Run inside a HIPAA-aligned environment with access controls, encryption at rest/in transit, and audit logs on every request.
- Data residency
  - Keep both source documents and model traffic in approved regions.
  - If you use a hosted LLM endpoint, confirm regional processing guarantees and retention settings before sending any PHI.
- Human review for low-confidence extractions
  - Route ambiguous cases to a reviewer when key fields are missing or conflicting (see the validation gate sketched after this list).
  - Don’t auto-write extracted diagnoses or medication lists into the EHR without validation.
- Monitoring
  - Track field-level accuracy, missing-field rate, hallucination rate, and turnaround time.
  - Log source document IDs plus chunk references so compliance teams can trace every output back to evidence.
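A minimal sketch of such a validation gate, assuming the ClinicalExtraction schema from earlier; the checks and routing labels are deliberately simple illustrations.

```python
from datetime import datetime

def route_extraction(record: ClinicalExtraction) -> str:
    """Return 'auto_persist' or 'human_review' for an extracted record.

    A deliberately simple gate: a real pipeline would also normalize
    medications and validate diagnosis codes against the ICD-10 code set.
    """
    problems = []

    if not record.patient_name.strip():
        problems.append("missing patient_name")

    try:
        datetime.strptime(record.date_of_birth, "%Y-%m-%d")
    except ValueError:
        problems.append("date_of_birth not in YYYY-MM-DD")

    if not record.diagnosis:
        problems.append("no diagnoses extracted")

    # Anything questionable goes to a reviewer instead of the EHR.
    return "human_review" if problems else "auto_persist"
```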
Common Pitfalls
- Sending raw PHI without controls
  - Mistake: piping full charts directly into an external model endpoint.
  - Avoid it by using approved infrastructure, redaction where possible, and vendor settings that disable training/retention on your data.
- Letting the model free-form its output
  - Mistake: accepting prose summaries when your workflow needs structured fields.
  - Avoid it by using Pydantic schemas, strict parsing, and validation before persistence.
- Ignoring document provenance
  - Mistake: storing extracted values without linking them back to source pages or chunks.
  - Avoid it by persisting metadata from Document objects and keeping traceability for audits and disputes (see the provenance sketch after this list).
- Over-trusting single-pass extraction
  - Mistake: assuming one model call will correctly capture every field from noisy scans or multi-page referrals.
  - Avoid it by chunking intelligently, retrieving relevant sections first with VectorStoreIndex, and adding fallback review queues for uncertain cases.
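A minimal sketch of that provenance record, reusing `results` from the retriever example and the validated `record` from the validation step. The row layout is a hypothetical illustration, not a required schema.

```python
# Each retrieved chunk carries a stable node_id; persisting these IDs
# alongside the extracted record lets auditors trace every field back
# to the exact source chunk.
source_refs = [r.node.node_id for r in results]

audit_row = {
    "record": record.model_dump(),   # validated ClinicalExtraction fields
    "source_node_ids": source_refs,  # chunk-level provenance
    "source_file": results[0].node.metadata.get("file_name"),
}
print(audit_row)
```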
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.