How to Build a Document Extraction Agent Using LangGraph in Python for Healthcare
A document extraction agent for healthcare takes unstructured clinical PDFs, scanned referrals, discharge summaries, and lab reports, then turns them into structured data you can route into EHRs, claims systems, prior auth workflows, or analytics pipelines. It matters because healthcare ops still runs on documents, and the difference between a clean extraction pipeline and a manual queue is time, cost, and patient safety.
Architecture
- **Ingestion layer**
  - Accept PDFs, images, or text payloads from secure storage or an internal API.
  - Normalize file type early so downstream nodes work with consistent inputs.
- **Document parsing node**
  - Extract raw text from PDFs and OCR output from scans.
  - Preserve page numbers and offsets for auditability.
- **Extraction node**
  - Use an LLM to map text into a strict schema: patient name, DOB, MRN, diagnosis codes, medications, and dates.
  - Return structured JSON only.
- **Validation node**
  - Check required fields, date formats, code formats, and confidence thresholds.
  - Reject or route low-quality outputs to human review.
- **Human review / exception path**
  - Handle ambiguous cases like poor scans, missing identifiers, or conflicting dates.
  - Keep a reviewer in the loop for PHI-sensitive decisions.
- **Persistence and audit layer**
  - Store extracted records plus provenance: source document ID, page references, model version, timestamps.
  - This is what you need for compliance and traceability.
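As a sketch, the audit record the persistence layer writes might look like the following. The field names here are illustrative, not a standard:

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

# Hypothetical provenance record; field names are illustrative only.
@dataclass
class ProvenanceRecord:
    source_doc_id: str      # ID of the source file in secure storage
    page_refs: list[int]    # pages the extracted fields came from
    model_version: str      # exact model identifier used for extraction
    extracted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = ProvenanceRecord(
    source_doc_id="doc-8891",
    page_refs=[1, 2],
    model_version="gpt-4o-mini-2024-07-18",
)
print(asdict(record))
```

Writing this record in the same transaction as the extracted data keeps the audit trail from drifting out of sync with the records it describes.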
Implementation
1) Define the state and schema
Use a typed state object so each node has a clear contract. For healthcare extraction, keep the raw text separate from the structured output and validation status.
```python
from typing import TypedDict, Optional

from langgraph.graph import StateGraph, START, END
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI


class ExtractedDoc(BaseModel):
    patient_name: Optional[str] = None
    dob: Optional[str] = None
    mrn: Optional[str] = None
    diagnosis: list[str] = Field(default_factory=list)
    medications: list[str] = Field(default_factory=list)
    confidence: float = 0.0


class DocState(TypedDict):
    raw_text: str
    extracted: dict
    validated: bool
    needs_review: bool


llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
```
2) Build the extraction and validation nodes
This pattern keeps extraction deterministic enough for production. The model returns structured data through `with_structured_output`, then validation decides whether to continue or branch.
```python
def extract_node(state: DocState):
    extractor = llm.with_structured_output(ExtractedDoc)
    result = extractor.invoke(
        f"""
        Extract healthcare fields from this document.
        Return patient_name, dob in YYYY-MM-DD if available,
        mrn, diagnosis list, medications list, and confidence.

        DOCUMENT:
        {state["raw_text"]}
        """
    )
    return {"extracted": result.model_dump()}


def validate_node(state: DocState):
    data = state["extracted"]
    required = ["patient_name", "dob", "mrn"]
    missing = [k for k in required if not data.get(k)]
    valid_date = isinstance(data.get("dob"), str) and len(data["dob"]) == 10
    high_confidence = float(data.get("confidence", 0)) >= 0.75
    needs_review = bool(missing) or not valid_date or not high_confidence
    return {"validated": not needs_review, "needs_review": needs_review}
```
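Because the validation rules are plain Python, you can sanity-check them without any LLM call. A minimal standalone check, with made-up sample values:

```python
# Sample extraction output (invented values); mirrors the checks in validate_node.
sample = {
    "patient_name": "Jane Doe",
    "dob": "1982-04-11",
    "mrn": "H1234567",
    "confidence": 0.92,
}

required = ["patient_name", "dob", "mrn"]
missing = [k for k in required if not sample.get(k)]
valid_date = isinstance(sample.get("dob"), str) and len(sample["dob"]) == 10
high_confidence = float(sample.get("confidence", 0)) >= 0.75

needs_review = bool(missing) or not valid_date or not high_confidence
print(needs_review)  # False: fields present, date well-formed, confidence >= 0.75

# Drop the MRN and the same rules flag the record for review.
sample_missing = {**sample, "mrn": None}
print(bool([k for k in required if not sample_missing.get(k)]))  # True
```

Keeping these rules deterministic means you can unit-test the review threshold independently of the model.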
3) Route low-confidence documents to review
LangGraph gives you explicit control over branching with `add_conditional_edges`. That is the right pattern for healthcare because you do not want silent failures on PHI-bearing documents.
```python
def route_after_validation(state: DocState):
    return "review" if state["needs_review"] else "done"


def review_node(state: DocState):
    # In production this would create a task in your reviewer queue.
    print("Routing document to human review:", state["extracted"])
    return {}


graph = StateGraph(DocState)
graph.add_node("extract", extract_node)
graph.add_node("validate", validate_node)
graph.add_node("review", review_node)

graph.add_edge(START, "extract")
graph.add_edge("extract", "validate")
graph.add_conditional_edges("validate", route_after_validation, {
    "review": "review",
    "done": END,
})
graph.add_edge("review", END)

app = graph.compile()
```
4) Run it with a real input payload
Keep execution isolated per document. In healthcare environments you usually want one run per record with immutable logs tied to the source object ID.
```python
result = app.invoke({
    "raw_text": """
    Discharge Summary
    Patient Name: Jane Doe
    DOB: 1982-04-11
    MRN: H1234567
    Diagnoses: Hypertension; Type 2 diabetes mellitus
    Medications: Metformin; Lisinopril
    """,
    "extracted": {},
    "validated": False,
    "needs_review": False,
})
print(result)
```
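One way to get the "one run per record with immutable logs" property is to wrap each invocation with a per-document run ID and an append-only log entry. A sketch of that pattern, where the `run_fn` stub stands in for `app.invoke` so it runs without an LLM:

```python
import json
import uuid
from datetime import datetime, timezone

def process(doc_id: str, raw_text: str, run_fn) -> dict:
    """Run one extraction per document and return an immutable log entry."""
    run_id = str(uuid.uuid4())
    result = run_fn({"raw_text": raw_text, "extracted": {},
                     "validated": False, "needs_review": False})
    return {
        "run_id": run_id,
        "source_doc_id": doc_id,
        "started_at": datetime.now(timezone.utc).isoformat(),
        "needs_review": result["needs_review"],
    }

# Stub in place of app.invoke so the pattern is demonstrable without an LLM.
fake_invoke = lambda state: {**state, "validated": True, "needs_review": False}
entry = process("doc-8891", "Discharge Summary ...", fake_invoke)
print(json.dumps(entry, indent=2))
```

In production, `run_fn` would be `app.invoke` and each log entry would be appended to durable storage keyed by the source object ID.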
Production Considerations
- **Compliance controls**
  - Treat all inputs as PHI by default.
  - Encrypt at rest and in transit.
  - Log access by user/service identity so you can support HIPAA audits.
- **Data residency**
  - Keep document processing inside approved regions.
  - If your organization restricts PHI to specific jurisdictions, pin model endpoints and storage buckets accordingly.
- **Monitoring**
  - Track extraction accuracy by field type, not just overall success rate.
  - Watch for drift in scan quality, OCR failure rates, and reviewer overrides.
- **Guardrails**
  - Add schema validation before persistence.
  - Block writes when critical fields are missing or when confidence drops below your threshold.
  - Never let the agent directly overwrite source-of-truth clinical systems without human approval on exceptions.
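Field-level accuracy tracking can be as simple as comparing model output to a small labeled sample. A minimal sketch, assuming you maintain such a sample (the data here is invented):

```python
# Each pair is (model output, ground truth) for one document; values are made up.
labeled = [
    ({"dob": "1982-04-11", "mrn": "H1234567"},
     {"dob": "1982-04-11", "mrn": "H1234567"}),
    ({"dob": "1982-04-11", "mrn": "H7654321"},
     {"dob": "1982-04-11", "mrn": "H1234567"}),  # MRN misread by OCR
]

fields = ["dob", "mrn"]
accuracy = {
    f: sum(pred.get(f) == truth.get(f) for pred, truth in labeled) / len(labeled)
    for f in fields
}
print(accuracy)  # {'dob': 1.0, 'mrn': 0.5}
```

An overall success rate would hide the MRN problem here; the per-field breakdown surfaces it immediately, which is the point of tracking accuracy by field type.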
Common Pitfalls
- **Using free-form LLM output**
  - Problem: the model returns prose instead of machine-readable fields.
  - Fix: use `with_structured_output()` with a Pydantic schema every time.
- **Skipping provenance**
  - Problem: you cannot explain where a field came from during audit or dispute resolution.
  - Fix: store source document IDs, page numbers, extraction timestamps, and model version alongside the extracted record.
- **Auto-accepting low-confidence extractions**
  - Problem: bad OCR or ambiguous handwriting gets written into downstream systems.
  - Fix: branch low-confidence cases into human review with `add_conditional_edges()` and enforce hard thresholds before persistence.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.