How to Build a Document Extraction Agent Using LangChain in Python for Healthcare
A document extraction agent in healthcare takes unstructured clinical documents — referrals, discharge summaries, lab reports, prior auth forms — and turns them into structured data you can route into EHR workflows, claims systems, or care coordination tools. The point is not just automation; it’s reducing manual transcription errors, speeding up intake, and creating an auditable extraction pipeline that respects compliance and data residency constraints.
Architecture
- Document ingestion layer
  - Accept PDFs, scans, DOCX files, or text exports from hospital systems.
  - Normalize inputs before extraction so the downstream model sees consistent content.
- Text extraction and chunking
  - Use `PyPDFLoader`, `UnstructuredFileLoader`, or OCR upstream if the source is image-based.
  - Split long documents with `RecursiveCharacterTextSplitter` to keep prompts within context limits.
- Structured extraction chain
  - Use LangChain’s `ChatPromptTemplate` plus a chat model such as `ChatOpenAI`.
  - Force a schema with `PydanticOutputParser` so output matches your clinical fields.
- Validation and guardrails
  - Validate extracted fields against domain rules: dates, ICD codes, medication names, provider IDs.
  - Reject or flag low-confidence outputs for human review.
- Audit and storage layer
  - Persist the raw input hash, extracted JSON, model version, prompt version, and timestamp.
  - This is mandatory if you need traceability for PHI handling and internal audits.
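The layers above can be sketched as one pipeline function. This is a minimal sketch, not the full implementation: `run_extraction_pipeline`, the stubbed `extract_fn`, and the audit field names are illustrative stand-ins for the loader, LangChain chain, and storage layer built in the rest of this post.

```python
import hashlib
from datetime import datetime, timezone


def run_extraction_pipeline(raw_bytes: bytes, extract_fn) -> dict:
    """Wire the layers together: ingest, extract, guard, and audit."""
    # Ingestion/normalization stand-in: real code routes PDFs through a loader or OCR.
    text = raw_bytes.decode("utf-8", errors="replace")
    # Structured extraction stand-in: in practice this calls the LangChain chain.
    record = extract_fn(text)
    # Trivial guardrail: flag the record for human review if patient identity is missing.
    needs_review = not record.get("patient_name")
    # Audit entry: everything needed to reproduce and explain the extraction later.
    return {
        "source_hash": hashlib.sha256(raw_bytes).hexdigest(),
        "extracted": record,
        "model_version": "gpt-4o-mini",
        "prompt_version": "v1",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "needs_review": needs_review,
    }


audit = run_extraction_pipeline(
    b"Patient: Jane Doe", lambda text: {"patient_name": "Jane Doe"}
)
```

Every document that enters the system leaves behind an audit record, and the review flag is computed in code rather than left to the model.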
Implementation
1. Install the core packages
You want LangChain plus a PDF loader and a parser that can enforce structure.
```bash
pip install langchain langchain-openai langchain-community pydantic pypdf
```
2. Define the extraction schema
For healthcare documents, don’t ask the model for free-form text. Define exactly what you want back.
```python
from typing import Optional

from pydantic import BaseModel, Field


class ClinicalDocument(BaseModel):
    patient_name: str = Field(description="Full patient name")
    date_of_birth: Optional[str] = Field(default=None, description="Patient date of birth in YYYY-MM-DD")
    document_date: Optional[str] = Field(default=None, description="Document date in YYYY-MM-DD")
    provider_name: Optional[str] = Field(default=None, description="Treating provider or facility name")
    diagnosis: Optional[str] = Field(default=None, description="Primary diagnosis or reason for visit")
    medications: list[str] = Field(default_factory=list, description="List of medications mentioned")
    procedures: list[str] = Field(default_factory=list, description="List of procedures or tests mentioned")
    follow_up_instructions: Optional[str] = Field(default=None, description="Follow-up instructions if present")
```
3. Build the LangChain extraction chain
This pattern uses PyPDFLoader, RecursiveCharacterTextSplitter, ChatPromptTemplate, and PydanticOutputParser. It extracts structured data from a clinical PDF and returns validated Python objects.
```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser

loader = PyPDFLoader("sample_clinical_note.pdf")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

parser = PydanticOutputParser(pydantic_object=ClinicalDocument)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You extract structured data from healthcare documents. "
     "Only use facts present in the text. "
     "If a field is missing, leave it null or empty."),
    ("human",
     "Extract the following fields from this document chunk:\n{format_instructions}\n\n"
     "Document text:\n{text}")
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
chain = prompt | llm | parser

results = []
for chunk in chunks:
    parsed = chain.invoke({
        "text": chunk.page_content,
        "format_instructions": parser.get_format_instructions()
    })
    results.append(parsed)

print(results[0].model_dump())
```
4. Add a simple aggregation step
Healthcare notes are often split across pages. You usually need to merge partial results into one record before sending downstream.
```python
def merge_documents(items: list[ClinicalDocument]) -> ClinicalDocument:
    base = items[0]
    for item in items[1:]:
        if not base.date_of_birth and item.date_of_birth:
            base.date_of_birth = item.date_of_birth
        if not base.document_date and item.document_date:
            base.document_date = item.document_date
        if not base.provider_name and item.provider_name:
            base.provider_name = item.provider_name
        if not base.diagnosis and item.diagnosis:
            base.diagnosis = item.diagnosis
        base.medications.extend([m for m in item.medications if m not in base.medications])
        base.procedures.extend([p for p in item.procedures if p not in base.procedures])
        if not base.follow_up_instructions and item.follow_up_instructions:
            base.follow_up_instructions = item.follow_up_instructions
    return base


final_record = merge_documents(results)
print(final_record.model_dump())
```
Production Considerations
- Protect PHI at every hop
  - Encrypt documents at rest and in transit.
  - Redact unnecessary identifiers before sending text to the model when possible.
- Keep audit trails
  - Store the input document ID, hash of the source text, extracted output, prompt version, model name, and user/service account.
  - If an auditor asks why a field was populated incorrectly, you need reproducibility.
- Respect data residency
  - Make sure the model endpoint runs in the correct region.
  - For regulated deployments, avoid routing PHI to unsupported jurisdictions or shared consumer endpoints.
- Add human review thresholds
  - Flag records when confidence is low or when critical fields conflict.
  - In healthcare you do not want silent failures on patient identity or medication lists.
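A review threshold can be as simple as checking critical fields for absence or disagreement across the chunk-level results before merging. The field list and rules below are illustrative, assuming records as plain dicts; tune them to your own intake workflow.

```python
CRITICAL_FIELDS = ("patient_name", "date_of_birth", "medications")


def needs_human_review(chunk_records: list[dict]) -> bool:
    """Flag the record when a critical field is missing everywhere or conflicts across chunks."""
    for field in CRITICAL_FIELDS:
        values = {str(r[field]) for r in chunk_records if r.get(field)}
        if len(values) == 0:
            # Field never extracted: do not write it downstream silently.
            return True
        if len(values) > 1 and field != "medications":
            # Scalar identity fields must agree on every page; medication lists are merged instead.
            return True
    return False
```

Anything this function flags goes to a human queue rather than straight into the EHR write path.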
Common Pitfalls
- Using free-form output instead of a schema
  - Mistake: asking the LLM to “summarize” a document.
  - Fix: use `PydanticOutputParser` or another strict parser so downstream systems receive predictable JSON.
- Sending entire scanned packets without preprocessing
  - Mistake: dumping multi-page PDFs directly into one prompt.
  - Fix: load with `PyPDFLoader`, split with `RecursiveCharacterTextSplitter`, then aggregate results across chunks.
- Ignoring validation for clinical fields
  - Mistake: trusting whatever the model extracts for DOBs, meds, or diagnoses.
  - Fix: validate formats and cross-check against known vocabularies or business rules before writing into EHR workflows.
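A minimal sketch of that last fix, assuming the YYYY-MM-DD convention from the schema and a tiny stand-in medication vocabulary; a real deployment would check against a formulary or a terminology service such as RxNorm.

```python
import re

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
KNOWN_MEDICATIONS = {"metformin", "lisinopril", "atorvastatin"}  # stand-in vocabulary


def validate_record(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record passed."""
    errors = []
    # Date fields must match the format the schema promised downstream systems.
    for field in ("date_of_birth", "document_date"):
        value = record.get(field)
        if value and not ISO_DATE.match(value):
            errors.append(f"{field} is not in YYYY-MM-DD format: {value!r}")
    # Medications outside the known vocabulary get routed to review, not rejected outright.
    for med in record.get("medications", []):
        if med.lower() not in KNOWN_MEDICATIONS:
            errors.append(f"medication not in vocabulary, route to review: {med!r}")
    return errors
```

Run this between the merge step and any EHR write, and treat a non-empty error list as a review trigger rather than a hard failure.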
A good healthcare extraction agent is boring in the right way. It should be deterministic where it matters, observable everywhere else, and designed so compliance teams can inspect exactly how each field was produced.
By Cyprian Aarons, AI Consultant at Topiax.