How to Build a Document Extraction Agent Using LangChain in Python for Healthcare
A document extraction agent in healthcare takes unstructured clinical documents — referrals, discharge summaries, lab reports, prior auth forms — and turns them into structured data you can route into EHR workflows, claims systems, or care coordination tools. The point is not just automation; it’s reducing manual transcription errors, speeding up intake, and creating an auditable extraction pipeline that respects compliance and data residency constraints.
Architecture
- Document ingestion layer
  - Accept PDFs, scans, DOCX files, or text exports from hospital systems.
  - Normalize inputs before extraction so the downstream model sees consistent content.
- Text extraction and chunking
  - Use `PyPDFLoader`, `UnstructuredFileLoader`, or OCR upstream if the source is image-based.
  - Split long documents with `RecursiveCharacterTextSplitter` to keep prompts within context limits.
- Structured extraction chain
  - Use LangChain’s `ChatPromptTemplate` plus a chat model such as `ChatOpenAI`.
  - Force a schema with `PydanticOutputParser` so output matches your clinical fields.
- Validation and guardrails
  - Validate extracted fields against domain rules: dates, ICD codes, medication names, provider IDs.
  - Reject or flag low-confidence outputs for human review.
- Audit and storage layer
  - Persist the raw input hash, extracted JSON, model version, prompt version, and timestamp.
  - This is mandatory if you need traceability for PHI handling and internal audits.
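The layers above can be sketched as one pipeline function. This is a minimal sketch, not the full implementation: `run_extraction_pipeline`, the stubbed `extract_fn`, and the audit field names are illustrative stand-ins for the loader, LangChain chain, and storage layer built in the rest of this post.

```python
import hashlib
from datetime import datetime, timezone


def run_extraction_pipeline(raw_bytes: bytes, extract_fn) -> dict:
    """Wire the layers together: ingest, extract, guard, and audit."""
    # Ingestion/normalization stand-in: real code routes PDFs through a loader or OCR.
    text = raw_bytes.decode("utf-8", errors="replace")
    # Structured extraction stand-in: in practice this calls the LangChain chain.
    record = extract_fn(text)
    # Trivial guardrail: flag the record for human review if patient identity is missing.
    needs_review = not record.get("patient_name")
    # Audit entry: everything needed to reproduce and explain the extraction later.
    return {
        "source_hash": hashlib.sha256(raw_bytes).hexdigest(),
        "extracted": record,
        "model_version": "gpt-4o-mini",
        "prompt_version": "v1",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "needs_review": needs_review,
    }


audit = run_extraction_pipeline(
    b"Patient: Jane Doe", lambda text: {"patient_name": "Jane Doe"}
)
```

Every document that enters the system leaves behind an audit record, and the review flag is computed in code rather than left to the model.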
Implementation
1. Install the core packages
You want LangChain plus a PDF loader and a parser that can enforce structure.
```bash
pip install langchain langchain-openai langchain-community pydantic pypdf
```
2. Define the extraction schema
For healthcare documents, don’t ask the model for free-form text. Define exactly what you want back.
```python
from typing import Optional

from pydantic import BaseModel, Field


class ClinicalDocument(BaseModel):
    patient_name: str = Field(description="Full patient name")
    date_of_birth: Optional[str] = Field(default=None, description="Patient date of birth in YYYY-MM-DD")
    document_date: Optional[str] = Field(default=None, description="Document date in YYYY-MM-DD")
    provider_name: Optional[str] = Field(default=None, description="Treating provider or facility name")
    diagnosis: Optional[str] = Field(default=None, description="Primary diagnosis or reason for visit")
    medications: list[str] = Field(default_factory=list, description="List of medications mentioned")
    procedures: list[str] = Field(default_factory=list, description="List of procedures or tests mentioned")
    follow_up_instructions: Optional[str] = Field(default=None, description="Follow-up instructions if present")
```
3. Build the LangChain extraction chain
This pattern uses PyPDFLoader, RecursiveCharacterTextSplitter, ChatPromptTemplate, and PydanticOutputParser. It extracts structured data from a clinical PDF and returns validated Python objects.
```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser

loader = PyPDFLoader("sample_clinical_note.pdf")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

parser = PydanticOutputParser(pydantic_object=ClinicalDocument)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You extract structured data from healthcare documents. "
     "Only use facts present in the text. "
     "If a field is missing, leave it null or empty."),
    ("human",
     "Extract the following fields from this document chunk:\n{format_instructions}\n\n"
     "Document text:\n{text}")
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
chain = prompt | llm | parser

results = []
for chunk in chunks:
    parsed = chain.invoke({
        "text": chunk.page_content,
        "format_instructions": parser.get_format_instructions()
    })
    results.append(parsed)

print(results[0].model_dump())
```
4. Add a simple aggregation step
Healthcare notes are often split across pages. You usually need to merge partial results into one record before sending downstream.
```python
def merge_documents(items: list[ClinicalDocument]) -> ClinicalDocument:
    base = items[0]
    for item in items[1:]:
        if not base.date_of_birth and item.date_of_birth:
            base.date_of_birth = item.date_of_birth
        if not base.document_date and item.document_date:
            base.document_date = item.document_date
        if not base.provider_name and item.provider_name:
            base.provider_name = item.provider_name
        if not base.diagnosis and item.diagnosis:
            base.diagnosis = item.diagnosis
        base.medications.extend([m for m in item.medications if m not in base.medications])
        base.procedures.extend([p for p in item.procedures if p not in base.procedures])
        if not base.follow_up_instructions and item.follow_up_instructions:
            base.follow_up_instructions = item.follow_up_instructions
    return base


final_record = merge_documents(results)
print(final_record.model_dump())
```
Production Considerations
- Protect PHI at every hop
  - Encrypt documents at rest and in transit.
  - Redact unnecessary identifiers before sending text to the model when possible.
- Keep audit trails
  - Store the input document ID, hash of the source text, extracted output, prompt version, model name, and user/service account.
  - If an auditor asks why a field was populated incorrectly, you need reproducibility.
- Respect data residency
  - Make sure the model endpoint runs in the correct region.
  - For regulated deployments, avoid routing PHI to unsupported jurisdictions or shared consumer endpoints.
- Add human review thresholds
  - Flag records when confidence is low or when critical fields conflict.
  - In healthcare you do not want silent failures on patient identity or medication lists.
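A review threshold can be as simple as checking critical fields for absence or disagreement across the chunk-level results before merging. The field list and rules below are illustrative, assuming records as plain dicts; tune them to your own intake workflow.

```python
CRITICAL_FIELDS = ("patient_name", "date_of_birth", "medications")


def needs_human_review(chunk_records: list[dict]) -> bool:
    """Flag the record when a critical field is missing everywhere or conflicts across chunks."""
    for field in CRITICAL_FIELDS:
        values = {str(r[field]) for r in chunk_records if r.get(field)}
        if len(values) == 0:
            # Field never extracted: do not write it downstream silently.
            return True
        if len(values) > 1 and field != "medications":
            # Scalar identity fields must agree on every page; medication lists are merged instead.
            return True
    return False
```

Anything this function flags goes to a human queue rather than straight into the EHR write path.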
Common Pitfalls
- Using free-form output instead of a schema
  - Mistake: asking the LLM to “summarize” a document.
  - Fix: use `PydanticOutputParser` or another strict parser so downstream systems receive predictable JSON.
- Sending entire scanned packets without preprocessing
  - Mistake: dumping multi-page PDFs directly into one prompt.
  - Fix: load with `PyPDFLoader`, split with `RecursiveCharacterTextSplitter`, then aggregate results across chunks.
- Ignoring validation for clinical fields
  - Mistake: trusting whatever the model extracts for DOBs, meds, or diagnoses.
  - Fix: validate formats and cross-check against known vocabularies or business rules before writing into EHR workflows.
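A minimal sketch of that last fix, assuming the YYYY-MM-DD convention from the schema and a tiny stand-in medication vocabulary; a real deployment would check against a formulary or a terminology service such as RxNorm.

```python
import re

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
KNOWN_MEDICATIONS = {"metformin", "lisinopril", "atorvastatin"}  # stand-in vocabulary


def validate_record(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record passed."""
    errors = []
    # Date fields must match the format the schema promised downstream systems.
    for field in ("date_of_birth", "document_date"):
        value = record.get(field)
        if value and not ISO_DATE.match(value):
            errors.append(f"{field} is not in YYYY-MM-DD format: {value!r}")
    # Medications outside the known vocabulary get routed to review, not rejected outright.
    for med in record.get("medications", []):
        if med.lower() not in KNOWN_MEDICATIONS:
            errors.append(f"medication not in vocabulary, route to review: {med!r}")
    return errors
```

Run this between the merge step and any EHR write, and treat a non-empty error list as a review trigger rather than a hard failure.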
A good healthcare extraction agent is boring in the right way. It should be deterministic where it matters, observable everywhere else, and designed so compliance teams can inspect exactly how each field was produced.
By Cyprian Aarons, AI Consultant at Topiax.