How to Build a Document Extraction Agent Using LangChain in TypeScript for Healthcare
A document extraction agent for healthcare reads unstructured clinical documents, pulls out structured fields, and returns them in a format your downstream systems can trust. In practice, that means extracting things like patient identifiers, encounter dates, diagnosis codes, medications, lab values, and signer metadata from PDFs, scanned forms, discharge summaries, and referral letters.
For healthcare teams, this matters because manual abstraction is slow, expensive, and error-prone. If you build it correctly, you get faster intake, cleaner EHR integration, better auditability, and a path to automate workflows without breaking compliance.
Architecture
- Document ingestion layer
  - Accept PDFs, text files, and OCR output.
  - Normalize everything into Document objects from LangChain.
- Chunking and preprocessing
  - Split long clinical documents with RecursiveCharacterTextSplitter.
  - Preserve section boundaries where possible so extraction stays accurate.
- Extraction chain
  - Use a chat model with structured output.
  - Ask for a strict schema: patient info, encounter details, diagnoses, meds, labs, and confidence flags.
- Validation layer
  - Validate the model output against a TypeScript schema using zod.
  - Reject or flag incomplete extractions before they hit downstream systems.
- Audit and observability
  - Store source document IDs, extracted fields, model version, prompt version, and timestamps.
  - Keep an immutable trail for compliance reviews.
- Secure storage and routing
  - Encrypt PHI at rest and in transit.
  - Route data only to approved regions if you have residency constraints.
Implementation
1) Define the extraction schema
For healthcare extraction, don’t return free-form JSON. Define the exact fields you need and keep them typed.
import { z } from "zod";
export const ClinicalExtractionSchema = z.object({
  patientName: z.string().optional(),
  mrn: z.string().optional(),
  dob: z.string().optional(),
  encounterDate: z.string().optional(),
  facilityName: z.string().optional(),
  diagnoses: z.array(z.string()).default([]),
  medications: z.array(
    z.object({
      name: z.string(),
      dose: z.string().optional(),
      route: z.string().optional(),
      frequency: z.string().optional(),
    })
  ).default([]),
  labs: z.array(
    z.object({
      testName: z.string(),
      value: z.string().optional(),
      unit: z.string().optional(),
      referenceRange: z.string().optional(),
    })
  ).default([]),
  signerName: z.string().optional(),
  confidenceNotes: z.array(z.string()).default([]),
});
export type ClinicalExtraction = z.infer<typeof ClinicalExtractionSchema>;
This schema gives you two things: validation at runtime and a stable contract for your EHR or claims pipeline.
2) Load documents and split them safely
Use LangChain’s Document class plus a text splitter. If you’re starting from PDFs or OCR text upstream, this is where you normalize the content before extraction.
import { Document } from "@langchain/core/documents";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
const rawText = `
Discharge Summary
Patient Name: Jane Doe
MRN: 123456
DOB: 1980-02-14
Encounter Date: 2024-10-03
Diagnosis: Type 2 diabetes mellitus
Medication: Metformin 500 mg PO BID
Lab: HbA1c 8.2 %
Signed by Dr. Patel
`;
const docs = [
  new Document({
    pageContent: rawText,
    metadata: {
      sourceId: "doc-001",
      documentType: "discharge-summary",
    },
  }),
];

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 2000,
  chunkOverlap: 200,
});

const chunks = await splitter.splitDocuments(docs);
For short clinical notes you may not need splitting. For multi-page referrals or scanned packets converted to text, chunking prevents context loss and keeps token usage predictable.
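One way to preserve section boundaries is to pre-split on known headings before handing text to the character splitter. This is an illustrative helper, not a LangChain API, and the heading list is an assumption you would tune per document type:

```typescript
// Hypothetical helper: split clinical text on known section headings so the
// character splitter never cuts across a section boundary.
const SECTION_HEADINGS = ["Discharge Summary", "Medications", "Labs", "Assessment"];

function splitOnSections(text: string): string[] {
  // Zero-width split: each section keeps its own heading.
  const pattern = new RegExp(`(?=^(?:${SECTION_HEADINGS.join("|")})\\b)`, "m");
  return text
    .split(pattern)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

const sections = splitOnSections(
  "Discharge Summary\nPatient stable.\nMedications\nMetformin 500 mg PO BID"
);
```

Each returned section can then be passed through RecursiveCharacterTextSplitter individually, so a chunk never mixes two sections.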
3) Build the extraction chain with structured output
This is the core pattern. Use ChatOpenAI with withStructuredOutput() so the model returns data that matches your schema instead of raw text you have to parse yourself.
import { ChatOpenAI } from "@langchain/openai";
import { PromptTemplate } from "@langchain/core/prompts";
import { ClinicalExtractionSchema } from "./schema";
const llm = new ChatOpenAI({
  modelName: "gpt-4o-mini",
  temperature: 0,
});
const prompt = PromptTemplate.fromTemplate(`
You are extracting structured data from a healthcare document.
Rules:
- Only extract facts explicitly present in the document.
- Do not infer missing values.
- If a field is absent, omit it or return an empty array.
- Preserve medication names exactly as written.
- Return dates in ISO format when possible.
Document:
{document}
`);
const extractor = llm.withStructuredOutput(ClinicalExtractionSchema);
export async function extractClinicalData(documentText: string) {
  const formattedPrompt = await prompt.format({ document: documentText });
  const result = await extractor.invoke(formattedPrompt);
  return result;
}
The important bit here is withStructuredOutput(). That gives you typed output aligned with your Zod schema and reduces brittle post-processing code.
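Even with structured output, individual calls can still fail transiently. A small generic retry wrapper (a sketch; the attempt count is an assumption, and production code would add backoff) keeps that policy in one place:

```typescript
// Generic retry helper for any async extraction call. Hypothetical policy:
// retry up to `attempts` times, rethrow the last error if all fail.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}

// Simulated flaky call: fails once, then succeeds.
let calls = 0;
const flaky = async () => {
  calls += 1;
  if (calls < 2) throw new Error("transient failure");
  return "ok";
};
```

Usage with the extractor above would look like `await withRetry(() => extractClinicalData(text))`.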
4) Run extraction over chunks and keep an audit trail
In production you want traceability. Capture the input chunk ID, output payload, model name, prompt version, and validation status.
type AuditRecord = {
  sourceId: string;
  chunkIndex: number;
  model: string;
  promptVersion: string;
  extracted: ClinicalExtraction;
  validatedAt: string;
};

export async function processChunks(chunksWithMeta = chunks) {
  const results: AuditRecord[] = [];
  for (let i = 0; i < chunksWithMeta.length; i++) {
    const chunk = chunksWithMeta[i];
    const extracted = await extractClinicalData(chunk.pageContent);
    results.push({
      sourceId: chunk.metadata.sourceId,
      chunkIndex: i,
      model: "gpt-4o-mini",
      promptVersion: "v1",
      extracted,
      validatedAt: new Date().toISOString(),
    });
  }
  return results;
}
In real code you'd persist these records to your audit store; the pattern is what matters:
- one record per chunk
- explicit metadata
- immutable history
- deterministic prompts
If you want stronger orchestration later on, LangChain's RunnableSequence or RunnableLambda can wrap preprocessing → extraction → validation → persistence cleanly.
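Before adopting RunnableSequence, the shape of that pipeline can be sketched with plain function composition. The stage names and stub implementations below are illustrative only; RunnableSequence and RunnableLambda wrap the same shapes and add tracing, batching, and streaming:

```typescript
// A stage is just a function from one pipeline type to the next.
type Stage<A, B> = (input: A) => B;

// Compose three stages into one callable pipeline.
function pipeline<A, B, C, D>(
  preprocess: Stage<A, B>,
  extract: Stage<B, C>,
  validate: Stage<C, D>
): Stage<A, D> {
  return (input) => validate(extract(preprocess(input)));
}

// Stub stages to show the data flow end to end:
const normalize: Stage<string, string> = (text) => text.trim();
const toTokens: Stage<string, string[]> = (text) => text.split(/\s+/);
const countFields: Stage<string[], number> = (tokens) => tokens.length;

const run = pipeline(normalize, toTokens, countFields);
```

The real stages would be async (model calls, database writes), which is exactly the bookkeeping Runnables handle for you.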
Production Considerations
- Compliance first
  - Treat every input as PHI unless proven otherwise.
  - Log access events for HIPAA audits.
  - Redact unnecessary identifiers before sending text to the model if your use case allows it.
- Data residency
  - Pin your inference region to approved geographies.
  - Use vendor configurations that guarantee regional processing.
  - Avoid sending PHI across borders through default cloud settings.
- Monitoring
  - Track extraction accuracy by document type.
  - Measure invalid schema rate, missing-field rate, and manual review rate.
  - Alert when confidence drops after prompt or model changes.
- Guardrails
  - Block unsupported document types early.
  - Add rule-based checks for impossible values like future DOBs or malformed MRNs.
  - Route low-confidence outputs to human review instead of auto-ingestion.
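The rule-based checks above can be sketched as plain predicates. The MRN format (6 digits) and the review policy are assumptions you would adapt to your registration system:

```typescript
// Illustrative guardrails: reject impossible values before ingestion.
function isFutureDate(iso: string, now: Date = new Date()): boolean {
  const d = new Date(iso);
  return !Number.isNaN(d.getTime()) && d.getTime() > now.getTime();
}

function isMalformedMrn(mrn: string): boolean {
  return !/^\d{6}$/.test(mrn); // assumed 6-digit numeric MRN
}

// Route to human review when any impossible value or confidence flag appears.
function needsHumanReview(record: {
  dob?: string;
  mrn?: string;
  confidenceNotes: string[];
}): boolean {
  if (record.dob && isFutureDate(record.dob)) return true;
  if (record.mrn && isMalformedMrn(record.mrn)) return true;
  return record.confidenceNotes.length > 0;
}
```

Run these checks after schema validation but before persistence, so flagged records never reach auto-ingestion.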
Common Pitfalls
- Using free-text outputs instead of structured schemas
  - This creates parsing bugs and inconsistent downstream data.
  - Fix it by using withStructuredOutput() plus Zod validation every time.
- Letting the model infer missing clinical facts
  - In healthcare that is dangerous because hallucinated diagnoses or meds can enter workflows.
  - Fix it with strict prompts that say "only extract explicitly stated facts."
- Skipping audit metadata
  - If you cannot show which model produced which field from which source at what time, compliance becomes painful fast.
  - Fix it by storing source IDs, prompt versions, model versions, timestamps, and validation status alongside every result.
A solid healthcare extraction agent is not just “LLM + PDF.” It is schema-first extraction with validation, traceability, residency controls, and a human review path for uncertain cases. Build those pieces up front and you’ll have something clinical teams can actually use in production.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.