# How to Build a Document Extraction Agent Using LangGraph in TypeScript for Healthcare
A document extraction agent for healthcare takes unstructured files like referral letters, discharge summaries, lab reports, and prior authorization forms, then turns them into structured JSON your downstream systems can use. That matters because healthcare operations still run on documents, and the difference between a usable record and a missed field is often a delayed claim, a broken care workflow, or a compliance issue.
## Architecture
A production-grade healthcare extraction agent needs these pieces:

- **Ingestion layer**
  - Accepts PDFs, scans, DOCX, and images from secure storage or an internal upload service.
  - Normalizes file metadata such as patient ID, source system, facility, and retention policy.
- **OCR / text extraction layer**
  - Converts scanned documents into text.
  - Preserves page numbers and bounding boxes when possible for auditability.
- **LangGraph orchestration**
  - Routes the document through extraction, validation, and escalation steps.
  - Keeps the workflow deterministic enough for regulated environments.
- **Structured extraction model**
  - Produces strict JSON matching a healthcare schema.
  - Extracts entities such as patient name, DOB, MRN, diagnosis codes, medications, dates of service, and provider details.
- **Validation and compliance layer**
  - Checks required fields, date formats, ICD-10/CPT patterns, and PHI handling rules.
  - Flags low-confidence outputs for human review.
- **Persistence and audit trail**
  - Stores raw input references, extracted output, confidence scores, model version, and reviewer actions.
  - Supports data residency constraints by keeping records in-region.
## Implementation
### 1) Define the extraction state and schema
Start with a strict state object. In healthcare you want typed outputs because downstream systems should not guess what `dob` means or whether `mrn` is missing.
```typescript
import { z } from "zod";

export const ExtractionSchema = z.object({
  patientName: z.string().optional(),
  dateOfBirth: z.string().optional(),
  mrn: z.string().optional(),
  encounterDate: z.string().optional(),
  diagnosisCodes: z.array(z.string()).default([]),
  medications: z.array(z.string()).default([]),
  providerName: z.string().optional(),
  sourceDocumentType: z.string().optional(),
});

export type ExtractionResult = z.infer<typeof ExtractionSchema>;

export type AgentState = {
  filePath: string;
  rawText?: string;
  result?: ExtractionResult;
  confidence?: number;
  needsReview?: boolean;
};
```
### 2) Build nodes for OCR/text loading, extraction, and validation
Use LangGraph’s `StateGraph` to wire the workflow. The pattern below uses actual LangGraph APIs: `StateGraph`, `.addNode()`, `.addEdge()`, `.addConditionalEdges()`, and `.compile()`.
```typescript
import { StateGraph, START, END } from "@langchain/langgraph";
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";
import { zodToJsonSchema } from "zod-to-json-schema";
import { ExtractionSchema, type AgentState, type ExtractionResult } from "./schema.js";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

async function loadDocument(state: AgentState): Promise<Partial<AgentState>> {
  // Replace with OCR/PDF parsing in production.
  // Bun runtime shown here; on Node, use readFile from "node:fs/promises".
  const rawText = await Bun.file(state.filePath).text();
  return { rawText };
}

async function extractFields(state: AgentState): Promise<Partial<AgentState>> {
  const prompt = `
Extract structured healthcare data from this document.
Return only valid JSON matching this schema:

${JSON.stringify(zodToJsonSchema(ExtractionSchema), null, 2)}

Document:
${state.rawText}
`;

  const response = await llm.invoke([new HumanMessage(prompt)]);

  // Models sometimes wrap JSON in markdown fences or prose;
  // take the outermost object literal before parsing.
  const content = response.content as string;
  const raw = content.slice(content.indexOf("{"), content.lastIndexOf("}") + 1);
  const parsed = ExtractionSchema.parse(JSON.parse(raw));

  return {
    result: parsed,
    confidence: estimateConfidence(parsed),
    needsReview: false,
  };
}

async function validateExtraction(state: AgentState): Promise<Partial<AgentState>> {
  const missingCritical =
    !state.result?.patientName ||
    !state.result?.dateOfBirth ||
    !state.result?.mrn;

  return {
    needsReview: Boolean(missingCritical || (state.confidence ?? 0) < 0.85),
  };
}

function estimateConfidence(result: ExtractionResult): number {
  // Crude heuristic: the fraction of key fields the model filled in.
  const fields = [
    result.patientName,
    result.dateOfBirth,
    result.mrn,
    result.encounterDate,
    result.providerName,
  ];
  const filled = fields.filter(Boolean).length;
  return filled / fields.length;
}
```

Continue with the graph wiring:
```typescript
// The channels map gives each state key last-value semantics:
// whatever a node returns for a key overwrites the previous value.
const graph = new StateGraph<AgentState>({
  channels: {
    filePath: null,
    rawText: null,
    result: null,
    confidence: null,
    needsReview: null,
  },
})
  .addNode("loadDocument", loadDocument)
  .addNode("extractFields", extractFields)
  .addNode("validateExtraction", validateExtraction)
  .addEdge(START, "loadDocument")
  .addEdge("loadDocument", "extractFields")
  .addEdge("extractFields", "validateExtraction")
  .addConditionalEdges(
    "validateExtraction",
    (state) => (state.needsReview ? "review" : "done"),
    {
      // Both branches terminate here; in production, point "review"
      // at a human-review node instead of END.
      review: END,
      done: END,
    }
  );

const app = graph.compile();
```
**Why this pattern works**
The graph is small on purpose. In regulated workflows you want every step visible:
- Load the document
- Extract structured data
- Validate against business rules
- Escalate if confidence is low
That makes it easier to audit than a single opaque prompt that tries to do everything at once.
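The escalation step does not have to end the run, either. A hypothetical review node could receive the `review` branch instead of END; the queue below is an in-memory stand-in for illustration, where a real deployment would use a database-backed work queue:

```typescript
type ReviewItem = { filePath: string; reason: string };

// In-memory stand-in for a clinical ops review queue.
const reviewQueue: ReviewItem[] = [];

// Hypothetical escalation node: instead of silently ending the run,
// push the document onto a queue a human reviewer works through.
async function humanReview(state: {
  filePath: string;
  confidence?: number;
}): Promise<{ needsReview: boolean }> {
  reviewQueue.push({
    filePath: state.filePath,
    reason: `low confidence: ${state.confidence ?? "unknown"}`,
  });
  return { needsReview: true };
}
```

To wire it in, register the node with `.addNode("humanReview", humanReview)` and change the conditional edge map to `{ review: "humanReview", done: END }`.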
### 3) Run the agent and persist outputs with audit metadata
You should persist both the extracted payload and metadata about how it was produced. For healthcare teams this is not optional; you need traceability for reviews, disputes, and compliance audits.
```typescript
async function main() {
  // Only filePath is required; the other state keys are filled in
  // by the graph nodes as the document moves through the pipeline.
  const finalState = await app.invoke({
    filePath: "/secure/inbox/referral-letter.pdf",
  });

  console.log({
    extracted: finalState.result,
    confidence: finalState.confidence,
    reviewRequired: finalState.needsReview,
  });
}

main();
```
In production, write that output to an append-only audit store with:
- document hash
- tenant/facility ID
- model name and version
- timestamp
- reviewer identity if human approval happens
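As a sketch of what that record could look like (the field names here are illustrative, not a standard), hashing the raw bytes with Node's built-in `crypto` module gives you a stable fingerprint of the exact input file:

```typescript
import { createHash } from "node:crypto";

type AuditRecord = {
  documentSha256: string;
  tenantId: string;
  modelName: string;
  modelVersion: string;
  timestamp: string;
  reviewerId?: string;
};

// Build the append-only audit entry for one extraction run. The hash
// lets reviewers prove which exact file produced a given output.
function buildAuditRecord(
  documentBytes: Buffer,
  tenantId: string,
  modelName: string,
  modelVersion: string,
  reviewerId?: string
): AuditRecord {
  return {
    documentSha256: createHash("sha256").update(documentBytes).digest("hex"),
    tenantId,
    modelName,
    modelVersion,
    timestamp: new Date().toISOString(),
    reviewerId,
  };
}
```

Because the hash is computed over the bytes rather than the filename, re-uploads and renamed copies of the same document map to the same fingerprint.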
## Production Considerations

- **Data residency.** You cannot casually send PHI across regions. Pin your OCR service, LLM endpoint, logs, and object storage to the same approved region as the covered entity’s policy requires.
- **Audit logging.** Log every state transition in the graph. Keep raw prompts out of general-purpose logs unless they are redacted; otherwise you create a second PHI problem while solving the first one.
- **Human-in-the-loop fallback.** Route low-confidence extractions to a clinical ops queue or HIM team. Use `needsReview` as a hard gate for claims submission or EHR write-back.
- **Guardrails.** Validate outputs against regexes and controlled vocabularies before persistence:
  - ICD-10 format checks
  - MRN format checks per facility
  - Date normalization to ISO-8601
  - Allowed document types only
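The ICD-10 check can start as a pure format test. The pattern below reflects the general ICD-10-CM code shape (a letter, a digit, an alphanumeric, then an optional dot plus one to four alphanumerics); it is a syntax filter only, not a lookup against the real code set, which a production guardrail should add:

```typescript
// Structural ICD-10-CM shape check: matches codes like "E11.9",
// "J45", or "S72.001A". Validates shape only; a real guardrail
// should also check membership in the current ICD-10-CM release.
const ICD10_PATTERN = /^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$/;

function isPlausibleIcd10(code: string): boolean {
  return ICD10_PATTERN.test(code.trim().toUpperCase());
}
```

A `diagnosisCodes` array from the extraction step can be filtered through this before persistence, with rejects routed to the review queue.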
## Common Pitfalls

- **Using free-form LLM output without schema enforcement.** If you skip Zod parsing or similar validation, bad JSON will leak into claims or EHR workflows. Always parse into a strict schema before anything downstream consumes it.
- **Treating OCR text as ground truth.** Scanned documents are noisy. Preserve page references and confidence scores so reviewers can verify uncertain fields quickly instead of re-reading entire packets.
- **Ignoring PHI boundaries in logs and telemetry.** Debug logs often become shadow data stores. Redact names, MRNs, DOBs, addresses, and free-text clinical notes before they hit observability tooling.
- **Building one giant prompt instead of explicit graph steps.** A monolithic prompt is harder to test and harder to certify operationally. Keep extraction logic split across nodes so each stage can be measured and swapped independently.
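For the logging pitfall, even a crude regex pass in front of the logger catches the most obvious identifiers. The patterns below are illustrative only; a production system needs a vetted de-identification step, not three regexes:

```typescript
// Replace obvious PHI patterns before text reaches log sinks.
const PHI_PATTERNS: Array<[RegExp, string]> = [
  [/\b\d{3}-\d{2}-\d{4}\b/g, "[REDACTED_SSN]"],                 // SSN-shaped
  [/\b\d{1,2}[\/-]\d{1,2}[\/-]\d{2,4}\b/g, "[REDACTED_DATE]"],  // dates/DOBs
  [/\bMRN[:\s#]*\d+/gi, "[REDACTED_MRN]"],                      // "MRN: 12345" style
];

function redactForLogging(text: string): string {
  return PHI_PATTERNS.reduce(
    (acc, [pattern, label]) => acc.replace(pattern, label),
    text
  );
}
```

Wrapping your logger so every message passes through `redactForLogging` makes redaction the default rather than something each call site must remember.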
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.