How to Build a Document Extraction Agent Using LangGraph in TypeScript for Healthcare

By Cyprian Aarons · Updated 2026-04-21
document-extraction · langgraph · typescript · healthcare

A document extraction agent for healthcare takes unstructured files like referral letters, discharge summaries, lab reports, and prior authorization forms, then turns them into structured JSON your downstream systems can use. That matters because healthcare operations still run on documents, and the difference between a usable record and a missed field is often a delayed claim, a broken care workflow, or a compliance issue.

Architecture

A production-grade healthcare extraction agent needs these pieces:

  • Ingestion layer

    • Accepts PDFs, scans, DOCX, and images from secure storage or an internal upload service.
    • Normalizes file metadata like patient ID, source system, facility, and retention policy.
  • OCR / text extraction layer

    • Converts scanned documents into text.
    • Preserves page numbers and bounding boxes when possible for auditability.
  • LangGraph orchestration

    • Routes the document through extraction, validation, and escalation steps.
    • Keeps the workflow deterministic enough for regulated environments.
  • Structured extraction model

    • Produces strict JSON matching a healthcare schema.
    • Extracts entities such as patient name, DOB, MRN, diagnosis codes, medications, dates of service, and provider details.
  • Validation and compliance layer

    • Checks required fields, date formats, ICD-10/CPT patterns, and PHI handling rules.
    • Flags low-confidence outputs for human review.
  • Persistence and audit trail

    • Stores raw input references, extracted output, confidence scores, model version, and reviewer actions.
    • Supports data residency constraints by keeping records in-region.
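The ingestion layer's normalization step can be sketched as a plain metadata type. The field names below are illustrative assumptions for this article, not a fixed standard:

```typescript
// Hypothetical shape for normalized ingestion metadata.
type IngestedDocument = {
  documentId: string;      // stable ID assigned at upload time
  patientId: string;       // source-system patient identifier
  sourceSystem: string;    // e.g. the referring EHR or fax gateway
  facility: string;        // facility code for residency/routing rules
  retentionPolicy: string; // e.g. "7y", per the facility's policy
  mimeType: string;        // "application/pdf", "image/tiff", ...
  storageUri: string;      // pointer into secure object storage
};

// Fill safe defaults for optional upstream fields so downstream nodes
// never see missing metadata.
function normalizeMetadata(
  raw: Partial<IngestedDocument> & { documentId: string; storageUri: string }
): IngestedDocument {
  return {
    documentId: raw.documentId,
    patientId: raw.patientId ?? "unknown",
    sourceSystem: raw.sourceSystem ?? "manual-upload",
    facility: raw.facility ?? "default",
    retentionPolicy: raw.retentionPolicy ?? "default",
    mimeType: raw.mimeType ?? "application/pdf",
    storageUri: raw.storageUri,
  };
}
```

Keeping normalization in one place means the graph nodes can assume every document arrives with complete metadata.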

Implementation

1) Define the extraction state and schema

Start with a strict state object. In healthcare you want typed outputs because downstream systems should not guess what dob means or whether mrn is missing.

import { z } from "zod";

export const ExtractionSchema = z.object({
  patientName: z.string().optional(),
  dateOfBirth: z.string().optional(),
  mrn: z.string().optional(),
  encounterDate: z.string().optional(),
  diagnosisCodes: z.array(z.string()).default([]),
  medications: z.array(z.string()).default([]),
  providerName: z.string().optional(),
  sourceDocumentType: z.string().optional(),
});

export type ExtractionResult = z.infer<typeof ExtractionSchema>;

export type AgentState = {
  filePath: string;
  rawText?: string;
  result?: ExtractionResult;
  confidence?: number;
  needsReview?: boolean;
};

2) Build nodes for OCR/text loading, extraction, and validation

Use LangGraph’s StateGraph to wire the workflow. The pattern below uses actual LangGraph APIs: StateGraph, .addNode(), .addEdge(), .addConditionalEdges(), and .compile().

import { StateGraph, START, END } from "@langchain/langgraph";
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";
import { zodToJsonSchema } from "zod-to-json-schema";
import { ExtractionSchema, type ExtractionResult, type AgentState } from "./schema.js";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

async function loadDocument(state: AgentState): Promise<Partial<AgentState>> {
  // Replace with OCR/PDF parsing in production.
  const rawText = await Bun.file(state.filePath).text();
  return { rawText };
}

async function extractFields(state: AgentState): Promise<Partial<AgentState>> {
  const prompt = `
Extract structured healthcare data from this document.
Return only valid JSON matching this schema:
${JSON.stringify(zodToJsonSchema(ExtractionSchema), null, 2)}

Document:
${state.rawText}
`;

  const response = await llm.invoke([new HumanMessage(prompt)]);
  // Models sometimes wrap JSON in prose or fences; take the outermost object.
  const text = String(response.content).trim();
  const parsed = ExtractionSchema.parse(
    JSON.parse(text.slice(text.indexOf("{"), text.lastIndexOf("}") + 1))
  );

  return {
    result: parsed,
    confidence: estimateConfidence(parsed),
    needsReview: false,
  };
}

async function validateExtraction(state: AgentState): Promise<Partial<AgentState>> {
  const missingCritical =
    !state.result?.patientName ||
    !state.result?.dateOfBirth ||
    !state.result?.mrn;

  return {
    needsReview: Boolean(missingCritical || (state.confidence ?? 0) < 0.85),
  };
}

function estimateConfidence(result: ExtractionResult): number {
  const fields = [
    result.patientName,
    result.dateOfBirth,
    result.mrn,
    result.encounterDate,
    result.providerName,
  ];
  const filled = fields.filter(Boolean).length;
  return filled / fields.length;
}

Continue with the graph wiring:

const graph = new StateGraph<AgentState>({
  channels: {
    filePath: null,
    rawText: null,
    result: null,
    confidence: null,
    needsReview: null,
  },
})
  .addNode("loadDocument", loadDocument)
  .addNode("extractFields", extractFields)
  .addNode("validateExtraction", validateExtraction)
  .addEdge(START, "loadDocument")
  .addEdge("loadDocument", "extractFields")
  .addEdge("extractFields", "validateExtraction")
  .addConditionalEdges(
    "validateExtraction",
    (state) => (state.needsReview ? "review" : "done"),
    {
      review: END,
      done: END,
    }
  );

const app = graph.compile();

Why this pattern works

The graph is small on purpose. In regulated workflows you want every step visible:

  • Load the document
  • Extract structured data
  • Validate against business rules
  • Escalate if confidence is low

That makes it easier to audit than a single opaque prompt that tries to do everything at once.

3) Run the agent and persist outputs with audit metadata

You should persist both the extracted payload and metadata about how it was produced. For healthcare teams this is not optional; you need traceability for reviews, disputes, and compliance audits.

async function main() {
  const finalState = await app.invoke({
    filePath: "/secure/inbox/referral-letter.pdf",
    needsReview: false,
  });

  console.log({
    extracted: finalState.result,
    confidence: finalState.confidence,
    reviewRequired: finalState.needsReview,
  });
}

main();

In production, write that output to an append-only audit store with:

  • document hash
  • tenant/facility ID
  • model name and version
  • timestamp
  • reviewer identity if human approval happens
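One sketch of such a record, with illustrative field names rather than a prescribed schema:

```typescript
import { createHash } from "node:crypto";

// Illustrative audit record; field names are assumptions, not a standard.
type AuditRecord = {
  documentHash: string;   // sha256 of the raw input bytes
  tenantId: string;       // tenant or facility identifier
  modelName: string;
  modelVersion: string;
  timestamp: string;      // ISO-8601
  reviewerId?: string;    // present only after human approval
};

function buildAuditRecord(
  fileBytes: Buffer,
  tenantId: string,
  modelName: string,
  modelVersion: string,
  reviewerId?: string
): AuditRecord {
  return {
    documentHash: createHash("sha256").update(fileBytes).digest("hex"),
    tenantId,
    modelName,
    modelVersion,
    timestamp: new Date().toISOString(),
    ...(reviewerId ? { reviewerId } : {}),
  };
}
```

Hashing the raw bytes rather than storing them lets you prove which document produced an extraction without duplicating PHI into the audit store.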

Production Considerations

  • Data residency

You cannot casually send PHI across regions. Pin your OCR service, LLM endpoint, logs, and object storage to the same approved region as the covered entity’s policy requires.

  • Audit logging

Log every state transition in the graph. Keep raw prompts out of general-purpose logs unless they are redacted; otherwise you create a second PHI problem while solving the first one.

  • Human-in-the-loop fallback

Route low-confidence extractions to a clinical ops queue or HIM team. Use needsReview as a hard gate for claims submission or EHR write-back.
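A minimal sketch of that gate, with hypothetical names:

```typescript
type GateInput = {
  needsReview: boolean;     // set by validateExtraction
  reviewerApproved: boolean; // set by the clinical ops / HIM queue
};

// needsReview is a hard gate: a flagged extraction can only proceed to
// claims submission or EHR write-back after explicit human approval.
function canWriteBack(input: GateInput): boolean {
  if (!input.needsReview) return true;
  return input.reviewerApproved;
}
```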

  • Guardrails

Validate outputs against regexes and controlled vocabularies before persistence:

  • ICD-10 format checks
  • MRN format checks per facility
  • Date normalization to ISO-8601
  • Allowed document types only
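A sketch of two of those checks. The ICD-10 pattern below is a simplified shape check, not a substitute for validating against the official ICD-10-CM code list, and the date normalizer only handles US-style MM/DD/YYYY input:

```typescript
// Simplified shape check; production systems should also validate codes
// against the official ICD-10-CM list.
const ICD10_PATTERN = /^[A-Z]\d{2}(?:\.\w{1,4})?$/;

function isValidIcd10Shape(code: string): boolean {
  return ICD10_PATTERN.test(code.toUpperCase());
}

// Normalize US-style MM/DD/YYYY dates to ISO-8601; returns null when the
// input cannot be parsed safely, so callers must handle that case.
function normalizeDate(input: string): string | null {
  const trimmed = input.trim();
  if (/^\d{4}-\d{2}-\d{2}$/.test(trimmed)) return trimmed; // already ISO
  const m = trimmed.match(/^(\d{1,2})\/(\d{1,2})\/(\d{4})$/);
  if (!m) return null;
  const [, mm, dd, yyyy] = m;
  return `${yyyy}-${mm.padStart(2, "0")}-${dd.padStart(2, "0")}`;
}
```

Returning null instead of a best-guess date forces ambiguous values into the review queue rather than into a claim.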

Common Pitfalls

  1. Using free-form LLM output without schema enforcement

    If you skip Zod parsing or similar validation, bad JSON will leak into claims or EHR workflows. Always parse into a strict schema before anything downstream consumes it.

  2. Treating OCR text as ground truth

    Scanned documents are noisy. Preserve page references and confidence scores so reviewers can verify uncertain fields quickly instead of re-reading entire packets.

  3. Ignoring PHI boundaries in logs and telemetry

    Debug logs often become shadow data stores. Redact names, MRNs, DOBs, addresses, and free-text clinical notes before they hit observability tooling.
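A minimal redaction pass might look like this; the patterns are illustrative examples and will not catch every PHI format, so treat them as a starting point:

```typescript
// Ordered redaction rules applied to any string before it reaches
// observability tooling. Patterns are examples, not an exhaustive set.
const REDACTION_RULES: Array<[RegExp, string]> = [
  [/\b\d{2}\/\d{2}\/\d{4}\b/g, "[DOB]"],  // US-style dates
  [/\bMRN[:\s]*\d{6,10}\b/gi, "[MRN]"],   // labeled MRNs
  [/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]"],    // SSN-shaped numbers
];

function redactForLogs(message: string): string {
  return REDACTION_RULES.reduce(
    (text, [pattern, replacement]) => text.replace(pattern, replacement),
    message
  );
}
```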

  4. Building one giant prompt instead of explicit graph steps

    A monolithic prompt is harder to test and harder to certify operationally. Keep extraction logic split across nodes so each stage can be measured and swapped independently.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.
