How to Build a Document Extraction Agent Using LlamaIndex in TypeScript for Healthcare

By Cyprian Aarons · Updated 2026-04-21
Tags: document-extraction, llamaindex, typescript, healthcare

A document extraction agent for healthcare takes unstructured files like referrals, lab reports, discharge summaries, and prior authorizations and turns them into structured data you can route into downstream systems. That matters because healthcare ops runs on documents, and the difference between a usable record and a missed field is often a delayed claim, an incorrect triage decision, or a compliance issue.

Architecture

  • Document ingestion layer

    • Pull PDFs, scanned images, and text files from S3, SharePoint, EHR exports, or secure uploads.
    • Normalize file metadata: patient ID, encounter ID, source system, timestamp.
  • Text extraction layer

    • Use OCR for scanned PDFs before sending content to the LLM.
    • Preserve page boundaries and source offsets for auditability.
  • LlamaIndex extraction pipeline

    • Use Document, SentenceSplitter, and an LLM-backed extractor.
    • Convert raw text into structured entities like diagnosis, medications, dates, provider names, and CPT/ICD codes.
  • Validation and guardrails

    • Enforce schema checks with Zod or similar validation.
    • Reject incomplete or low-confidence outputs instead of auto-writing bad data into clinical systems.
  • Persistence and audit store

    • Store extracted JSON plus source citations.
    • Keep immutable logs for compliance review and incident investigation.
  • Integration layer

    • Push validated output to FHIR resources, claims workflows, case management tools, or human review queues.
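
The layers above can be sketched as a set of stage contracts. The names and types here are illustrative, not LlamaIndex APIs — they just make the data flow between layers explicit:

```typescript
// Illustrative stage contracts for the architecture above (all names hypothetical).

interface IngestedFile {
  patientId: string;
  encounterId: string;
  sourceSystem: string;
  receivedAt: string; // ISO timestamp
  bytes: Uint8Array;
}

interface ExtractedText {
  file: IngestedFile;
  pages: { pageNumber: number; text: string }[]; // preserve page boundaries
}

interface ExtractionResult {
  file: IngestedFile;
  fields: Record<string, unknown>;
  citations: { field: string; page: number; offset: number }[];
}

type ValidationOutcome =
  | { kind: "accepted"; result: ExtractionResult }
  | { kind: "rejected"; reason: string };

// Each layer is a function; the agent is their composition.
type Validate = (result: ExtractionResult) => ValidationOutcome;

// Example validator: reject empty extractions instead of passing them downstream.
const rejectEmpty: Validate = (result) =>
  Object.keys(result.fields).length === 0
    ? { kind: "rejected", reason: "no fields extracted" }
    : { kind: "accepted", result };
```

Keeping each layer behind a narrow function type like this makes it straightforward to swap the local file reader for a secure ingestion service later without touching the extraction or validation code.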

Implementation

1) Install dependencies and define the extraction schema

For healthcare work, start with a strict schema. You want the model to extract only what you need and nothing else.

npm install llamaindex zod

import { z } from "zod";

export const ClinicalDocSchema = z.object({
  patientName: z.string().optional(),
  dateOfBirth: z.string().optional(),
  encounterDate: z.string().optional(),
  providerName: z.string().optional(),
  diagnosis: z.array(z.string()).default([]),
  medications: z.array(z.string()).default([]),
  allergies: z.array(z.string()).default([]),
  followUpInstructions: z.string().optional(),
});

export type ClinicalDoc = z.infer<typeof ClinicalDocSchema>;

This schema is intentionally conservative. In healthcare extraction pipelines, missing optional fields are better than hallucinated values.

2) Load documents with LlamaIndex

Use SimpleDirectoryReader for local development. In production you usually replace this with a secure ingestion service that pulls from encrypted storage.

import { SimpleDirectoryReader } from "llamaindex";

async function loadDocuments() {
  const reader = new SimpleDirectoryReader();
  const documents = await reader.loadData({ directoryPath: "./healthcare-docs" });
  return documents;
}

If your input is scanned PDFs, do OCR first. LlamaIndex can index text; it does not magically solve image-to-text extraction by itself.
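
One cheap pre-check before indexing: if a PDF's extracted text layer is nearly empty, it is probably a scan and should be routed through OCR first. A heuristic sketch — the threshold is arbitrary and should be tuned against your own corpus:

```typescript
// Heuristic: route documents with too little machine-readable text to OCR.
// 200 chars/page is an illustrative threshold, not a recommendation.
const MIN_CHARS_PER_PAGE = 200;

function needsOcr(extractedText: string, pageCount: number): boolean {
  const meaningfulChars = extractedText.replace(/\s+/g, "").length;
  return meaningfulChars / Math.max(pageCount, 1) < MIN_CHARS_PER_PAGE;
}
```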

3) Extract structured data with an LLM-backed query engine

A practical pattern in TypeScript is to create a VectorStoreIndex, then query it with an explicit extraction prompt. The key is to force the model into a JSON-shaped answer that matches your schema.

import {
  Document,
  VectorStoreIndex,
} from "llamaindex";
import { ClinicalDocSchema } from "./schema";

async function extractClinicalData(rawText: string) {
  const doc = new Document({ text: rawText });

  const index = await VectorStoreIndex.fromDocuments([doc]);
  const queryEngine = index.asQueryEngine();

  const response = await queryEngine.query({
    query:
      `Extract the following fields from this healthcare document:
      patientName, dateOfBirth, encounterDate, providerName,
      diagnosis[], medications[], allergies[], followUpInstructions.
      Return valid JSON only. If a field is missing, omit it.`,
  });

  const parsed = JSON.parse(response.response);
  return ClinicalDocSchema.parse(parsed);
}

That pattern works because LlamaIndex handles retrieval and context assembly while your schema enforces output shape. In real deployments you should also attach source citations per field if your workflow needs audit traceability.
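
One caveat with `JSON.parse(response.response)`: models sometimes wrap JSON in markdown fences or surround it with prose, which makes a bare `JSON.parse` throw. A defensive parser that pulls out the first JSON object is a cheap guard (a sketch, not a LlamaIndex API):

```typescript
// Strip markdown fences and surrounding prose before JSON.parse.
// Throws when no JSON object is found, so callers can route to manual review.
function parseModelJson(raw: string): unknown {
  const unfenced = raw.replace(/```(?:json)?/g, "").trim();
  const start = unfenced.indexOf("{");
  const end = unfenced.lastIndexOf("}");
  if (start === -1 || end === -1 || end < start) {
    throw new Error("No JSON object found in model response");
  }
  return JSON.parse(unfenced.slice(start, end + 1));
}
```

Failures here should count against the run like any other validation error, not be retried silently.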

4) Wire the agent into a controlled processing flow

Do not let the extractor write directly to your EHR or claims system. Put validation and human review in the middle.

import { readFileSync } from "node:fs";

async function main() {
  const rawText = readFileSync("./healthcare-docs/discharge-summary.txt", "utf-8");
  const extracted = await extractClinicalData(rawText);

  console.log("Validated extraction:", extracted);

  // Example next step:
  // if confidence low or critical fields missing -> send to manual review queue
}

main().catch((err) => {
  console.error(err);
});

The production version should add:

  • confidence thresholds
  • redaction before logging
  • PHI-safe transport
  • immutable audit events for every request and response
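
The review-queue decision mentioned in the comment above can be made explicit. A sketch with a hypothetical critical-field list; the field names mirror the schema from step 1:

```typescript
// Hypothetical gate: auto-process only when every critical field is present.
const CRITICAL_FIELDS = ["patientName", "dateOfBirth", "encounterDate"] as const;

type Routing = "auto" | "manual-review";

function routeExtraction(extracted: Record<string, unknown>): Routing {
  const missing = CRITICAL_FIELDS.filter(
    (field) => extracted[field] === undefined || extracted[field] === ""
  );
  return missing.length === 0 ? "auto" : "manual-review";
}
```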

Production Considerations

  • Compliance

    • Treat every input as PHI until proven otherwise.
    • Encrypt data in transit and at rest.
    • Restrict access by tenant, environment, and role.
    • Make sure your vendor contracts cover HIPAA responsibilities if you process protected health information.
  • Auditability

    • Persist the original document hash, extracted JSON, model version, prompt version, and timestamp.
    • Store field-level provenance where possible so reviewers can trace each value back to source text.
    • Keep an immutable event log for every extraction run.
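
An audit record like the one described can be built with Node's crypto module; the record shape here is illustrative:

```typescript
import { createHash } from "node:crypto";

// Illustrative audit record: content hash of the source, plus versions and time.
interface AuditRecord {
  documentSha256: string;
  modelVersion: string;
  promptVersion: string;
  extractedJson: string;
  timestamp: string;
}

function buildAuditRecord(
  documentBytes: Buffer,
  extracted: unknown,
  modelVersion: string,
  promptVersion: string
): AuditRecord {
  return {
    documentSha256: createHash("sha256").update(documentBytes).digest("hex"),
    modelVersion,
    promptVersion,
    extractedJson: JSON.stringify(extracted),
    timestamp: new Date().toISOString(),
  };
}
```

Hashing the original bytes (rather than storing them in the log) lets reviewers prove which document a run processed without the log itself becoming a PHI store.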
  • Data residency

    • Keep processing inside approved regions if your organization has residency requirements.
    • Avoid sending PHI across borders through third-party observability tools or unmanaged SaaS logs.
    • Pin vector stores and object storage to the same region as your workload when policy requires it.
  • Guardrails

    • Hallucinated clinical values → strict schema validation plus rejecting unknown fields
    • Logging PHI → redaction middleware plus no raw document logs
    • Bad downstream writes → human approval for high-risk fields
    • Model drift → versioned prompts and regression tests against golden documents

Common Pitfalls

  1. Using free-form prompts without schema validation

    • Problem: the model returns plausible but wrong values.
    • Fix: parse output with Zod or another validator before persisting anything.
  2. Skipping OCR quality checks

    • Problem: garbage text in means garbage extraction out.
    • Fix: measure OCR confidence and route low-quality scans to manual review before indexing.
  3. Treating audit logs like debug logs

    • Problem: developers accidentally dump PHI into application logs.
    • Fix: log identifiers and hashes only; never log raw document content unless it is explicitly approved and protected.
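
For pitfall 3, a log helper that only ever emits identifiers and hashes keeps PHI out of application logs by construction. A sketch — the redaction patterns are illustrative examples, not a complete PHI filter, and real redaction needs a vetted library and review:

```typescript
import { createHash } from "node:crypto";

// Log-safe summary: never the raw document text, only a hash and a length.
function logSafeSummary(documentId: string, rawText: string) {
  return {
    documentId,
    contentSha256: createHash("sha256").update(rawText).digest("hex"),
    contentLength: rawText.length,
  };
}

// Illustrative redaction for free-text log messages (SSN-like and date-like
// patterns only). Do not treat this as sufficient PHI redaction.
function redact(message: string): string {
  return message
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[REDACTED-SSN]")
    .replace(/\b\d{4}-\d{2}-\d{2}\b/g, "[REDACTED-DATE]");
}
```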

A healthcare document extraction agent is not just an NLP wrapper. It is a controlled data pipeline that has to be accurate enough for operations, strict enough for compliance, and observable enough for audits. Build it like something that will be reviewed by security, legal, and clinical operations — because it will be.


By Cyprian Aarons, AI Consultant at Topiax.