How to Build a Document Extraction Agent Using CrewAI in TypeScript for Healthcare
A document extraction agent for healthcare takes unstructured clinical documents — referrals, discharge summaries, lab reports, prior auth forms — and turns them into structured data your systems can trust. That matters because downstream workflows like triage, coding, claims processing, and care coordination depend on accurate extraction, and because healthcare adds compliance, auditability, and data residency constraints that you cannot bolt on later.
Architecture
- Document intake layer
  - Accept PDFs, scanned images, or text files from secure storage or an internal upload service.
  - Normalize file paths and metadata before the agent touches them.
- Extraction crew
  - A Crew with one or more Agents focused on document parsing.
  - One agent can extract fields, another can validate against a schema or clinical rules.
- Tooling layer
  - Use CrewAI Tools for OCR, PDF text extraction, PHI redaction checks, and schema validation.
  - Keep tools deterministic where possible; don’t ask the LLM to do everything.
- Structured output contract
  - Define a strict JSON shape for fields like patient name, DOB, provider, diagnosis codes, dates, medications.
  - Enforce output validation before the result is stored or sent downstream.
- Audit and governance layer
  - Log every input document ID, model version, tool call, extracted fields, confidence score, and reviewer status.
  - Store artifacts in a region approved for healthcare workloads.
Implementation
1) Install dependencies and set up the project
For TypeScript projects using CrewAI’s Node SDK pattern, keep the runtime simple: document parsing libraries plus the CrewAI package.
npm install @crewai/crewai zod pdf-parse
npm install -D typescript ts-node @types/node
Create a strict tsconfig.json and run this in an environment where healthcare data is allowed to reside. If you are handling PHI/PII, make sure your storage bucket, logs, and model endpoint are all in the correct region.
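A minimal strict tsconfig.json for this setup might look like the following; the compiler options are a reasonable baseline, not the only valid choice:

```json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "commonjs",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "outDir": "dist"
  },
  "include": ["src"]
}
```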
2) Define the extraction schema and helper tools
You want the agent to emit a constrained payload. In healthcare, free-form JSON is how bad data gets into claims and EHR pipelines.
import { z } from "zod";
import fs from "node:fs/promises";
import pdfParse from "pdf-parse";
export const ClinicalDocumentSchema = z.object({
patientName: z.string().optional(),
dateOfBirth: z.string().optional(),
mrn: z.string().optional(),
documentType: z.string(),
encounterDate: z.string().optional(),
providerName: z.string().optional(),
diagnosisCodes: z.array(z.string()).default([]),
medications: z.array(z.string()).default([]),
allergies: z.array(z.string()).default([]),
summary: z.string(),
});
export type ClinicalDocument = z.infer<typeof ClinicalDocumentSchema>;
export async function extractTextFromPdf(filePath: string): Promise<string> {
const buffer = await fs.readFile(filePath);
const parsed = await pdfParse(buffer);
return parsed.text;
}
This schema is intentionally narrow. If your source document does not contain a field reliably, leave it optional instead of inventing values.
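As a concrete target, a discharge summary might produce a payload like this (all values are illustrative, not real patient data):

```json
{
  "documentType": "discharge_summary",
  "patientName": "Jane Example",
  "dateOfBirth": "1962-03-14",
  "encounterDate": "2024-08-02",
  "providerName": "Dr. A. Smith",
  "diagnosisCodes": ["J18.9"],
  "medications": ["amoxicillin 500 mg"],
  "allergies": [],
  "summary": "Admitted for community-acquired pneumonia; discharged stable."
}
```

Note that optional fields like mrn are simply omitted when the source document does not contain them.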
3) Build a CrewAI agent that extracts structured clinical data
The core pattern is one extraction agent with explicit instructions and a task that requires structured output. Use Agent, Task, and Crew directly.
import { Agent, Task, Crew } from "@crewai/crewai";
import { ClinicalDocumentSchema } from "./schema";
import { extractTextFromPdf } from "./tools";
async function main() {
const text = await extractTextFromPdf("./input/discharge-summary.pdf");
const extractor = new Agent({
role: "Clinical Document Extraction Specialist",
goal: "Extract accurate structured data from healthcare documents without fabricating missing fields.",
backstory:
"You process clinical documents for downstream administrative and care coordination workflows. You must preserve fidelity to source text.",
verbose: true,
allowDelegation: false,
// If your SDK version supports tools here:
// tools: [ ... ]
});
const task = new Task({
description: `
Extract structured fields from this clinical document.
Rules:
- Return only values supported by the source text.
- Do not infer diagnoses or medications not explicitly present.
- Preserve dates exactly as written when possible.
- If a field is missing, omit it or return null/empty array per schema.
    - Output must match this JSON shape (omit optional fields that are absent):
      { "patientName"?: string, "dateOfBirth"?: string, "mrn"?: string, "documentType": string, "encounterDate"?: string, "providerName"?: string, "diagnosisCodes": string[], "medications": string[], "allergies": string[], "summary": string }
Source document:
${text}
`,
expectedOutput: "Valid JSON matching the clinical document schema.",
agent: extractor,
    outputJson: true,
    // Schema-enforced output options (e.g. outputPydantic) are Python-specific;
    // in TypeScript, validate the result with Zod after kickoff instead (see step 4).
});
const crew = new Crew({
agents: [extractor],
tasks: [task],
verbose: true,
process: "sequential",
});
const result = await crew.kickoff();
console.log(result);
}
main().catch(console.error);
The important part here is not just calling an LLM. It is constraining the output contract so your pipeline can reject malformed results before they hit production systems.
4) Validate results before persistence
Never write raw model output directly to your database or integration bus. Validate first, then persist with trace metadata for auditability.
import { ClinicalDocumentSchema } from "./schema";
export function validateExtraction(rawResult: unknown) {
const parsed = ClinicalDocumentSchema.safeParse(rawResult);
if (!parsed.success) {
throw new Error(
`Extraction validation failed: ${JSON.stringify(parsed.error.flatten())}`
);
}
return parsed.data;
}
In practice you should also attach:
- source document ID
- hash of the input file
- timestamp
- model/provider name
- reviewer override flag if human review was required
That gives you an audit trail when compliance asks how a field got into the system.
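A small envelope type can carry that trace metadata alongside the validated payload. A minimal sketch — the field names and buildRecord helper are illustrative, not part of any CrewAI API:

```typescript
import { createHash } from "node:crypto";

// Hypothetical audit envelope persisted alongside each extraction result.
interface ExtractionRecord<T> {
  documentId: string;
  inputSha256: string; // hash of the raw input file for tamper-evidence
  extractedAt: string; // ISO timestamp
  modelProvider: string; // model/provider identifier used for this run
  reviewerOverride: boolean; // true if a human corrected the output
  payload: T;
}

export function buildRecord<T>(
  documentId: string,
  inputFile: Buffer,
  modelProvider: string,
  payload: T,
  reviewerOverride = false
): ExtractionRecord<T> {
  return {
    documentId,
    inputSha256: createHash("sha256").update(inputFile).digest("hex"),
    extractedAt: new Date().toISOString(),
    modelProvider,
    reviewerOverride,
    payload,
  };
}
```

Persist this record, not the bare model output, so every stored field traces back to a document, a model version, and a review status.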
Production Considerations
- Data residency
  - Keep PHI inside approved regions only.
  - Verify every dependency in the chain — OCR service, LLM endpoint, vector store if used — complies with your residency requirements.
- Audit logging
  - Log input document IDs and extraction outputs separately from raw content.
  - Redact logs by default; never dump full clinical text into application logs.
- Human-in-the-loop review
  - Route low-confidence or high-risk documents to manual review.
  - This is mandatory for edge cases like poor scans, handwritten notes, or conflicting identifiers.
- Guardrails
  - Block unsupported fields such as diagnosis inference when not present in source text.
  - Add post-processing checks for MRN format, DOB plausibility, ICD code patterns, and date consistency across pages.
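Those post-processing checks can be plain deterministic code rather than LLM calls. A minimal sketch, assuming ICD-10-style codes and parseable date strings — the patterns and age bounds are illustrative and should be tuned to your data:

```typescript
// Illustrative guardrails; not a complete clinical validation library.

// ICD-10 codes: letter (U excluded), digit, alphanumeric, optional dot suffix.
const ICD10_PATTERN = /^[A-TV-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$/;

export function isPlausibleIcd10(code: string): boolean {
  return ICD10_PATTERN.test(code.trim().toUpperCase());
}

// DOB plausibility: parseable, not in the future, age within human bounds.
export function isPlausibleDob(dob: string, now: Date = new Date()): boolean {
  const parsed = new Date(dob);
  if (Number.isNaN(parsed.getTime())) return false;
  const ageYears =
    (now.getTime() - parsed.getTime()) / (365.25 * 24 * 3600 * 1000);
  return ageYears >= 0 && ageYears <= 120;
}
```

Run checks like these between validation and persistence so a syntactically valid but clinically implausible payload still gets flagged.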
Common Pitfalls
- Letting the model infer missing clinical facts
  - Avoid this by hardening prompts and validating against a strict schema.
  - If the source says “rule out pneumonia,” do not extract “pneumonia” as a confirmed diagnosis.
- Skipping OCR quality checks
  - Bad scans produce bad extractions.
  - Add image quality thresholds and fall back to manual review when OCR confidence drops below your accepted floor.
- Treating healthcare documents like generic PDFs
  - A discharge summary is not just text; it carries regulatory risk and downstream operational impact.
  - Build separate handling for encounter notes, lab reports, referrals, prior auths, and claims attachments because each has different field expectations and compliance sensitivity.
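The OCR-confidence floor and per-document-type handling described above can be sketched as a small deterministic router; the threshold and the high-risk type names are assumptions, not fixed values:

```typescript
// Illustrative routing rule: low-confidence OCR or high-risk document types
// go to manual review instead of straight-through processing.
const OCR_CONFIDENCE_FLOOR = 0.85; // assumed floor; tune per document class
const HIGH_RISK_TYPES = new Set(["prior_auth", "handwritten_note"]);

export function routeDocument(
  ocrConfidence: number,
  documentType: string
): "auto" | "manual_review" {
  if (ocrConfidence < OCR_CONFIDENCE_FLOOR) return "manual_review";
  if (HIGH_RISK_TYPES.has(documentType)) return "manual_review";
  return "auto";
}
```

Keeping this rule outside the LLM means the review policy is auditable and testable on its own.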
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.