How to Build a Document Extraction Agent Using AutoGen in TypeScript for Healthcare
A document extraction agent in healthcare takes unstructured clinical documents — PDFs, scanned forms, discharge summaries, prior auth letters — and turns them into structured data your systems can use. That matters because downstream workflows like claims intake, care coordination, and chart review depend on accurate extraction, and manual review is slow, expensive, and error-prone.
Architecture
- Document ingestion layer
  - Accepts PDFs, images, or text from secure storage or a controlled upload service.
  - Normalizes file metadata such as patient ID, document type, source system, and retention policy.
- OCR / text extraction layer
  - Converts scanned documents into text before the LLM sees them.
  - For healthcare, this should preserve page boundaries and confidence scores for auditability.
- AutoGen extraction agent
  - Uses AssistantAgent to convert raw text into a strict JSON schema.
  - Should be constrained to extract only approved fields such as member name, DOB, diagnosis codes, procedure codes, dates of service, and provider details.
- Validation and policy layer
  - Verifies schema shape, field formats, and business rules before data leaves the pipeline.
  - Rejects or routes low-confidence extractions to human review.
- Audit and storage layer
  - Stores prompts, model outputs, validation results, and document hashes.
  - Keeps PHI handling aligned with compliance requirements and internal audit trails.
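The OCR layer's contract can be sketched as a typed page structure. This is a minimal sketch, and the field names (`OcrPage`, `joinPages`, `minConfidence`) are my own illustrations, not from any specific OCR API:

```typescript
// Sketch of an OCR output shape that preserves page boundaries and per-page
// confidence scores, as the architecture above recommends for auditability.
type OcrPage = { pageNumber: number; text: string; confidence: number };

// Join pages into one payload for the agent while keeping page markers,
// and surface the weakest page confidence for routing decisions.
export function joinPages(pages: OcrPage[]): { text: string; minConfidence: number } {
  const text = pages
    .map((p) => `--- page ${p.pageNumber} ---\n${p.text}`)
    .join("\n");
  const minConfidence = pages.length
    ? Math.min(...pages.map((p) => p.confidence))
    : 0;
  return { text, minConfidence };
}
```

Downstream, a low `minConfidence` is one signal the validation layer can use to route a document straight to human review.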
Implementation
1) Install AutoGen for TypeScript and define the extraction contract
Use the TypeScript AutoGen package that exposes AssistantAgent and OpenAIChatCompletionClient. Keep the output schema narrow. In healthcare, broad schemas create garbage data fast.
npm install @autogenai/autogen openai zod
import { z } from "zod";
export const ExtractionSchema = z.object({
  patientName: z.string().min(1),
  dateOfBirth: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),
  documentType: z.enum(["discharge_summary", "prior_auth", "lab_result", "referral", "other"]),
  encounterDate: z.string().regex(/^\d{4}-\d{2}-\d{2}$/).optional(),
  providerName: z.string().optional(),
  diagnosisCodes: z.array(z.string()).default([]),
  procedureCodes: z.array(z.string()).default([]),
});
export type ExtractionResult = z.infer<typeof ExtractionSchema>;
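The schema above accepts any string as a diagnosis or procedure code. If you want to tighten it, rough format checks for ICD-10-CM and CPT codes can be expressed as plain regexes and plugged into the Zod fields via `.refine`. These patterns are a sketch of the general code shapes, not authoritative validators:

```typescript
// Rough shape check for ICD-10-CM codes (e.g. "E11.9"): a letter (not U),
// two alphanumerics, then an optional dot and up to four more characters.
// This checks format only, not whether the code actually exists.
export function looksLikeIcd10(code: string): boolean {
  return /^[A-TV-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$/.test(code);
}

// Rough shape check for CPT codes (e.g. "99213"): four digits followed by
// a digit or a letter (Category II/III codes end in a letter).
export function looksLikeCpt(code: string): boolean {
  return /^\d{4}[0-9A-Z]$/.test(code);
}
```

A format check like this catches obvious hallucinations (prose in a code field) but cannot confirm a code is real; validating against an actual code set belongs in the validation layer.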
2) Create an AutoGen assistant that returns structured JSON only
The core pattern is simple: instruct the agent to extract fields from a document payload and return valid JSON matching your schema. Use AssistantAgent with a strict system message and a chat completion client backed by your model provider.
import { AssistantAgent } from "@autogenai/autogen";
import { OpenAIChatCompletionClient } from "@autogenai/autogen/openai";
const client = new OpenAIChatCompletionClient({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_API_KEY!,
});

const extractor = new AssistantAgent({
  name: "healthcare_document_extractor",
  modelClient: client,
  systemMessage: `
You extract structured data from healthcare documents.
Return ONLY valid JSON.
Do not invent values.
If a field is missing or unreadable, use null or an empty array where appropriate.
Extract only these fields:
patientName, dateOfBirth, documentType, encounterDate, providerName,
diagnosisCodes, procedureCodes.
`,
});
3) Run the extraction workflow and validate the output
Feed the agent plain text after OCR. Then validate with Zod before persisting anything. This is where you stop hallucinations from becoming production data.
import { ExtractionSchema } from "./schema";
type DocInput = {
  documentId: string;
  text: string;
};

export async function extractDocument(doc: DocInput) {
  const result = await extractor.run([
    {
      role: "user",
      content: `Extract fields from this healthcare document:\n\n${doc.text}`,
    },
  ]);
  const rawText = typeof result === "string" ? result : result.content;

  // JSON.parse throws on malformed output; route that to review
  // instead of letting the exception crash the pipeline.
  let parsedJson: unknown;
  try {
    parsedJson = JSON.parse(rawText);
  } catch {
    return {
      documentId: doc.documentId,
      status: "needs_review" as const,
      errors: "model returned invalid JSON",
      rawText,
    };
  }

  const validated = ExtractionSchema.safeParse(parsedJson);
  if (!validated.success) {
    return {
      documentId: doc.documentId,
      status: "needs_review" as const,
      errors: validated.error.flatten(),
      rawText,
    };
  }
  return {
    documentId: doc.documentId,
    status: "extracted" as const,
    data: validated.data,
    rawText,
  };
}
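One practical wrinkle before the `JSON.parse` step: models often wrap JSON in markdown code fences even when instructed not to. A small, hedged helper (the name `stripJsonFences` is my own) can normalize that before parsing:

```typescript
// Strip a surrounding ```json … ``` fence if the model added one,
// leaving bare JSON untouched.
export function stripJsonFences(raw: string): string {
  const trimmed = raw.trim();
  const match = trimmed.match(/^```(?:json)?\s*([\s\S]*?)\s*```$/);
  return match ? match[1] : trimmed;
}
```

Call it on `rawText` before `JSON.parse` so a fenced-but-valid response doesn't get needlessly routed to human review.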
4) Add a human-review fallback for low-confidence or invalid records
In healthcare you do not want silent failure. If parsing fails or required identifiers are missing, route the record to a reviewer queue with the original text hash and model output attached.
function requiresReview(data: { patientName?: string; dateOfBirth?: string }) {
  return !data.patientName || !data.dateOfBirth;
}

// queueForHumanReview and saveToClinicalStore are your own integrations —
// wire them to your reviewer queue and clinical data store.
export async function processDocument(doc: DocInput) {
  const extracted = await extractDocument(doc);
  if (extracted.status === "needs_review") {
    await queueForHumanReview({
      documentId: doc.documentId,
      reason: "schema_validation_failed",
      payload: extracted,
    });
    return extracted;
  }
  if (requiresReview(extracted.data)) {
    await queueForHumanReview({
      documentId: doc.documentId,
      reason: "missing_required_fields",
      payload: extracted,
    });
    return { ...extracted, status: "needs_review" as const };
  }
  await saveToClinicalStore(extracted.data);
  return extracted;
}
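The review-queue payload should carry a hash of the original text, so any extraction can be traced back to the exact bytes the model saw. A minimal sketch using Node's built-in crypto module:

```typescript
import { createHash } from "node:crypto";

// Stable SHA-256 hash of the document text, suitable for attaching to
// review-queue payloads and audit records.
export function documentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}
```

Store the hash alongside the model output and validation result; if a claim is later disputed, you can prove which document version produced the extraction.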
Production Considerations
- Data residency
  - Keep PHI in-region. If your organization requires US-only processing or specific cloud regions, enforce that at the storage layer and the model endpoint layer.
  - Do not send raw documents to external services unless contracts, BAAs, and internal policy allow it.
- Auditability
  - Store document hashes, prompt versions, model version IDs, output JSON, validation errors, and reviewer actions.
  - You need traceability when a claim is disputed or a chart entry is questioned.
- Guardrails
  - Enforce schema validation before persistence.
  - Block free-form narrative output; extraction agents should not summarize unless explicitly required.
- Monitoring
  - Track invalid JSON rate, missing-field rate, reviewer override rate, and per-document-type accuracy.
  - Alert when extraction quality drops for a specific source system or template version.
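The monitoring rates above can be tracked with a few in-memory counters before you reach for a full metrics stack. This is an illustrative sketch with invented names (`ExtractionMetrics`, `invalidJsonRate`), not a prescribed design:

```typescript
// Minimal per-document-type counter for the outcomes worth alerting on.
type Outcome = "extracted" | "invalid_json" | "missing_fields" | "reviewer_override";

export class ExtractionMetrics {
  private counts = new Map<string, number>();

  record(documentType: string, outcome: Outcome): void {
    const key = `${documentType}:${outcome}`;
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  // Share of documents of this type whose model output failed to parse.
  invalidJsonRate(documentType: string): number {
    const total = [...this.counts.entries()]
      .filter(([k]) => k.startsWith(`${documentType}:`))
      .reduce((sum, [, n]) => sum + n, 0);
    if (total === 0) return 0;
    return (this.counts.get(`${documentType}:invalid_json`) ?? 0) / total;
  }
}
```

Keying by document type is the important part: a sudden spike in `invalid_json` for one type usually means a specific source template changed, not that the model degraded globally.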
Common Pitfalls
- Letting the agent infer missing clinical facts
  - Bad pattern: filling in diagnosis codes or dates based on context clues.
  - Fix it by telling the agent to use null or empty arrays when data is absent.
- Skipping validation because the model “usually works”
  - A single malformed response can poison downstream claims or EHR workflows.
  - Always validate with Zod or an equivalent schema before writing to storage.
- Ignoring template drift
  - Healthcare forms change often across facilities and payers.
  - Version your prompts by document type and monitor failure rates per template so you can retrain rules or update instructions quickly.
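Prompt versioning per document type can be as simple as a registry keyed by type. The names and version strings here are illustrative placeholders, not a real scheme:

```typescript
// Hypothetical prompt registry: each document type gets its own versioned
// system message, so failure rates can be correlated with prompt versions.
type PromptVersion = { version: string; systemMessage: string };

const promptRegistry = new Map<string, PromptVersion>([
  ["discharge_summary", { version: "discharge-v2", systemMessage: "discharge-specific extraction instructions" }],
  ["prior_auth", { version: "prior-auth-v1", systemMessage: "prior-auth-specific extraction instructions" }],
]);

export function promptFor(documentType: string): PromptVersion {
  // Fall back to a generic prompt for types without a dedicated version yet.
  return (
    promptRegistry.get(documentType) ??
    { version: "generic-v1", systemMessage: "generic extraction instructions" }
  );
}
```

Log the returned `version` with every extraction; when a template drifts, you can pin the regression to a prompt version and roll out a targeted update instead of editing one global prompt.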
If you build this correctly, AutoGen becomes the orchestration layer for extraction while your own code handles compliance boundaries. That separation is what makes the agent usable in regulated healthcare systems instead of just looking good in a demo.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit