CrewAI Tutorial (TypeScript): handling long documents for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to build a CrewAI workflow in TypeScript that can ingest, chunk, summarize, and answer questions over long documents without blowing past model context limits. You need this when a single PDF, contract, policy pack, or research report is too large to feed into one prompt and you still want reliable retrieval and grounded answers.

What You'll Need

  • Node.js 18+
  • A TypeScript project with ts-node or tsx
  • CrewAI JS/TS package
  • OpenAI API key
  • pdf-parse for extracting text from PDFs
  • dotenv for environment variables
  • A long document in .pdf or .txt format

Install the dependencies:

npm install crewai openai pdf-parse dotenv
npm install -D typescript tsx @types/node

Set your environment variable:

export OPENAI_API_KEY="your-key-here"

Step-by-Step

  1. Start by extracting text from the document and splitting it into manageable chunks. The point is not to preserve perfect formatting; it is to create stable units the agents can process independently.
import fs from "node:fs";
import pdf from "pdf-parse";

export async function loadDocument(path: string): Promise<string> {
  if (path.endsWith(".pdf")) {
    const buffer = fs.readFileSync(path);
    const data = await pdf(buffer);
    return data.text;
  }
  return fs.readFileSync(path, "utf8");
}

export function chunkText(text: string, chunkSize = 4000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push(text.slice(i, i + chunkSize));
  }
  return chunks;
}
  2. Define a summarizer agent that compresses each chunk into structured notes. For long documents, this is the first pass that turns raw text into something you can actually reason over.
import { Agent } from "crewai";

export const summarizer = new Agent({
  role: "Document Summarizer",
  goal: "Summarize document chunks into concise, factual notes",
  backstory:
    "You extract key facts, entities, dates, risks, obligations, and decisions from long enterprise documents.",
});
  3. Create a task per chunk and run them in sequence. Sequential processing keeps memory pressure predictable and makes debugging easier when one chunk produces bad output.
import { Task } from "crewai";

export function buildSummarizationTasks(chunks: string[], agent: Agent): Task[] {
  return chunks.map(
    (chunk, index) =>
      new Task({
        description: `Summarize chunk ${index + 1} of ${chunks.length}. Focus on facts, entities, dates, obligations, exceptions, and open questions.\n\n${chunk}`,
        expectedOutput:
          "A structured summary with bullet points for facts, risks, and notable terms.",
        agent,
      })
  );
}
  4. Add an aggregation agent that merges all chunk summaries into one coherent document brief. This step is where you recover global context across the whole file instead of treating each chunk as isolated noise.
import { Agent } from "crewai";

export const aggregator = new Agent({
  role: "Document Aggregator",
  goal: "Merge chunk summaries into a single authoritative brief",
  backstory:
    "You reconcile overlapping summaries, remove duplicates, and produce a compact but complete overview.",
});
  5. Wire everything together in a runnable script. This example uses CrewAI’s Crew with sequential execution so you can process long inputs safely and then ask follow-up questions against the distilled result.
import "dotenv/config";
import { Crew, Task } from "crewai";
import { loadDocument, chunkText } from "./document-utils";
import { summarizer, aggregator } from "./agents";
import { buildSummarizationTasks } from "./tasks";

async function main() {
  const text = await loadDocument("./long-document.pdf");
  const chunks = chunkText(text, 3500);

  // Cap the first run at 8 chunks to keep cost and runtime predictable.
  const summaryTasks = buildSummarizationTasks(chunks.slice(0, 8), summarizer);

  // Final task: merge the chunk summaries produced by the preceding tasks
  // into the single brief described in step 4.
  const aggregationTask = new Task({
    description:
      "Merge the chunk summaries produced so far into a single authoritative brief. Remove duplicates and reconcile conflicting or overlapping statements.",
    expectedOutput: "One coherent brief covering the whole document.",
    agent: aggregator,
  });

  const crew = new Crew({
    agents: [summarizer, aggregator],
    tasks: [...summaryTasks, aggregationTask],
    verbose: true,
    memory: false,
  });

  const result = await crew.kickoff();
  console.log(String(result));
}

main().catch(console.error);
  6. If you need Q&A over the processed document, feed the aggregated summary into a second crew run. That gives you much better answer quality than asking the model to search raw source text directly.
import "dotenv/config";
import { Agent, Crew, Task } from "crewai";

const qaAgent = new Agent({
  role: "Document Analyst",
  goal: "Answer questions using only provided document summaries",
  backstory:
    "You answer strictly from the supplied summary and say so when the summary does not contain the answer.",
});

const qaTask = new Task({
  description:
    "Using the following summary of the full document, answer: What are the top three operational risks?\n\nPASTE_AGGREGATED_SUMMARY_HERE",
  expectedOutput: "A grounded answer with direct references to the summary.",
  agent: qaAgent,
});

const qaCrew = new Crew({
  agents: [qaAgent],
  tasks: [qaTask],
  verbose: true,
});

async function runQA() {
  const answer = await qaCrew.kickoff();
  console.log(String(answer));
}

runQA().catch(console.error);

Testing It

Run the script against a known long document first, not production data. You want to verify that chunking is stable, summaries stay factual, and the final aggregated output does not invent details.

Check three things:

  • Each chunk produces a compact summary instead of echoing raw text.
  • The aggregation step removes duplicates and resolves repeated references.
  • Follow-up Q&A answers only use information present in the summaries.

If outputs get vague or hallucinated, reduce chunk size and make the summarization prompt more structured. In enterprise settings I usually add explicit fields like facts, risks, entities, and open_questions so downstream parsing stays deterministic.
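One way to enforce that structure is to demand strict JSON with exactly those fields and validate each chunk summary before aggregation. A hand-rolled validator sketch (the field names `facts`, `risks`, `entities`, and `open_questions` follow the suggestion above; they are not a CrewAI convention):

```typescript
interface ChunkSummary {
  facts: string[];
  risks: string[];
  entities: string[];
  open_questions: string[];
}

// Returns the parsed summary, or null when the model's output drifted off
// format, so the caller can retry the task instead of aggregating garbage.
function parseChunkSummary(raw: string): ChunkSummary | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const obj = data as Record<string, unknown>;
  const keys: (keyof ChunkSummary)[] = ["facts", "risks", "entities", "open_questions"];
  for (const key of keys) {
    const value = obj[key];
    if (!Array.isArray(value) || !value.every((v) => typeof v === "string")) {
      return null;
    }
  }
  return obj as unknown as ChunkSummary;
}
```

Rejecting malformed summaries early keeps the aggregation step from silently absorbing off-format output.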

Next Steps

  • Add embeddings plus vector search so you can retrieve only relevant chunks before summarization.
  • Replace naive character chunking with sentence-aware splitting for cleaner boundaries.
  • Add JSON schema validation on agent outputs so your pipeline fails fast when summaries drift off format.
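The sentence-aware splitting mentioned above can be sketched without any extra dependency: split on sentence-ending punctuation, then pack whole sentences greedily up to the size limit so no sentence is cut in half.

```typescript
// Split on sentence boundaries, then pack whole sentences into chunks.
// A single sentence longer than chunkSize becomes its own oversized chunk.
function chunkBySentence(text: string, chunkSize = 4000): string[] {
  const sentences = text.match(/[^.!?]+[.!?]+(\s+|$)|[^.!?]+$/g) ?? [text];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    if (current && current.length + sentence.length > chunkSize) {
      chunks.push(current);
      current = "";
    }
    current += sentence;
  }
  if (current) chunks.push(current);
  return chunks;
}
```

The regex here is a deliberately simple heuristic; it will mis-split abbreviations like "e.g.", which is usually acceptable for summarization but worth knowing about.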


By Cyprian Aarons, AI Consultant at Topiax.
