CrewAI Tutorial (TypeScript): chunking large documents for intermediate developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows how to split a large document into smaller, token-safe chunks in TypeScript, then feed those chunks into CrewAI tasks without blowing context limits. You need this when you’re processing contracts, policies, claims, or long reports that are too large for a single agent run.

What You'll Need

  • Node.js 18+
  • A TypeScript project with ts-node or a build step
  • @crewai/crewai
  • dotenv
  • An OpenAI API key in .env
  • A large text file to test with, like ./data/policy.txt

Install the packages:

npm install @crewai/crewai dotenv
npm install -D typescript ts-node @types/node

Create a .env file:

OPENAI_API_KEY=your_api_key_here

Step-by-Step

  1. Start by creating a small chunking utility that splits text by words and keeps overlap between chunks. Overlap matters because important clauses often span boundaries, especially in legal and insurance documents.
export function chunkText(
  text: string,
  chunkSize = 800,
  overlap = 100
): string[] {
  // Guard against an infinite loop: the window must always move forward.
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }

  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];

  let start = 0;
  while (start < words.length) {
    const end = Math.min(start + chunkSize, words.length);
    chunks.push(words.slice(start, end).join(" "));
    if (end === words.length) break;
    // Step back by `overlap` words so clauses that span a boundary appear in both chunks.
    start = Math.max(0, end - overlap);
  }

  return chunks;
}
  2. Next, load your source document from disk and chunk it. Keep this separate from CrewAI so you can reuse the same chunker for different pipelines later.
import fs from "node:fs";
import path from "node:path";
import { chunkText } from "./chunkText";

const filePath = path.resolve("./data/policy.txt");
const documentText = fs.readFileSync(filePath, "utf-8");

const chunks = chunkText(documentText, 700, 120);

console.log(`Loaded ${documentText.length} characters`);
console.log(`Created ${chunks.length} chunks`);
console.log(chunks[0]?.slice(0, 300));
  3. Now wire the chunks into CrewAI. The pattern here is simple: one agent reviews each chunk and extracts structured notes, which gives you consistent output across very large inputs.
import "dotenv/config";
import { Agent, Task, Crew } from "@crewai/crewai";

const analyst = new Agent({
  role: "Document Analyst",
  goal: "Extract key obligations and risks from each document chunk",
  backstory: "You review long enterprise documents and produce concise structured notes.",
});

async function analyzeChunk(chunk: string) {
  const task = new Task({
    description: `Review this document chunk and extract:
- key obligations
- deadlines or dates
- risks or exceptions
Return bullet points only.

Chunk:
${chunk}`,
    agent: analyst,
    expectedOutput: "Bullet-point summary of the chunk",
  });

  const crew = new Crew({
    agents: [analyst],
    tasks: [task],
  });

  return await crew.kickoff();
}
  4. Process every chunk sequentially so you stay within rate limits and keep memory usage predictable. For production workloads, this is easier to monitor than firing off everything at once.
import fs from "node:fs";
import path from "node:path";
import { chunkText } from "./chunkText";
// Assumes the analyst and analyzeChunk from the previous step live in ./analyzeChunk.ts
import { analyzeChunk } from "./analyzeChunk";

async function main() {
  const filePath = path.resolve("./data/policy.txt");
  const documentText = fs.readFileSync(filePath, "utf-8");
  const chunks = chunkText(documentText, 700, 120);

  const results: string[] = [];

  for (let i = 0; i < chunks.length; i++) {
    console.log(`Processing chunk ${i + 1}/${chunks.length}`);
    const result = await analyzeChunk(chunks[i]);
    results.push(String(result));
  }

  fs.mkdirSync("./output", { recursive: true });
  fs.writeFileSync("./output/chunk-notes.txt", results.join("\n\n---\n\n"));
}

main().catch(console.error);
  5. Finally, add a second pass that merges all chunk summaries into one consolidated report. This is where CrewAI becomes useful beyond extraction: you turn many local summaries into one global answer.
const synthesizer = new Agent({
  role: "Senior Document Reviewer",
  goal: "Combine multiple chunk summaries into one concise report",
  backstory: "You merge detailed notes from many reviewers into a single, decision-ready report.",
});

async function synthesizeReport(notes: string[]) {
  const task = new Task({
    description: `Combine these notes into one report with:
- top risks
- top obligations
- notable dates
- open questions

Notes:
${notes.join("\n\n")}`,
    agent: synthesizer,
    expectedOutput: "A consolidated report",
  });

  const crew = new Crew({
    agents: [synthesizer],
    tasks: [task],
  });

  return await crew.kickoff();
}
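
Step 4's main() never calls synthesizeReport, so finish by feeding the collected notes into the second pass. A minimal sketch, assuming synthesizeReport is defined in (or imported into) the same file as main(); the final-report.txt filename is just an example:

  // Inside main(), after the chunk loop:
  const report = await synthesizeReport(results);
  fs.writeFileSync("./output/final-report.txt", String(report));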

Testing It

Run the script against a real document that is larger than your model’s comfortable context window. If your chunking is working, you should see multiple chunk logs and no truncation errors.
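
Assuming your entry point is src/main.ts (adjust the path to match your project layout), run it with ts-node:

npx ts-node src/main.ts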

Check the output file for two things:

  • Each chunk summary should be focused on local content only.
  • The final synthesized report should read like a merged analysis, not a copy of one random section.

If the summaries look noisy, reduce chunkSize. If they miss cross-section context, increase overlap.
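
Both adjustments are one-line changes to the chunkText call; the numbers below are illustrative starting points, not recommendations:

// Option A: summaries look noisy, so shrink chunkSize
const chunks = chunkText(documentText, 400, 120);

// Option B: summaries miss cross-section context, so widen overlap
const chunks = chunkText(documentText, 700, 200);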

Next Steps

  • Add token-based chunk sizing using tiktoken or a tokenizer compatible with your model (see the sketch after this list).
  • Store intermediate chunk outputs in Postgres or S3 so failed runs can resume.
  • Add metadata per chunk (page, section, offset) so downstream agents can cite sources precisely.
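
For the first item, here is one possible shape for a token-aware chunker. It is a sketch only: it takes a countTokens callback so you can plug in tiktoken, js-tiktoken, or your model's own tokenizer, and the sentence-splitting regex and default values are assumptions, not part of CrewAI.

export function chunkByTokens(
  text: string,
  countTokens: (s: string) => number,
  maxTokens = 1000,
  overlapSentences = 2
): string[] {
  // Split on sentence boundaries so overlap stays readable.
  const sentences = text.split(/(?<=[.!?])\s+/).filter(Boolean);
  const chunks: string[] = [];

  let current: string[] = [];
  let currentTokens = 0;

  for (const sentence of sentences) {
    const tokens = countTokens(sentence);

    // Flush the current chunk when adding this sentence would exceed the budget.
    if (currentTokens + tokens > maxTokens && current.length > 0) {
      chunks.push(current.join(" "));
      // Carry the last few sentences forward so context spans chunk boundaries.
      current = current.slice(-overlapSentences);
      currentTokens = current.reduce((sum, s) => sum + countTokens(s), 0);
    }

    current.push(sentence);
    currentTokens += tokens;
  }

  if (current.length > 0) chunks.push(current.join(" "));
  return chunks;
}

Swap countTokens for your tokenizer of choice; the rest of the pipeline (analyzeChunk, synthesizeReport) stays the same.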

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

