LlamaIndex Tutorial (TypeScript): chunking large documents for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows how to split large documents into retrieval-friendly chunks with LlamaIndex in TypeScript, then inspect and tune the output for production use. You need this when your source files are too large for reliable embedding, retrieval quality drops, or you want deterministic chunk boundaries for legal, financial, or policy documents.

What You'll Need

  • Node.js 18+
  • A TypeScript project with ts-node or a build step
  • llamaindex installed
  • An OpenAI API key set as OPENAI_API_KEY (used later for embedding and indexing; the chunking steps here run locally)
  • A large text file to test with, such as:
    • annual reports
    • policy PDFs converted to text
    • long internal docs
  • Basic familiarity with async/await and filesystem access

Step-by-Step

  1. Start by installing the package and setting up a minimal TypeScript runtime. For chunking alone, you do not need a full vector store yet.
npm install llamaindex
npm install -D typescript ts-node @types/node
  2. Load a large document from disk and convert it into a Document. In real systems, this is where you normalize whitespace, strip boilerplate, or pre-clean OCR noise before chunking.
import fs from "node:fs/promises";
import { Document } from "llamaindex";

async function main() {
  const text = await fs.readFile("./data/large-document.txt", "utf8");

  const document = new Document({
    text,
    metadata: {
      source: "large-document.txt",
      department: "risk",
    },
  });

  console.log("Loaded characters:", document.text.length);
}

main().catch(console.error);
  3. Configure the chunking strategy explicitly. The default settings are fine for demos, but advanced use cases need control over chunk size and overlap so retrieval stays stable across long sections. Keep in mind that chunkSize and chunkOverlap are measured in tokens, not characters.
import { Settings, SentenceSplitter } from "llamaindex";

Settings.chunkSize = 1024;
Settings.chunkOverlap = 128;
Settings.nodeParser = new SentenceSplitter({
  chunkSize: Settings.chunkSize,
  chunkOverlap: Settings.chunkOverlap,
});

console.log("Chunk size:", Settings.chunkSize);
console.log("Chunk overlap:", Settings.chunkOverlap);
  4. Parse the document into nodes and inspect the resulting chunks. This is the point where you validate whether headings, paragraphs, and semantic boundaries are being preserved well enough for downstream retrieval.
import fs from "node:fs/promises";
import { Document, SentenceSplitter } from "llamaindex";

async function main() {
  const text = await fs.readFile("./data/large-document.txt", "utf8");
  const document = new Document({ text });

  const splitter = new SentenceSplitter({
    chunkSize: 1024,
    chunkOverlap: 128,
  });

  const nodes = splitter.getNodesFromDocuments([document]);

  console.log("Total chunks:", nodes.length);
  console.log("First chunk preview:");
  console.log(nodes[0]?.getContent().slice(0, 500));
}

main().catch(console.error);
  5. If your documents have structure like headings or sections, keep metadata attached to each node so retrieval can filter later by source or section. This matters in banking and insurance workflows where one document can contain multiple policies or product lines.
import fs from "node:fs/promises";
import { Document, SentenceSplitter } from "llamaindex";

async function main() {
  const text = await fs.readFile("./data/large-document.txt", "utf8");

  const document = new Document({
    text,
    metadata: {
      source: "large-document.txt",
      docType: "policy",
      version: "2026-01",
    },
  });

  const splitter = new SentenceSplitter({
    chunkSize: 800,
    chunkOverlap: 100,
  });

  const nodes = splitter.getNodesFromDocuments([document]);

  console.log(
    nodes.slice(0, 3).map((node) => ({
      id: node.id_,
      preview: node.getContent().slice(0, 120),
      metadata: node.metadata,
    })),
  );
}

main().catch(console.error);
  6. Once the chunks look right, keep them in memory for indexing or pass them directly into a vector index later. For now, the important part is verifying that your chunking settings produce consistent node sizes and readable boundaries.
import fs from "node:fs/promises";
import { Document, SentenceSplitter } from "llamaindex";

async function main() {
  const text = await fs.readFile("./data/large-document.txt", "utf8");
  const document = new Document({ text });

  const splitter = new SentenceSplitter({
    chunkSize: 1200,
    chunkOverlap: 150,
    paragraphSeparator: "\n\n",
    sentenceSeparator: ". ",
    secondaryChunkingRegex: "[^,\n]+[,\\n]?",
  });

  const nodes = splitter.getNodesFromDocuments([document]);

  for (const [index, node] of nodes.entries()) {
    console.log(`Chunk ${index + 1}: ${node.getContent().length} chars`);
  }
}

main().catch(console.error);

Testing It

Run the script against a few real documents with different structures. A good test set includes one dense legal-style document, one narrative report, and one badly formatted OCR export.

Check that chunks are not too small to be useful and not so large that they exceed your embedding model’s sweet spot. In practice, you want chunks that preserve meaning without crossing too many topic boundaries.
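
A quick way to check this is a small length report. The sketch below assumes the same file path and splitter settings used in the steps above and prints the minimum, average, and maximum chunk length in characters.
import fs from "node:fs/promises";
import { Document, SentenceSplitter } from "llamaindex";

async function reportChunkStats() {
  const text = await fs.readFile("./data/large-document.txt", "utf8");
  const splitter = new SentenceSplitter({ chunkSize: 1024, chunkOverlap: 128 });
  const nodes = splitter.getNodesFromDocuments([new Document({ text })]);

  // Character length per chunk: a rough proxy for how close chunks sit to the target size.
  const lengths = nodes.map((node) => node.getContent().length);
  const avg = lengths.reduce((sum, len) => sum + len, 0) / lengths.length;

  console.log("Total chunks:", lengths.length);
  console.log("Min / avg / max chars:", Math.min(...lengths), Math.round(avg), Math.max(...lengths));
}

reportChunkStats().catch(console.error);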

Inspect overlap behavior manually by comparing adjacent chunks. If repeated context is too high, reduce overlap; if answers are getting split across boundaries, increase it slightly.
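
One way to eyeball this is to print the tail of one chunk next to the head of the following chunk. This is a rough sketch reusing the same document and splitter settings as above.
import fs from "node:fs/promises";
import { Document, SentenceSplitter } from "llamaindex";

async function inspectOverlap() {
  const text = await fs.readFile("./data/large-document.txt", "utf8");
  const splitter = new SentenceSplitter({ chunkSize: 1024, chunkOverlap: 128 });
  const nodes = splitter.getNodesFromDocuments([new Document({ text })]);

  // Show the boundary between the first few adjacent chunk pairs.
  for (let i = 0; i < Math.min(nodes.length - 1, 3); i++) {
    console.log(`--- End of chunk ${i + 1} ---`);
    console.log(nodes[i]?.getContent().slice(-200));
    console.log(`--- Start of chunk ${i + 2} ---`);
    console.log(nodes[i + 1]?.getContent().slice(0, 200));
    console.log();
  }
}

inspectOverlap().catch(console.error);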

If you plan to index these chunks next, verify that metadata survives intact on every node. That is what lets you filter by source later instead of treating every chunk as anonymous text.
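
A minimal check, assuming the policy-style metadata from step 5, is to assert that every node still carries the fields set on the parent Document.
import fs from "node:fs/promises";
import { Document, SentenceSplitter } from "llamaindex";

async function checkMetadata() {
  const text = await fs.readFile("./data/large-document.txt", "utf8");
  const document = new Document({
    text,
    metadata: { source: "large-document.txt", docType: "policy", version: "2026-01" },
  });

  const splitter = new SentenceSplitter({ chunkSize: 800, chunkOverlap: 100 });
  const nodes = splitter.getNodesFromDocuments([document]);

  // Flag any node that lost the document-level metadata during splitting.
  const missing = nodes.filter((node) => node.metadata.source !== "large-document.txt");
  console.log(
    missing.length === 0
      ? "Metadata intact on every node."
      : `Metadata missing on ${missing.length} nodes.`,
  );
}

checkMetadata().catch(console.error);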

Next Steps

  • Add VectorStoreIndex on top of these nodes and test retrieval quality with real queries.
  • Compare SentenceSplitter with structure-aware parsers for markdown or HTML sources.
  • Build a small evaluation script that measures hit rate across different chunkSize and chunkOverlap values (a starting sketch follows this list).
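
For that last item, a rough starting point is to sweep a few chunkSize and chunkOverlap combinations and compare the resulting chunk counts; measuring actual hit rate would additionally require an index and a labelled query set. The candidate values below are just examples.
import fs from "node:fs/promises";
import { Document, SentenceSplitter } from "llamaindex";

async function sweepChunkSettings() {
  const text = await fs.readFile("./data/large-document.txt", "utf8");
  const document = new Document({ text });

  // Example candidate settings; replace with values that match your corpus.
  const candidates = [
    { chunkSize: 512, chunkOverlap: 64 },
    { chunkSize: 1024, chunkOverlap: 128 },
    { chunkSize: 2048, chunkOverlap: 256 },
  ];

  for (const { chunkSize, chunkOverlap } of candidates) {
    const splitter = new SentenceSplitter({ chunkSize, chunkOverlap });
    const nodes = splitter.getNodesFromDocuments([document]);
    console.log(`chunkSize=${chunkSize}, chunkOverlap=${chunkOverlap} -> ${nodes.length} chunks`);
  }
}

sweepChunkSettings().catch(console.error);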

By Cyprian Aarons, AI Consultant at Topiax.
