LangChain Tutorial (TypeScript): chunking large documents for beginners

By Cyprian Aarons · Updated 2026-04-21
Tags: langchain · chunking-large-documents-for-beginners · typescript

This tutorial shows you how to split a large document into smaller chunks in TypeScript using LangChain, then inspect the chunks so you can feed them into embeddings, retrieval, or LLM pipelines. You need this any time your source text is too large for model context windows or you want more precise search and retrieval over long content.

What You'll Need

  • Node.js 18+
  • A TypeScript project
  • langchain installed
  • @langchain/openai installed if you want to embed or summarize chunks later
  • An OpenAI API key in OPENAI_API_KEY
  • A large text file to test with, such as a PDF-extracted .txt file or markdown export

Install the packages:

npm install langchain @langchain/openai dotenv
npm install -D typescript tsx @types/node
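
Run each example with tsx, for instance: npx tsx chunk.ts (the filename is whatever you save the script under).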

Step-by-Step

  1. Start with a plain text loader. For beginners, the cleanest path is to load a .txt file first so you can focus on chunking instead of document parsing. LangChain returns Document objects, which is what the splitter expects.
import "dotenv/config";
import { TextLoader } from "langchain/document_loaders/fs/text";

async function main() {
  const loader = new TextLoader("./data/large-document.txt");
  const docs = await loader.load();

  console.log("Loaded documents:", docs.length);
  console.log("First doc preview:", docs[0].pageContent.slice(0, 200));
}

main().catch(console.error);
  2. Split the loaded document into chunks with overlap. The important knobs are chunkSize and chunkOverlap. For most beginner use cases, start with 800–1200 characters and 100–200 characters of overlap.
import "dotenv/config";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

async function main() {
  const loader = new TextLoader("./data/large-document.txt");
  const docs = await loader.load();

  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000, // maximum characters per chunk
    chunkOverlap: 150, // characters repeated from the end of the previous chunk
    separators: ["\n\n", "\n", " ", ""], // try paragraphs first, then lines, words, characters
  });

  const chunks = await splitter.splitDocuments(docs);

  console.log("Chunks created:", chunks.length);
}

main().catch(console.error);
  3. Inspect the resulting chunk metadata and content boundaries. This is the step most people rush through, only to end up debugging bad retrieval later. Print the first few chunks so you can confirm the split points make sense.
import "dotenv/config";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

async function main() {
  const loader = new TextLoader("./data/large-document.txt");
  const docs = await loader.load();

  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 150,
  });

  const chunks = await splitter.splitDocuments(docs);

  chunks.slice(0, 3).forEach((chunk, index) => {
    console.log(`--- Chunk ${index + 1} ---`);
    console.log("Metadata:", chunk.metadata);
    console.log(chunk.pageContent.slice(0, 300));
    console.log();
  });
}

main().catch(console.error);
  4. Save the chunks for downstream use. In real systems, you usually pass these into embeddings and a vector store rather than keeping them in memory. Here we’ll write them to disk as JSON so you can verify structure before moving on.
import "dotenv/config";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { writeFile } from "node:fs/promises";

async function main() {
  const loader = new TextLoader("./data/large-document.txt");
  const docs = await loader.load();

  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 150,
  });

  const chunks = await splitter.splitDocuments(docs);

  await writeFile(
    "./data/chunks.json",
    JSON.stringify(chunks, null, 2),
    "utf8"
  );

  console.log("Saved chunks to ./data/chunks.json");
}

main().catch(console.error);
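
Each serialized chunk is a plain object with a pageContent string and a metadata record; the splitter also records the source line range under loc. The values below are illustrative, not exact:

[
  {
    "pageContent": "First chunk of text, up to about 1000 characters...",
    "metadata": {
      "source": "./data/large-document.txt",
      "loc": { "lines": { "from": 1, "to": 42 } }
    }
  }
]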
  5. If your next step is retrieval, embed the chunks directly after splitting. This is the standard pipeline for RAG: load, split, embed, store, retrieve. The example below uses OpenAI embeddings so you can plug it into a vector store next.
import "dotenv/config";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { OpenAIEmbeddings } from "@langchain/openai";

async function main() {
  const loader = new TextLoader("./data/large-document.txt");
  const docs = await loader.load();

  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 150,
  });

  const chunks = await splitter.splitDocuments(docs);
  // text-embedding-3-small produces 1536-dimension vectors by default.
  const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });

  const vectors = await embeddings.embedDocuments(
    chunks.map((chunk) => chunk.pageContent)
  );

   console.log("Embedded chunks:", vectors.length);
   console.log("Vector dimension:", vectors[0].length);
}

main().catch(console.error);

Testing It

Run the script against a real document that has headings, paragraphs, and some repeated terms. You want to confirm that chunk boundaries preserve meaning instead of slicing every few lines arbitrarily.

Check three things:

  • The number of chunks is reasonable for the document size
  • Chunks overlap enough that important context isn’t lost
  • The first and last few characters of adjacent chunks look natural

If your output looks noisy or too fragmented, increase chunkSize. If retrieval later misses context across boundaries, increase chunkOverlap.
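
To make those checks concrete, here is a small sketch that prints the tail of each chunk next to the head of the chunk that follows. It assumes the chunks array produced by splitter.splitDocuments in the steps above; the 80-character window is an arbitrary choice.

// Print the end of each chunk alongside the start of the next one
// so you can eyeball the overlap at every boundary.
function inspectBoundaries(chunks: { pageContent: string }[], window = 80) {
  for (let i = 0; i < chunks.length - 1; i++) {
    const tail = chunks[i].pageContent.slice(-window);
    const head = chunks[i + 1].pageContent.slice(0, window);
    console.log(`--- Boundary ${i + 1} -> ${i + 2} ---`);
    console.log("..." + tail);
    console.log(head + "...");
    console.log();
  }
}

Because chunkOverlap is set, the head of each chunk should repeat part of the previous chunk's tail; if adjacent chunks share nothing, revisit your separators.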

Next Steps

  • Add a vector store like Pinecone, pgvector, or Chroma and persist the embedded chunks (a minimal in-memory sketch follows this list)
  • Try MarkdownTextSplitter, the markdown-aware splitter in langchain/text_splitter, for structured documents with headings
  • Build a retrieval chain that answers questions over the chunked content
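
As a first pass at the vector store and retrieval bullets, LangChain ships an in-memory store that is handy for experiments before you commit to a real database. A minimal sketch, assuming the same file and splitter settings as above; the query string is just an example:

import "dotenv/config";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";

async function main() {
  const loader = new TextLoader("./data/large-document.txt");
  const docs = await loader.load();

  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 150,
  });
  const chunks = await splitter.splitDocuments(docs);

  // Embed the chunks and index them in memory.
  const store = await MemoryVectorStore.fromDocuments(
    chunks,
    new OpenAIEmbeddings({ model: "text-embedding-3-small" })
  );

  // Retrieve the 3 chunks most similar to an example question.
  const results = await store.similaritySearch("What is the main argument?", 3);
  results.forEach((doc, i) => {
    console.log(`--- Match ${i + 1} ---`);
    console.log(doc.pageContent.slice(0, 200));
  });
}

main().catch(console.error);

MemoryVectorStore keeps everything in process memory, so treat it as a development tool and swap in a persistent store before shipping anything.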

By Cyprian Aarons, AI Consultant at Topiax.
