LangChain Tutorial (TypeScript): chunking large documents for beginners
This tutorial shows you how to split a large document into smaller chunks in TypeScript using LangChain, then inspect the chunks so you can feed them into embeddings, retrieval, or LLM pipelines. You need this any time your source text is too large for model context windows or you want more precise search and retrieval over long content.
What You'll Need
- Node.js 18+
- A TypeScript project
- `langchain` installed
- `@langchain/openai` installed if you want to embed or summarize chunks later
- An OpenAI API key in `OPENAI_API_KEY`
- A large text file to test with, such as a PDF-extracted `.txt` file or markdown export
Install the packages:
```bash
npm install langchain @langchain/openai dotenv
npm install -D typescript tsx @types/node
```
Step-by-Step
- Start with a plain text loader. For beginners, the cleanest path is to load a `.txt` file first so you can focus on chunking instead of document parsing. LangChain returns `Document` objects, which is what the splitter expects.
```typescript
import "dotenv/config";
import { TextLoader } from "langchain/document_loaders/fs/text";

async function main() {
  const loader = new TextLoader("./data/large-document.txt");
  const docs = await loader.load();

  console.log("Loaded documents:", docs.length);
  console.log("First doc preview:", docs[0].pageContent.slice(0, 200));
}

main().catch(console.error);
```
- Split the loaded document into chunks with overlap. The important knobs are `chunkSize` and `chunkOverlap`. For most beginner use cases, start with 800–1200 characters and 100–200 characters of overlap.
```typescript
import "dotenv/config";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

async function main() {
  const loader = new TextLoader("./data/large-document.txt");
  const docs = await loader.load();

  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 150,
    separators: ["\n\n", "\n", " ", ""],
  });
  const chunks = await splitter.splitDocuments(docs);

  console.log("Chunks created:", chunks.length);
}

main().catch(console.error);
```
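To build intuition for the two knobs, here is a dependency-free sketch of fixed-size character chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also tries to cut on the separators (paragraphs, then lines, then spaces) before falling back to raw character positions.

```typescript
// Naive character chunker: each chunk starts chunkOverlap characters
// before the previous chunk ended, so context spans the boundary.
function naiveChunk(text: string, chunkSize: number, chunkOverlap: number): string[] {
  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap; // how far the window advances each iteration
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

// 2,500 characters with chunkSize 1000 / overlap 150 -> 3 chunks
const demo = naiveChunk("a".repeat(2500), 1000, 150);
console.log(demo.length);    // 3
console.log(demo[0].length); // 1000
```

Note that the last chunk is shorter (800 characters here); the real splitter behaves the same way at the end of a document.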
- Inspect the resulting chunk metadata and content boundaries. This is the step most people rush through, and they end up debugging bad retrieval later. Print the first few chunks so you can confirm the split points make sense.
```typescript
import "dotenv/config";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

async function main() {
  const loader = new TextLoader("./data/large-document.txt");
  const docs = await loader.load();

  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 150,
  });
  const chunks = await splitter.splitDocuments(docs);

  chunks.slice(0, 3).forEach((chunk, index) => {
    console.log(`--- Chunk ${index + 1} ---`);
    console.log("Metadata:", chunk.metadata);
    console.log(chunk.pageContent.slice(0, 300));
    console.log();
  });
}

main().catch(console.error);
```
- Save the chunks for downstream use. In real systems, you usually pass these into embeddings and a vector store rather than keeping them in memory. Here we'll write them to disk as JSON so you can verify the structure before moving on.
```typescript
import "dotenv/config";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { writeFile } from "node:fs/promises";

async function main() {
  const loader = new TextLoader("./data/large-document.txt");
  const docs = await loader.load();

  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 150,
  });
  const chunks = await splitter.splitDocuments(docs);

  await writeFile(
    "./data/chunks.json",
    JSON.stringify(chunks, null, 2),
    "utf8"
  );
  console.log("Saved chunks to ./data/chunks.json");
}

main().catch(console.error);
```
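The saved file is plain JSON: an array of objects with `pageContent` and `metadata` fields, which is how LangChain `Document` objects serialize. That means any downstream tool can read it without LangChain. A minimal reader sketch, with an inline sample standing in for the real file (the exact metadata keys depend on your loader and splitter version):

```typescript
// Shape of each serialized chunk as written by JSON.stringify(chunks).
interface SerializedChunk {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// In a real script you would readFile("./data/chunks.json", "utf8");
// an inline sample stands in for the file contents here.
const raw = `[
  { "pageContent": "First chunk of text...",
    "metadata": { "source": "./data/large-document.txt" } }
]`;

const chunks: SerializedChunk[] = JSON.parse(raw);
console.log(chunks.length, "chunk(s) loaded");
console.log("source:", chunks[0].metadata.source);
```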
- If your next step is retrieval, embed the chunks directly after splitting. This is the standard RAG pipeline: load, split, embed, store, retrieve. The example below uses OpenAI embeddings so you can plug the result into a vector store next.
```typescript
import "dotenv/config";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { OpenAIEmbeddings } from "@langchain/openai";

async function main() {
  const loader = new TextLoader("./data/large-document.txt");
  const docs = await loader.load();

  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 150,
  });
  const chunks = await splitter.splitDocuments(docs);

  const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
  const vectors = await embeddings.embedDocuments(
    chunks.map((chunk) => chunk.pageContent)
  );

  console.log("Embedded chunks:", vectors.length);
  console.log("Vector dimension:", vectors[0].length);
}

main().catch(console.error);
```
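Before wiring the vectors into a store, it helps to sanity-check that related chunks actually score as similar. Cosine similarity is the metric most vector stores use under the hood; here is a dependency-free helper you could apply to any two vectors returned by `embedDocuments`:

```typescript
// Cosine similarity: 1 means identical direction, 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors; in practice compare vectors[i] and vectors[j] from above.
console.log(cosineSimilarity([1, 0, 1], [1, 0, 1])); // 1
console.log(cosineSimilarity([1, 0], [0, 1]));       // 0
```

Chunks about the same topic should score noticeably higher against each other than against unrelated chunks; if they don't, revisit your chunk size before blaming the retriever.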
Testing It
Run the script against a real document that has headings, paragraphs, and some repeated terms. You want to confirm that chunk boundaries preserve meaning instead of slicing every few lines arbitrarily.
Check three things:
- The number of chunks is reasonable for the document size
- Chunks overlap enough that important context isn't lost
- The first and last few characters of adjacent chunks look natural
If your output looks noisy or too fragmented, increase `chunkSize`. If retrieval later misses context across boundaries, increase `chunkOverlap`.
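Eyeballing the boundaries works, but you can also measure the overlap directly. A small sketch that reports how many characters the start of one chunk shares with the end of the previous one; apply it to adjacent `pageContent` strings from your split:

```typescript
// Returns the length of the longest prefix of `next` that is also a
// suffix of `prev` -- the effective character overlap at the boundary.
function boundaryOverlap(prev: string, next: string, maxCheck = 300): number {
  const limit = Math.min(maxCheck, prev.length, next.length);
  for (let n = limit; n > 0; n--) {
    if (prev.endsWith(next.slice(0, n))) return n;
  }
  return 0;
}

// With chunkOverlap: 150 you would expect values near 150 in practice.
console.log(boundaryOverlap("...end of chunk one overlap zone", "overlap zone starts chunk two")); // 12
```

Values near zero across many boundaries usually mean the splitter found clean separator cuts, or that your overlap setting is too small to matter.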
Next Steps
- Add a vector store like Pinecone, pgvector, or Chroma and persist the embedded chunks
- Try `MarkdownTextSplitter` (or `RecursiveCharacterTextSplitter.fromLanguage("markdown")`) for structured documents with headings
- Build a retrieval chain that answers questions over the chunked content
Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist + starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.