LangChain Tutorial (TypeScript): chunking large documents for intermediate developers
This tutorial shows how to split large documents into chunks in TypeScript using LangChain, then inspect those chunks so you can feed them into retrieval, summarization, or RAG pipelines. You need this when a single document is too large for your model context window, or when you want better retrieval by breaking content into smaller, semantically useful pieces.
What You'll Need
- Node.js 18+ and npm
- A TypeScript project with `tsconfig.json`
- `langchain` installed
- `@langchain/openai` installed
- An OpenAI API key in `OPENAI_API_KEY`
- A large text file to chunk, such as a policy doc, contract, handbook, or incident report
Step-by-Step
- Start by installing the packages and setting up a minimal TypeScript project. The splitter lives in LangChain core, while embeddings and loaders come from the provider packages.

```bash
npm init -y
npm install langchain @langchain/openai
npm install -D typescript tsx @types/node
```
- Load a document from disk and turn it into LangChain `Document` objects. For this example, we’ll use a plain text file because it keeps the tutorial focused on chunking rather than parsing PDFs or HTML.

```ts
import { TextLoader } from "langchain/document_loaders/fs/text";

async function main() {
  const loader = new TextLoader("./data/large-document.txt");
  const docs = await loader.load();

  console.log(`Loaded ${docs.length} document(s)`);
  console.log(docs[0].pageContent.slice(0, 200));
}

main().catch(console.error);
```
- Split the loaded document into overlapping chunks. Chunk size controls how much text goes into each piece, and overlap helps preserve context across boundaries so you don’t lose meaning at the edges.

```ts
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { TextLoader } from "langchain/document_loaders/fs/text";

async function main() {
  const loader = new TextLoader("./data/large-document.txt");
  const docs = await loader.load();

  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 150,
  });
  const chunks = await splitter.splitDocuments(docs);

  console.log(`Created ${chunks.length} chunks`);
}

main().catch(console.error);
```
- Inspect each chunk and keep metadata attached. In production systems, metadata is what lets you trace a chunk back to its source document, page, section, or customer record.

```ts
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { TextLoader } from "langchain/document_loaders/fs/text";

async function main() {
  const loader = new TextLoader("./data/large-document.txt");
  const docs = await loader.load();

  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 150,
  });
  const chunks = await splitter.splitDocuments(docs);

  chunks.slice(0, 3).forEach((chunk, index) => {
    console.log(`\nChunk ${index + 1}`);
    console.log(chunk.metadata);
    console.log(chunk.pageContent.slice(0, 300));
  });
}

main().catch(console.error);
```
- Write the chunks to disk so they can be reused by your ingestion pipeline. This is useful when you want to precompute embeddings later instead of splitting on every request.

```ts
import { mkdirSync, writeFileSync } from "node:fs";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { TextLoader } from "langchain/document_loaders/fs/text";

async function main() {
  const loader = new TextLoader("./data/large-document.txt");
  const docs = await loader.load();

  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 150,
  });
  const chunks = await splitter.splitDocuments(docs);

  mkdirSync("./output", { recursive: true });
  writeFileSync("./output/chunks.json", JSON.stringify(chunks, null, 2));
  console.log("Saved chunks to ./output/chunks.json");
}

main().catch(console.error);
```
- If you’re building a RAG pipeline next, embed the chunks after splitting them. Chunking first gives embeddings cleaner units of meaning, which usually improves retrieval quality over embedding entire documents. Fail fast if the API key is missing rather than letting the request fail later with a confusing error.

```ts
import { OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { TextLoader } from "langchain/document_loaders/fs/text";

async function main() {
  if (!process.env.OPENAI_API_KEY) {
    throw new Error("Set OPENAI_API_KEY before running this script");
  }

  const loader = new TextLoader("./data/large-document.txt");
  const docs = await loader.load();

  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 150,
  });
  const chunks = await splitter.splitDocuments(docs);

  const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
  const vectors = await embeddings.embedDocuments(
    chunks.map((chunk) => chunk.pageContent)
  );

  console.log(`Embedded ${vectors.length} chunks`);
}

main().catch(console.error);
```
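If you want a quick sanity check on those vectors before wiring up a vector store, cosine similarity between a query embedding and each chunk embedding gives a rough relevance ranking. A minimal sketch, where `cosineSimilarity` and `rankBySimilarity` are hypothetical helpers, not LangChain APIs:

```typescript
// Hypothetical helper: cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank chunk indices by similarity to a query vector, highest first.
function rankBySimilarity(queryVec: number[], chunkVecs: number[][]): number[] {
  return chunkVecs
    .map((vec, i) => ({ i, score: cosineSimilarity(queryVec, vec) }))
    .sort((a, b) => b.score - a.score)
    .map((entry) => entry.i);
}
```

Pass a query embedding from `embeddings.embedQuery(...)` as `queryVec` and the `vectors` array from the step above as `chunkVecs`; the first few indices returned are your top candidate chunks.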
Testing It
Run the script with `npx tsx your-file.ts` and confirm that the output shows multiple chunks instead of one large document. Check that each chunk has roughly the size you configured and that overlap appears between adjacent chunks when you compare their endings and beginnings.
If your source file has headings or paragraphs, verify that the splitter keeps related text together instead of slicing sentences in awkward places. For a stronger test, embed two similar documents and confirm that search results return smaller relevant chunks rather than huge blobs of text.
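To check overlap programmatically rather than by eye, compare each chunk's start against the previous chunk's end. A small sketch, where `findOverlap` is a hypothetical helper, not part of LangChain:

```typescript
// Hypothetical helper: return the longest suffix of `prev` that is also
// a prefix of `next`, i.e. the text the two adjacent chunks share.
function findOverlap(prev: string, next: string): string {
  const max = Math.min(prev.length, next.length);
  for (let len = max; len > 0; len--) {
    if (prev.endsWith(next.slice(0, len))) {
      return next.slice(0, len);
    }
  }
  return "";
}
```

Run it over `chunks.map((c) => c.pageContent)` pairwise and log the overlap length between neighbors. With the settings above you would expect overlaps of up to around 150 characters, though the splitter may use less when it breaks on a clean separator.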
Next Steps
- Add a vector store like pgvector, Pinecone, or Chroma and store these chunks for retrieval
- Switch from plain text loading to PDF or HTML loaders for real enterprise documents
- Tune `chunkSize` and `chunkOverlap` per document type instead of using one global setting
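That last point is worth emphasizing: different document types reward different settings. A minimal sketch of a per-type configuration lookup, where `SPLITTER_CONFIGS` and `configFor` are hypothetical names and the numbers are illustrative starting points rather than benchmarks:

```typescript
type SplitterConfig = { chunkSize: number; chunkOverlap: number };

// Illustrative per-document-type settings — tune these against your own corpus.
const SPLITTER_CONFIGS: Record<string, SplitterConfig> = {
  contract: { chunkSize: 1500, chunkOverlap: 200 }, // long clauses, keep more context
  handbook: { chunkSize: 1000, chunkOverlap: 150 }, // medium-length sections
  incident: { chunkSize: 600, chunkOverlap: 100 },  // short, dense reports
};

// Fall back to a general-purpose default for unknown document types.
function configFor(docType: string): SplitterConfig {
  return SPLITTER_CONFIGS[docType] ?? { chunkSize: 1000, chunkOverlap: 150 };
}
```

Spread the result into the splitter constructor, e.g. `new RecursiveCharacterTextSplitter(configFor("contract"))`, so ingestion code stays the same while settings vary per type.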
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit