# LangGraph Tutorial (TypeScript): chunking large documents for intermediate developers
This tutorial builds a LangGraph workflow in TypeScript that takes a large document, splits it into manageable chunks, and emits chunked output you can feed into retrieval, summarization, or extraction pipelines. You need this when the source text is too large for a single model call, when you want deterministic chunk boundaries, or when you need to process documents in parallel without blowing token limits.
## What You'll Need
- Node.js 18+
- TypeScript 5+
- `@langchain/langgraph`
- `@langchain/core`
- `zod`
- An OpenAI API key if you want to swap in an LLM later
- A project configured with ESM or `ts-node`/`tsx`
Install the packages:
```bash
npm install @langchain/langgraph @langchain/core zod
npm install -D typescript tsx @types/node
```
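The runnable examples below use top-level `await`, which requires ESM. If your project isn't configured for it yet, a minimal `package.json` along these lines works with `tsx` (the `chunk.ts` filename is a placeholder for wherever you put the tutorial code):

```json
{
  "type": "module",
  "scripts": {
    "chunk": "tsx chunk.ts"
  }
}
```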
## Step-by-Step
- Start with a small graph state that holds the raw document and the chunk list. For chunking, keep the state simple: input text in, chunks out.
```ts
import { Annotation, StateGraph, START, END } from "@langchain/langgraph";

const ChunkState = Annotation.Root({
  document: Annotation<string>,
  chunks: Annotation<string[]>({
    reducer: (_, update) => update,
    default: () => [],
  }),
});
```
- Add a node that splits text by paragraph boundaries first, then enforces a max size per chunk. This gives cleaner chunks than naive fixed-width slicing while staying deterministic.
```ts
// Split on blank lines first, then greedily pack paragraphs into chunks
// of at most maxChars. A single paragraph longer than maxChars falls
// back to a hard fixed-width split.
function splitDocument(document: string, maxChars = 1200): string[] {
  const paragraphs = document
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);

  const chunks: string[] = [];
  let current = "";

  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (candidate.length <= maxChars) {
      current = candidate;
      continue;
    }
    if (current) chunks.push(current);
    if (paragraph.length > maxChars) {
      for (let i = 0; i < paragraph.length; i += maxChars) {
        chunks.push(paragraph.slice(i, i + maxChars));
      }
      current = "";
      continue;
    }
    current = paragraph;
  }
  if (current) chunks.push(current);
  return chunks;
}
```
- Wire the splitter into a LangGraph node. The node reads `document` from state and returns `chunks`, which LangGraph merges back into the graph state.
```ts
const chunkNode = async (state: typeof ChunkState.State) => {
  const chunks = splitDocument(state.document, 1200);
  return { chunks };
};

const graph = new StateGraph(ChunkState)
  .addNode("chunk", chunkNode)
  .addEdge(START, "chunk")
  .addEdge("chunk", END)
  .compile();
```
- Run the graph against a real document string. In production this would come from S3, a database, or a file loader; here we keep it local so you can test immediately. Note the blank lines between sections: the splitter uses them as paragraph boundaries.
```ts
const documentText = `
Policy Summary

This policy covers accidental damage, theft, and fire.
Claims must be filed within thirty days of discovery.

Exclusions

We do not cover wear and tear, intentional damage, or fraud.
Supporting documentation is required for all claims.
`.trim();

const result = await graph.invoke({ document: documentText });

console.log(`Chunks produced: ${result.chunks.length}`);
result.chunks.forEach((chunk, index) => {
  console.log(`\n--- Chunk ${index + 1} ---`);
  console.log(chunk);
});
```
- If you want this to scale beyond simple splitting, add metadata per chunk before returning it. That lets downstream nodes track source offsets, page numbers, or section names without re-parsing the original document.
```ts
type ChunkRecord = {
  index: number;
  text: string;
  length: number;
};

function splitDocumentWithMetadata(document: string, maxChars = 1200): ChunkRecord[] {
  return splitDocument(document, maxChars).map((text, index) => ({
    index,
    text,
    length: text.length,
  }));
}
```
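If you switch the graph to the metadata version, widen the state to carry `ChunkRecord[]` instead of `string[]`. A minimal sketch reusing the pieces above (the `chunkRecords` field name is an illustrative choice, not a LangGraph convention):

```ts
// State variant that stores structured chunk records instead of raw strings.
const ChunkRecordState = Annotation.Root({
  document: Annotation<string>,
  chunkRecords: Annotation<ChunkRecord[]>({
    reducer: (_, update) => update,
    default: () => [],
  }),
});

// The node shape stays the same; only the returned field changes.
const chunkWithMetadataNode = async (state: typeof ChunkRecordState.State) => ({
  chunkRecords: splitDocumentWithMetadata(state.document, 1200),
});
```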
## Testing It
Run the script with `tsx` or your TypeScript runtime of choice and confirm that the output contains multiple chunks for longer input. Check that no chunk exceeds your configured size limit unless a single paragraph is already longer than that limit.
If you are feeding this into an embedding or extraction pipeline next, inspect whether paragraph boundaries are preserved cleanly. For legal or insurance documents, that matters more than raw token count because clause integrity affects downstream accuracy.
A good sanity check is to test three cases (a scripted version follows the list):
- short document: one chunk only
- medium document with several paragraphs: multiple clean chunks
- one very long paragraph: hard-split fallback works
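Here is one way to script those checks with Node's built-in `assert`; the sample inputs are placeholders for your own documents:

```ts
import assert from "node:assert";

// Case 1: a short document yields exactly one chunk.
assert.strictEqual(splitDocument("One short paragraph.").length, 1);

// Case 2: several paragraphs yield multiple chunks, none over the limit.
const medium = Array.from({ length: 10 }, (_, i) =>
  `Paragraph ${i} `.repeat(20)
).join("\n\n");
const mediumChunks = splitDocument(medium, 300);
assert.ok(mediumChunks.length > 1);
assert.ok(mediumChunks.every((c) => c.length <= 300));

// Case 3: one oversized paragraph triggers the hard-split fallback
// and no text is lost.
const longParagraph = "x".repeat(5000);
const hardChunks = splitDocument(longParagraph, 1200);
assert.ok(hardChunks.every((c) => c.length <= 1200));
assert.strictEqual(hardChunks.join(""), longParagraph);
```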
## Next Steps
- Add overlap between chunks so context carries across boundaries for retrieval use cases (a sketch follows this list).
- Replace the splitter with token-aware chunking using a tokenizer library instead of character counts.
- Add a second LangGraph node that summarizes each chunk before storing embeddings or sending them to an LLM.
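For the overlap idea, one simple approach is to prepend the tail of the previous chunk to each following chunk after splitting. A minimal sketch (the `withOverlap` helper and the 200-character overlap are illustrative choices):

```ts
// Carry the last overlapChars of the previous chunk into the next one
// so retrieval queries near a boundary still see surrounding context.
function withOverlap(chunks: string[], overlapChars = 200): string[] {
  return chunks.map((chunk, i) =>
    i === 0 ? chunk : `${chunks[i - 1].slice(-overlapChars)}\n\n${chunk}`
  );
}

const overlapped = withOverlap(splitDocument(documentText, 1200), 200);
```

Note that overlap pushes chunk sizes past `maxChars`, so budget for that in downstream token limits.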
## Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist + starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.