LangGraph Tutorial (TypeScript): chunking large documents for advanced developers

By Cyprian Aarons · Updated 2026-04-22

This tutorial shows how to build a LangGraph workflow in TypeScript that takes a large document, splits it into token-safe chunks, and routes those chunks through a graph for downstream processing. You need this when your source text is too large for a single model call, or when you want deterministic chunking before extraction, summarization, or retrieval.

What You'll Need

  • Node.js 18+
  • TypeScript 5+
  • npm or pnpm
  • An OpenAI API key
  • Packages:
    • @langchain/core
    • @langchain/openai
    • @langchain/textsplitters
    • @langchain/langgraph
    • ts-node or tsx for local execution
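
Assuming npm, the dependencies install in one step (swap in pnpm add if you use pnpm):

npm install @langchain/core @langchain/openai @langchain/textsplitters @langchain/langgraph
npm install -D typescript tsx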

Step-by-Step

  1. Start with a graph state that can carry the original document, the chunk list, and the current chunk index. Keep the state explicit; if you hide chunk metadata in closures, debugging becomes painful once you start branching or retrying nodes.
import { StateGraph, START, END, Annotation } from "@langchain/langgraph";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const GraphState = Annotation.Root({
  document: Annotation<string>(),
  chunks: Annotation<string[]>({
    default: () => [],
    reducer: (_, next) => next,
  }),
  currentChunkIndex: Annotation<number>({
    default: () => 0,
    reducer: (_, next) => next,
  }),
});

type GraphStateType = typeof GraphState.State;
  2. Add a chunking node that uses a real text splitter. For large documents, bounded, overlap-aware splitting is safer than naive paragraph splitting because it keeps each chunk within model limits while preserving enough context for downstream extraction. Note that RecursiveCharacterTextSplitter measures chunkSize in characters; a token-based variant follows the code below.
const splitDocument = async (state: GraphStateType) => {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 150,
  });

  const chunks = await splitter.splitText(state.document);

  return {
    chunks,
    currentChunkIndex: 0,
  };
};
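If you need hard token guarantees, @langchain/textsplitters also exports TokenTextSplitter, which counts tokens with js-tiktoken. A minimal sketch of a token-based variant (the encoding name and the 512/64 sizes are illustrative assumptions, not recommendations):
import { TokenTextSplitter } from "@langchain/textsplitters";

// Token-counting variant of splitDocument; chunkSize and chunkOverlap are
// measured in tokens here, not characters.
const splitDocumentByTokens = async (state: GraphStateType) => {
  const splitter = new TokenTextSplitter({
    encodingName: "cl100k_base",
    chunkSize: 512,
    chunkOverlap: 64,
  });

  const chunks = await splitter.splitText(state.document);
  return { chunks, currentChunkIndex: 0 };
};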
  3. Add a processing node that handles one chunk at a time. In production you usually extract structured data, classify content, or summarize each chunk before aggregating results; here we keep it simple and log the active chunk so the control flow is easy to inspect.
const processChunk = async (state: GraphStateType) => {
  const chunk = state.chunks[state.currentChunkIndex];

  console.log(`Processing chunk ${state.currentChunkIndex + 1}/${state.chunks.length}`);
  console.log(chunk.slice(0, 120).replace(/\n/g, " ") + "...");

  return {};
};
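For the production shape mentioned above, the node would call a model per chunk instead of just logging. A minimal sketch with ChatOpenAI (the model name and prompt are placeholder assumptions; it expects OPENAI_API_KEY in the environment):
import { ChatOpenAI } from "@langchain/openai";

const model = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

// Variant of processChunk that summarizes the active chunk with an LLM call.
const summarizeChunk = async (state: GraphStateType) => {
  const chunk = state.chunks[state.currentChunkIndex];
  const response = await model.invoke([
    ["system", "Summarize the following policy excerpt in two sentences."],
    ["user", chunk],
  ]);
  console.log(`Summary ${state.currentChunkIndex + 1}:`, response.content);
  return {};
};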
  4. Add routing logic so the graph loops over every chunk and then exits cleanly. This is the part that makes LangGraph useful for document pipelines: you keep orchestration explicit instead of burying loops inside application code.
const routeNext = (state: GraphStateType) => {
  // advanceIndex has already moved the pointer, so compare the current
  // index directly; adding one here would skip the final chunk.
  if (state.currentChunkIndex < state.chunks.length) {
    return "processChunk";
  }
  return END;
};

const advanceIndex = (state: GraphStateType) => ({
  currentChunkIndex: state.currentChunkIndex + 1,
});
  5. Wire everything together and run it against a large sample document. The graph first splits once, then iterates through each chunk deterministically.
const graph = new StateGraph(GraphState)
  .addNode("splitDocument", splitDocument)
  .addNode("processChunk", processChunk)
  .addNode("advanceIndex", advanceIndex)
  .addEdge(START, "splitDocument")
  .addEdge("splitDocument", "processChunk")
  .addEdge("processChunk", "advanceIndex")
  .addConditionalEdges("advanceIndex", routeNext);

const app = graph.compile();

const document = `
# Policy Overview

This is a long policy document intended to demonstrate chunking.
It contains multiple sections, examples, and repeated paragraphs so that the
text splitter produces several chunks for downstream processing.

${"Additional policy details.\n".repeat(200)}
`;

const result = await app.invoke({ document });
console.log("Done:", result.currentChunkIndex);

Testing It

Run the script with tsx, or compile it with tsc and run the output with Node. Because the script uses top-level await, the project must run as an ES module ("type": "module" in package.json). You should see logs for each processed chunk in order, and the final currentChunkIndex should equal the total number of chunks, since advanceIndex increments once more after the last chunk is processed.
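
For example, assuming the script is saved as chunking.ts (the filename is arbitrary):
npx tsx chunking.ts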

To verify boundary behavior, shrink chunkSize to something like 300 and confirm that more chunks are produced. Then increase overlap and check that adjacent chunks share some repeated text near their boundaries.
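
One way to eyeball the overlap without running the whole graph is a standalone probe over the same document string. It reuses the RecursiveCharacterTextSplitter import from the script; the 40-character prefix test is a rough heuristic, since exact boundaries depend on which separators the splitter picks:
const probeSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 300,
  chunkOverlap: 80,
});
const probeChunks = await probeSplitter.splitText(document);

for (let i = 1; i < probeChunks.length; i++) {
  // With overlap enabled, the head of each chunk usually reappears near the
  // tail of the previous chunk.
  const head = probeChunks[i].slice(0, 40);
  console.log(i, probeChunks[i - 1].includes(head) ? "shares boundary text" : "no shared prefix");
}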

If you plan to feed each chunk into an LLM call later, test with real documents that include tables, bullet lists, and long paragraphs. Those are the cases where bad splitting shows up fast.

Next Steps

  • Add an LLM node with ChatOpenAI to summarize or extract fields from each chunk.
  • Persist intermediate outputs per chunk so you can retry failed items without rerunning the whole document.
  • Replace simple sequential processing with parallel fan-out if your downstream step is stateless (see the sketch after this list).
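
For the fan-out idea, LangGraph's Send API lets a conditional edge dispatch one branch per chunk. This is a sketch under assumptions, not a drop-in change: it presumes you rewire the graph to fan out directly from splitDocument, drop the advanceIndex loop, and confirm your reducers tolerate parallel writes.
import { Send } from "@langchain/langgraph";

// One Send per chunk: each carries the full state plus the index that
// branch should process, and LangGraph runs the branches in parallel.
const fanOut = (state: GraphStateType) =>
  state.chunks.map(
    (_chunk, i) => new Send("processChunk", { ...state, currentChunkIndex: i })
  );

// Wiring sketch: graph.addConditionalEdges("splitDocument", fanOut) replaces
// the splitDocument -> processChunk -> advanceIndex loop.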

By Cyprian Aarons, AI Consultant at Topiax.