LangGraph Tutorial (Python): chunking large documents for advanced developers
This tutorial shows how to build a LangGraph pipeline that splits large documents into token-aware chunks, processes each chunk independently, and keeps the workflow state clean for downstream retrieval or summarization. You need this when your inputs are too large for a single model call, or when you want chunk-level control for RAG, extraction, or compliance workflows.
What You'll Need
- •Python 3.10+
- •langgraph
- •langchain-core
- •langchain-text-splitters
- •tiktoken
- •Optional: python-dotenv if you want to load env vars from a .env file
- •An OpenAI API key, only if you plan to add model calls later; this tutorial does not require one
Install the packages:
pip install langgraph langchain-core langchain-text-splitters tiktoken
Step-by-Step
- •Start by defining a small state object that carries the document text, the generated chunks, and any derived metadata. Keep the graph state explicit; that makes debugging much easier when documents get large.
from typing import TypedDict, List

class ChunkState(TypedDict):
    document: str           # full source text to be split
    chunks: List[str]       # token-aware chunks produced by the splitter
    chunk_sizes: List[int]  # character length of each chunk, for quick inspection
- •Next, create a node that splits the document into token-aware chunks. For production use, prefer a splitter that respects separators and overlap so you do not break clauses or table rows in awkward places.
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_document(state: ChunkState) -> dict:
    # chunk_size and chunk_overlap are measured in tokens for the given model's encoding.
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        model_name="gpt-4o-mini",
        chunk_size=300,
        chunk_overlap=50,
    )
    chunks = splitter.split_text(state["document"])
    return {
        "chunks": chunks,
        # Note: these are character counts, not token counts.
        "chunk_sizes": [len(chunk) for chunk in chunks],
    }
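Since chunk_size above is a token budget, it can be worth verifying chunk lengths in tokens rather than characters. Below is a small helper sketch (not part of the pipeline); it assumes your tiktoken release recognizes the gpt-4o-mini model name, otherwise fall back to tiktoken.get_encoding("o200k_base").

from typing import List

import tiktoken

def token_lengths(chunks: List[str]) -> List[int]:
    # Count tokens with the same encoding the splitter derives from the model name.
    encoding = tiktoken.encoding_for_model("gpt-4o-mini")
    return [len(encoding.encode(chunk)) for chunk in chunks]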
- •Add a second node that transforms each chunk independently. In real systems this is where you would extract entities, generate embeddings, classify clauses, or send the chunk to an LLM.
def inspect_chunks(state: ChunkState) -> dict:
    annotated = []
    for i, chunk in enumerate(state["chunks"]):
        # Flatten newlines outside the f-string (backslashes inside f-string
        # expressions are a syntax error before Python 3.12).
        preview = chunk[:80].replace("\n", " ")
        annotated.append(f"Chunk {i + 1}: {len(chunk)} chars | {preview}")
    return {"chunks": annotated}
- •Wire the nodes together with LangGraph and compile the graph. This gives you a deterministic pipeline that can be reused across batch jobs or API handlers.
from langgraph.graph import StateGraph, START, END
graph = StateGraph(ChunkState)
graph.add_node("split_document", split_document)
graph.add_node("inspect_chunks", inspect_chunks)
graph.add_edge(START, "split_document")
graph.add_edge("split_document", "inspect_chunks")
graph.add_edge("inspect_chunks", END)
app = graph.compile()
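Before running it, you can sanity-check the wiring by rendering the compiled graph's structure. This is a quick sketch; it assumes a langgraph version whose compiled graph exposes get_graph() and whose graph object supports draw_mermaid(), which recent releases do.

# Print a Mermaid diagram of the compiled graph to confirm node order and edges.
print(app.get_graph().draw_mermaid())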
- •Run the graph against a large input and inspect the output state. This example uses a synthetic document so you can execute it as-is without external dependencies.
sample_text = "\n\n".join([
    "Section 1: Policy terms and conditions. " * 20,
    "Section 2: Claims handling requirements. " * 20,
    "Section 3: Exceptions and exclusions. " * 20,
])

result = app.invoke({
    "document": sample_text,
    "chunks": [],
    "chunk_sizes": [],
})

print("Total chunks:", len(result["chunks"]))
print("Chunk sizes:", result["chunk_sizes"])
print(result["chunks"][0])
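If you would rather watch the pipeline node by node than wait for the final state, you can stream updates instead of calling invoke. A minimal sketch, assuming the compiled graph's default "updates" stream mode:

# Stream per-node state updates as each node finishes.
for update in app.stream({"document": sample_text, "chunks": [], "chunk_sizes": []}):
    for node_name, node_output in update.items():
        print(node_name, "returned keys:", list(node_output.keys()))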
- •If you need chunk-level processing at scale, keep the splitting graph separate from downstream graphs. That lets you batch chunks into another LangGraph workflow later without re-reading or re-splitting the source document.
def persist_chunk_index(state: ChunkState) -> dict:
    # Build a lightweight chunk index; in a real system you would write this
    # to a database or object store rather than echoing it back into state.
    indexed = [
        {"chunk_id": i, "length": size}
        for i, size in enumerate(state["chunk_sizes"])
    ]
    return {"chunk_sizes": [item["length"] for item in indexed]}

graph2 = StateGraph(ChunkState)
graph2.add_node("split_document", split_document)
graph2.add_node("persist_chunk_index", persist_chunk_index)
graph2.add_edge(START, "split_document")
graph2.add_edge("split_document", "persist_chunk_index")
graph2.add_edge("persist_chunk_index", END)
app2 = graph2.compile()
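When you are ready to push those chunks into a downstream workflow, one option is LangGraph's Send API for map-style fan-out. The sketch below is illustrative rather than canonical: it assumes a recent langgraph release where Send lives in langgraph.types (older versions import it from langgraph.constants), and process_chunk is a hypothetical placeholder for your real per-chunk work such as embedding, extraction, or an LLM call.

import operator
from typing import Annotated, List, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.types import Send

class FanOutState(TypedDict):
    chunks: List[str]
    # The operator.add reducer merges results written by parallel node runs.
    results: Annotated[List[str], operator.add]

def route_chunks(state: FanOutState):
    # One Send per chunk: LangGraph invokes process_chunk once per payload, in parallel.
    return [Send("process_chunk", {"chunk": chunk}) for chunk in state["chunks"]]

def process_chunk(payload: dict) -> dict:
    # Placeholder per-chunk work; swap in embeddings, extraction, or an LLM call.
    return {"results": [payload["chunk"][:40]]}

fan = StateGraph(FanOutState)
fan.add_node("process_chunk", process_chunk)
fan.add_conditional_edges(START, route_chunks, ["process_chunk"])
fan.add_edge("process_chunk", END)
fan_app = fan.compile()

# Feed chunks produced by the splitting graph into the fan-out graph.
chunked = app2.invoke({"document": sample_text, "chunks": [], "chunk_sizes": []})
fanned = fan_app.invoke({"chunks": chunked["chunks"], "results": []})
print(len(fanned["results"]), "chunk results")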
Testing It
Run the script and confirm three things: the graph compiles, a long enough document yields more than one chunk, and each chunk stays within your target size range with the overlap applied. Keep in mind that chunk_size is a token budget while chunk_sizes records character counts, so compare like with like. If your chunks are too large, reduce chunk_size; if they are too fragmented, increase it or tune the separators in RecursiveCharacterTextSplitter. For regulated workloads like insurance claims or banking policies, also inspect random chunks manually to make sure section boundaries are preserved well enough for downstream retrieval.
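As a concrete check, you can turn those conditions into a small smoke test. The 2,000-character ceiling below is an assumption (a 300-token budget with generous headroom); adjust it to your own splitter settings.

# Smoke test: the graph compiles (app exists), splitting produces multiple chunks,
# and no chunk is empty or wildly oversized.
result = app.invoke({"document": sample_text, "chunks": [], "chunk_sizes": []})

assert len(result["chunks"]) > 1, "a long document should yield multiple chunks"
assert all(0 < size <= 2000 for size in result["chunk_sizes"]), "chunk size out of range"
print("Smoke test passed with", len(result["chunks"]), "chunks")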
Next Steps
- •Add an LLM node after splitting to summarize each chunk before aggregation.
- •Store {chunk_id, text, metadata} in a vector database for RAG.
- •Add conditional edges so oversized or malformed documents take a different path through the graph (see the sketch below).
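For that last point, here is a minimal sketch of conditional routing, reusing the nodes and imports from earlier in the tutorial. The 500,000-character threshold and the reject_document node are hypothetical; in practice the rejection branch might flag the document for manual review or hand it to a different splitter.

MAX_DOCUMENT_CHARS = 500_000  # assumed threshold; tune for your workloads

def route_by_size(state: ChunkState) -> str:
    # Oversized documents skip splitting and go down the rejection branch.
    return "reject_document" if len(state["document"]) > MAX_DOCUMENT_CHARS else "split_document"

def reject_document(state: ChunkState) -> dict:
    return {"chunks": ["Document exceeds size limit; route to manual review."], "chunk_sizes": []}

routed = StateGraph(ChunkState)
routed.add_node("split_document", split_document)
routed.add_node("inspect_chunks", inspect_chunks)
routed.add_node("reject_document", reject_document)
routed.add_conditional_edges(START, route_by_size, ["split_document", "reject_document"])
routed.add_edge("split_document", "inspect_chunks")
routed.add_edge("inspect_chunks", END)
routed.add_edge("reject_document", END)
routed_app = routed.compile()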
Keep learning
- •The complete AI Agents Roadmap — my full 8-step breakdown
- •Free: The AI Agent Starter Kit — PDF checklist + starter code
- •Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.