How to Fix 'token limit exceeded' in LangGraph (TypeScript)
What the error means
A 'token limit exceeded' error in LangGraph usually means one of your graph nodes is sending too much conversation history or tool output to the LLM. In practice, it shows up when you keep appending messages to state without trimming, summarizing, or selecting only the relevant context.
The failure often happens inside a ChatOpenAI.invoke(...) call, a ToolNode, or right after a few graph loops when the messages array has grown too large for the model’s context window.
The Most Common Cause
The #1 cause is naive message accumulation in graph state. In LangGraph, it’s easy to keep pushing every user message, assistant reply, and tool result into messages, then pass the entire array back into the model on every step.
Here’s the broken pattern:
| Broken | Fixed |
|---|---|
| Keep appending all messages forever | Trim or summarize before each model call |
// broken.ts
import { StateGraph, Annotation } from "@langchain/langgraph";
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

const State = Annotation.Root({
  messages: Annotation<any[]>({
    reducer: (left, right) => left.concat(right),
    default: () => [],
  }),
});

async function assistantNode(state: typeof State.State) {
  const response = await llm.invoke(state.messages); // eventually fails
  return { messages: [response] };
}
// fixed.ts
import { trimMessages } from "@langchain/core/messages";
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

async function assistantNode(state: { messages: any[] }) {
  // trimMessages is async; keep only the most recent messages that fit the budget
  const trimmed = await trimMessages(state.messages, {
    maxTokens: 6000,
    strategy: "last",
    tokenCounter: llm,
    includeSystem: true,
  });
  const response = await llm.invoke(trimmed);
  return { messages: [response] };
}
If you’re using a reducer like concat, that is not the bug by itself. The bug is passing the full accumulated transcript into the model every turn without any token budget control.
Other Possible Causes
1) Tool output is too large
A single tool can blow up your prompt budget fast, especially if it returns raw JSON, search results, or documents.
// bad
return {
  messages: [
    {
      role: "tool",
      content: JSON.stringify(largeApiResponse),
      tool_call_id: call.id,
    },
  ],
};
Fix it by truncating or extracting only what the next step needs.
// better
const compact = {
  count: largeApiResponse.items.length,
  topResults: largeApiResponse.items.slice(0, 3).map(x => ({
    id: x.id,
    title: x.title,
  })),
};

return {
  messages: [
    {
      role: "tool",
      content: JSON.stringify(compact),
      tool_call_id: call.id,
    },
  ],
};
2) You are looping through a graph cycle too many times
A conditional edge that keeps routing back to the same node can accumulate context until the model hits its limit.
graph.addConditionalEdges("assistant", (state) => {
  if (state.messages.length < 20) return "tools";
  return "__end__";
});
If your loop depends only on message count, it can still grow too much. Add an explicit iteration counter and stop early.
const State = Annotation.Root({
  messages: Annotation<any[]>({ reducer: (l, r) => l.concat(r), default: () => [] }),
  steps: Annotation<number>({ reducer: (_, r) => r, default: () => 0 }),
});
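To make the counter useful, the assistant node has to increment it and the router has to check it. A minimal sketch, assuming the State above and a hypothetical MAX_STEPS budget:

// sketch: stop the loop once the step budget is spent, whatever the message count
const MAX_STEPS = 6; // hypothetical budget, tune per graph

async function assistantNode(state: typeof State.State) {
  const response = await llm.invoke(state.messages);
  return { messages: [response], steps: state.steps + 1 };
}

graph.addConditionalEdges("assistant", (state: typeof State.State) => {
  if (state.steps >= MAX_STEPS) return "__end__";
  return state.messages.length < 20 ? "tools" : "__end__";
});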
3) System prompt + retrieved docs are oversized
RAG graphs often stuff entire documents into state. That works until retrieval returns multiple long chunks and your system prompt is already huge.
const context = docs.map(d => d.pageContent).join("\n\n");

await llm.invoke([
  { role: "system", content: systemPrompt },
  { role: "user", content: `Answer using this context:\n${context}` },
]);
Use smaller chunks and cap retrieval results.
const topDocs = docs.slice(0, 3);
const context = topDocs.map(d => d.pageContent.slice(0, 1500)).join("\n\n");
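If you would rather budget by tokens than characters, a rough estimate is usually enough. This is only a sketch; CONTEXT_TOKEN_BUDGET and the 4-characters-per-token heuristic are assumptions, not library values:

// sketch: stop adding chunks once a rough token budget is spent
const CONTEXT_TOKEN_BUDGET = 3000; // hypothetical budget

function buildContext(docs: { pageContent: string }[]): string {
  const parts: string[] = [];
  let used = 0;
  for (const d of docs) {
    const estimate = Math.ceil(d.pageContent.length / 4); // crude ~4 chars per token
    if (used + estimate > CONTEXT_TOKEN_BUDGET) break;
    parts.push(d.pageContent);
    used += estimate;
  }
  return parts.join("\n\n");
}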
4) Memory persistence is replaying old state
If you use a checkpointer or persistent store, old thread history can come back on every run. That makes the graph look fine locally and fail later in production.
const app = graph.compile({ checkpointer });
await app.invoke(input, { configurable: { thread_id } });
Make sure you are not rehydrating years of conversation into a single thread. Add retention rules and summarize older turns before persisting them.
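One way to enforce a retention rule is a small pruning node that deletes stale turns before the thread is checkpointed again. A minimal sketch, assuming your messages channel uses LangGraph's MessagesAnnotation (whose reducer understands RemoveMessage), that every message carries an id, and a hypothetical KEEP_LAST budget; you could also summarize the dropped turns into one message instead of discarding them:

// sketch: keep only the most recent turns in the persisted thread
import { RemoveMessage } from "@langchain/core/messages";
import { MessagesAnnotation } from "@langchain/langgraph";

const KEEP_LAST = 10; // hypothetical retention rule

function pruneHistory(state: typeof MessagesAnnotation.State) {
  const stale = state.messages.slice(0, -KEEP_LAST);
  return { messages: stale.map((m) => new RemoveMessage({ id: m.id! })) };
}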
How to Debug It
- Print token usage at every LLM node
  - Log input size before each invoke (see the sketch after this list).
  - If you use OpenAI-compatible models, inspect usage metadata after responses.
  - Watch for the node where input jumps sharply.
- Dump the exact payload being sent
  - Log state.messages length.
  - Log the final rendered prompt text if you build one manually.
  - Look for giant tool outputs or repeated system instructions.
- Check whether a loop is expanding state
  - Count how many times each node runs per request.
  - If a node fires more than expected, inspect conditional edges.
  - Add a hard stop after N iterations while debugging.
- Test with trimmed state
  - Replace full history with the last 5 turns.
  - Replace full documents with one short chunk.
  - If the error disappears, you've confirmed it's context growth rather than a model/config issue.
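For the logging step, a thin wrapper around the model call is usually all you need. A minimal sketch, assuming an llm like the ChatOpenAI instance above; usage_metadata is only populated when the provider reports token usage:

// sketch: log context size before the call and token usage after it
async function invokeWithLogging(messages: any[]) {
  console.log(`[assistant] sending ${messages.length} messages`);
  const response = await llm.invoke(messages);
  // usage_metadata has input_tokens, output_tokens, total_tokens when available
  console.log("[assistant] usage:", response.usage_metadata);
  return response;
}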
Prevention
- Use trimMessages(...) or summarization at every assistant boundary where history can grow.
- Cap tool output before writing it back into graph state.
- Add explicit limits for loop counts, retrieved documents, and persisted conversation length.
- Treat messages as an input budget, not an append-only log.
If you want one rule to keep in mind: never let LangGraph decide how much context to send by accident. Make token budgeting part of your node design from day one.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit