How to Fix 'token limit exceeded' in LangGraph (TypeScript)

By Cyprian Aarons · Updated 2026-04-21

What the error means

token limit exceeded in LangGraph usually means one of your graph nodes is sending too much conversation history or tool output to the LLM. In practice, it shows up when you keep appending messages to state without trimming, summarizing, or selecting only the relevant context.

The failure often happens inside a ChatOpenAI.invoke(...) call, a ToolNode, or right after a few graph loops when the messages array has grown too large for the model’s context window.

The Most Common Cause

The #1 cause is naive message accumulation in graph state. In LangGraph, it’s easy to keep pushing every user message, assistant reply, and tool result into messages, then pass the entire array back into the model on every step.

Here’s the broken pattern:

Broken: keep appending all messages forever. Fixed: trim or summarize before each model call.
// broken.ts
import { StateGraph, Annotation } from "@langchain/langgraph";
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

const State = Annotation.Root({
  messages: Annotation<any[]>({
    reducer: (left, right) => left.concat(right),
    default: () => [],
  }),
});

async function assistantNode(state: typeof State.State) {
  const response = await llm.invoke(state.messages); // eventually fails
  return { messages: [response] };
}
// fixed.ts
import { trimMessages } from "@langchain/core/messages";
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

async function assistantNode(state: { messages: any[] }) {
  const trimmed = await trimMessages(state.messages, {
    maxTokens: 6000,
    strategy: "last",
    tokenCounter: llm,
    includeSystem: true,
  });

  const response = await llm.invoke(trimmed);
  return { messages: [response] };
}

A reducer like concat is not the bug by itself. The bug is passing the full accumulated transcript into the model on every turn without any token budget control.
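If trimming drops too much signal, summarizing older turns is the usual alternative. Here is a minimal sketch of the bookkeeping only; the `summarize` callback is a stand-in for a real LLM summarization call, and the message shape is simplified:

```typescript
type Msg = { role: string; content: string };

// Collapse everything older than the last `keep` messages into a single
// summary message. `summarize` is a placeholder for an LLM call.
function compactHistory(
  messages: Msg[],
  keep: number,
  summarize: (older: Msg[]) => string
): Msg[] {
  if (messages.length <= keep) return messages;
  const older = messages.slice(0, messages.length - keep);
  const recent = messages.slice(messages.length - keep);
  return [{ role: "system", content: summarize(older) }, ...recent];
}

const history: Msg[] = [
  { role: "user", content: "hi" },
  { role: "assistant", content: "hello" },
  { role: "user", content: "what is LangGraph?" },
  { role: "assistant", content: "a graph runtime" },
];

// Keep the last 2 messages verbatim; fold the rest into one summary.
const compacted = compactHistory(
  history,
  2,
  (older) => `Summary of ${older.length} earlier messages.`
);
```

The same shape works inside a LangGraph node: run the compaction before the model call, not after, so the token budget is enforced on what actually gets sent.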

Other Possible Causes

1) Tool output is too large

A single tool can blow up your prompt budget fast, especially if it returns raw JSON, search results, or documents.

// bad
return {
  messages: [
    {
      role: "tool",
      content: JSON.stringify(largeApiResponse),
      tool_call_id: call.id,
    },
  ],
};

Fix it by truncating or extracting only what the next step needs.

// better
const compact = {
  count: largeApiResponse.items.length,
  topResults: largeApiResponse.items.slice(0, 3).map(x => ({
    id: x.id,
    title: x.title,
  })),
};

return {
  messages: [
    {
      role: "tool",
      content: JSON.stringify(compact),
      tool_call_id: call.id,
    },
  ],
};
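If you cannot shape every tool's payload individually, a generic guard that truncates any oversized tool result still prevents the worst case. A sketch, where the 4,000-character cap is arbitrary and should be tuned to your model budget:

```typescript
const MAX_TOOL_CHARS = 4000; // arbitrary cap; tune for your context window

// Truncate oversized tool output and mark the truncation so the model
// knows the payload is partial rather than silently incomplete.
function capToolContent(raw: string): string {
  if (raw.length <= MAX_TOOL_CHARS) return raw;
  return raw.slice(0, MAX_TOOL_CHARS) + "\n[...truncated]";
}
```

Apply it to the stringified result right before writing it into graph state, so nothing downstream can accidentally carry the full payload forward.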

2) You are looping through a graph cycle too many times

A conditional edge that keeps routing back to the same node can accumulate context until the model hits its limit.

graph.addConditionalEdges("assistant", (state) => {
  if (state.messages.length < 20) return "tools";
  return "__end__";
});

If your loop depends only on message count, it can still grow too much. Add an explicit iteration counter and stop early.

const State = Annotation.Root({
  messages: Annotation<any[]>({ reducer: (l, r) => l.concat(r), default: () => [] }),
  steps: Annotation<number>({ reducer: (_, r) => r, default: () => 0 }),
});
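With a `steps` counter in state, the assistant node bumps it on every pass and the router stops on a hard cap instead of message count alone. A sketch of both pieces as plain functions, assuming the `State` shape above; `MAX_STEPS` is an arbitrary cap:

```typescript
type AgentState = { messages: unknown[]; steps: number };

const MAX_STEPS = 8; // arbitrary hard cap; tune for your graph

// The assistant node returns the incremented counter alongside its reply,
// so the reducer `(_, r) => r` stores the new value.
function assistantUpdate(state: AgentState, reply: unknown) {
  return { messages: [reply], steps: state.steps + 1 };
}

// Router for addConditionalEdges: the step cap wins over message count.
function route(state: AgentState): "tools" | "__end__" {
  if (state.steps >= MAX_STEPS) return "__end__";
  return state.messages.length < 20 ? "tools" : "__end__";
}
```

The point of the explicit counter is that it bounds work even when individual messages are huge, which a message-count check alone does not.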

3) System prompt + retrieved docs are oversized

RAG graphs often stuff entire documents into state. That works until retrieval returns multiple long chunks and your system prompt is already huge.

const context = docs.map(d => d.pageContent).join("\n\n");
await llm.invoke([
  { role: "system", content: systemPrompt },
  { role: "user", content: `Answer using this context:\n${context}` },
]);

Use smaller chunks and cap retrieval results.

const topDocs = docs.slice(0, 3);
const context = topDocs.map(d => d.pageContent.slice(0, 1500)).join("\n\n");
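A slightly more robust variant budgets by total size rather than a fixed slice, so three unusually long chunks still cannot overflow. A sketch; the 4,500-character budget is arbitrary:

```typescript
// Add docs in relevance order until a character budget is exhausted.
function buildContext(
  docs: { pageContent: string }[],
  budget = 4500
): string {
  const parts: string[] = [];
  let used = 0;
  for (const d of docs) {
    const chunk = d.pageContent.slice(0, budget - used);
    if (chunk.length === 0) break; // budget spent, ignore remaining docs
    parts.push(chunk);
    used += chunk.length;
  }
  return parts.join("\n\n");
}
```

Character counts are only a proxy for tokens, but they are cheap, deterministic, and good enough to keep retrieval from dominating the prompt.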

4) Memory persistence is replaying old state

If you use a checkpointer or persistent store, old thread history can come back on every run. That makes the graph look fine locally and fail later in production.

const app = graph.compile({ checkpointer });
await app.invoke(input, { configurable: { thread_id } });

Make sure you are not rehydrating years of conversation into a single thread. Add retention rules and summarize older turns before persisting them.
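One simple retention rule is to cap what each thread is allowed to carry before it is written back to storage. This is a sketch of the rule itself, not a LangGraph API; the 40-message cap and the `StoredMsg` shape are illustrative:

```typescript
type StoredMsg = { role: string; content: string };

const MAX_PERSISTED = 40; // illustrative per-thread retention cap

// Drop the oldest messages before persisting the thread, so reloading
// it later can never rehydrate unbounded history into the graph.
function applyRetention(thread: StoredMsg[]): StoredMsg[] {
  return thread.length > MAX_PERSISTED
    ? thread.slice(thread.length - MAX_PERSISTED)
    : thread;
}
```

Combine this with summarization of the dropped turns if you need long-term recall: persist the summary, not the raw transcript.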

How to Debug It

  1. Print token usage at every LLM node

    • Log input size before each invoke.
    • If you use OpenAI-compatible models, inspect usage metadata after responses.
    • Watch for the node where input jumps sharply.
  2. Dump the exact payload being sent

    • Log state.messages length.
    • Log final rendered prompt text if you build one manually.
    • Look for giant tool outputs or repeated system instructions.
  3. Check whether a loop is expanding state

    • Count how many times each node runs per request.
    • If a node fires more than expected, inspect conditional edges.
    • Add a hard stop after N iterations while debugging.
  4. Test with trimmed state

    • Replace full history with last 5 turns.
    • Replace full documents with one short chunk.
    • If the error disappears, you’ve confirmed it’s context growth rather than a model/config issue.
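For the logging in steps 1 and 2, even a rough character-based estimate per node is enough to spot where input jumps. The 4-characters-per-token ratio below is a common heuristic for English text, not an exact count; use the model's tokenizer for real budgeting:

```typescript
type Msg = { role: string; content: string };

// Rough heuristic: ~4 characters per token for English text.
function estimateTokens(messages: Msg[]): number {
  const chars = messages.reduce((n, m) => n + m.content.length, 0);
  return Math.ceil(chars / 4);
}

// Call this at the top of every LLM node while debugging.
function logNodeInput(node: string, messages: Msg[]): void {
  console.log(
    `[${node}] ${messages.length} messages, ~${estimateTokens(messages)} tokens`
  );
}
```

Once the offending node is found, switch to the model's own usage metadata for precise numbers; the heuristic is only for locating the jump.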

Prevention

  • Use trimMessages(...) or summarization at every assistant boundary where history can grow.
  • Cap tool output before writing it back into graph state.
  • Add explicit limits for loop counts, retrieved documents, and persisted conversation length.
  • Treat messages as an input budget, not an append-only log.

If you want one rule to keep in mind: never let LangGraph decide how much context to send by accident. Make token budgeting part of your node design from day one.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

