How to Fix 'token limit exceeded in production' in LangGraph (Python)
You’re hitting this because the graph is sending more messages to the LLM than the model’s context window can accept. In LangGraph, this usually shows up after a few tool calls, retries, or long conversation threads when state keeps growing and nothing trims it.
The failure is typically not in the model itself. It’s in how you’re carrying messages through the graph, especially when you keep appending without pruning or summarizing.
The Most Common Cause
The #1 cause is unbounded message accumulation in graph state.
In LangGraph, people often use `MessagesState` or a custom state with a `messages` list, then keep appending every turn. That works for a while, until production traffic pushes a thread over the model limit and you get errors like:
- `openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 128000 tokens..."}}`
- `langchain_core.messages.ai.InvalidToolCall`
- `langgraph.errors.GraphRecursionError` in loops that keep adding context
- `ValueError: token limit exceeded`
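As a stopgap while you fix the root cause, you can at least catch the overflow and fail with useful context instead of a raw 400. A minimal sketch assuming the OpenAI SDK (`safe_invoke` is a hypothetical wrapper, not a LangChain API):

# Hypothetical stopgap: surface context overflows with the thread size attached
import openai

def safe_invoke(llm, messages):
    try:
        return llm.invoke(messages)
    except openai.BadRequestError as e:
        if "maximum context length" in str(e):
            raise RuntimeError(
                f"Context overflow: thread has {len(messages)} messages"
            ) from e
        raise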
Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Keeps every message forever | Trims/summarizes old messages |
| Passes full history into every node | Passes only relevant window |
| No token budget check | Enforces max context before model call |
# BROKEN: message history grows without bound
from langgraph.graph import StateGraph, START, END, MessagesState
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def assistant_node(state: MessagesState):
    # Every turn includes the full history
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

builder = StateGraph(MessagesState)
builder.add_node("assistant", assistant_node)
builder.add_edge(START, "assistant")
builder.add_edge("assistant", END)
graph = builder.compile()
# FIXED: trim messages before calling the model
from langgraph.graph import StateGraph, START, END, MessagesState
from langchain_core.messages import trim_messages
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def assistant_node(state: MessagesState):
    # Keep only the most recent messages that fit a 6k-token budget
    trimmed = trim_messages(
        state["messages"],
        max_tokens=6000,
        strategy="last",
        token_counter=llm.get_num_tokens_from_messages,
        include_system=True,  # never trim away the system prompt
        start_on="human",     # don't start on a dangling AI/tool message
    )
    response = llm.invoke(trimmed)
    return {"messages": [response]}

builder = StateGraph(MessagesState)
builder.add_node("assistant", assistant_node)
builder.add_edge(START, "assistant")
builder.add_edge("assistant", END)
graph = builder.compile()
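If older turns still matter, summarizing is a common alternative to trimming: fold old messages into a running summary and delete the originals with `RemoveMessage`. A minimal sketch reusing the `llm` above (the 6-message threshold and the prompt wording are illustrative choices, not library defaults):

# ALTERNATIVE: summarize old turns instead of dropping them
from langchain_core.messages import HumanMessage, RemoveMessage

class SummaryState(MessagesState):
    summary: str  # running summary carried alongside messages

def summarize_node(state: SummaryState):
    msgs = state["messages"]
    if len(msgs) <= 6:
        return {}  # nothing to compress yet
    response = llm.invoke(
        msgs + [HumanMessage(content="Summarize the conversation above.")]
    )
    # Keep the last two messages; fold the rest into the summary
    return {
        "summary": response.content,
        "messages": [RemoveMessage(id=m.id) for m in msgs[:-2]],
    }

The assistant node then prepends the summary (for example as a system message) before calling the model, so context stays bounded without losing the thread.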
If your app has tool usage, this gets worse. Tool outputs are often verbose JSON blobs, and those get appended right back into state.
Other Possible Causes
1) Tool output is too large
A single tool call returning a huge payload can blow up your context faster than chat history.
# Bad: dump the full API response into messages
return {
    "messages": [
        {
            "role": "tool",
            "content": str(big_api_response),  # entire payload lands in context
        }
    ]
}
Fix it by storing only what the LLM needs.
# Better: summarize or extract only the fields the LLM needs
summary = {
    "status": big_api_response["status"],
    "top_results": big_api_response["results"][:5],
}
return {"messages": [{"role": "tool", "content": str(summary)}]}
2) Recursive loops with no stop condition
If a node keeps routing back to itself or a tool loop keeps retrying, state grows until the model fails.
# Example of an unsafe loop route
def route(state):
    return "tool"  # never stops
Add explicit exit conditions and cap retries.
def route(state):
    if state.get("attempts", 0) >= 3:
        return END
    return "tool"
3) You’re using a model with a smaller context window than production traffic needs
This happens when dev uses one model and prod uses another deployment or provider alias.
llm = ChatOpenAI(model="gpt-4o-mini") # fine in dev
# prod traffic + long threads => overflow
Check the actual deployed model limits and compare them to your average prompt size. Don’t assume aliases map to the same window everywhere.
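A cheap guard is to compare prompt size against the window of the model you actually deployed. A minimal sketch (`CONTEXT_WINDOWS` is a hand-maintained example table; verify the numbers against your provider's docs):

# Sketch: fail fast when the prompt won't fit the deployed model
CONTEXT_WINDOWS = {
    "gpt-4o-mini": 128_000,  # example values; confirm with your provider
    "gpt-3.5-turbo": 16_385,
}

def check_budget(llm, messages, model_name, headroom=1_000):
    window = CONTEXT_WINDOWS[model_name]
    tokens = llm.get_num_tokens_from_messages(messages)
    if tokens + headroom > window:
        raise ValueError(
            f"{tokens} tokens + {headroom} headroom exceeds "
            f"{model_name}'s {window}-token window"
        )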
4) Memory/state includes raw documents or retrieval chunks
People often attach full RAG chunks to graph state instead of passing just citations or excerpts.
state["documents"] = retrieved_docs # too much text retained across turns
Instead, pass only top-k short snippets and drop everything else after generation.
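For example, store only short, source-tagged excerpts and let the full documents fall out of scope after generation (the `snippets` field and 500-character cutoff are illustrative choices):

# Sketch: keep short excerpts in state instead of whole documents
def retrieve_node(state):
    docs = retriever.invoke(state["question"])  # any LangChain retriever
    snippets = [
        {"source": d.metadata.get("source"), "text": d.page_content[:500]}
        for d in docs[:3]
    ]
    return {"snippets": snippets}  # full docs never enter graph state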
How to Debug It
- Measure tokens at each node
  - Log input token count before every `llm.invoke()`.
  - If you use LangChain models, call `llm.get_num_tokens_from_messages(messages)`.
- Inspect graph state growth
  - Print `len(state["messages"])` on every edge.
  - Look for tool outputs or retrieved docs being appended repeatedly.
- Reproduce with one thread ID
  - Run the same conversation thread locally until it fails.
  - If it only breaks after several turns, you have accumulation, not a single bad prompt.
- Check recursion and retry paths
  - Search for routes that can loop indefinitely.
  - Verify you're not re-invoking nodes on validation errors without updating the retry counter.
A practical debug hook looks like this:
def log_state(state, llm):
    msgs = state["messages"]
    print("message_count=", len(msgs))
    print("token_estimate=", llm.get_num_tokens_from_messages(msgs))
Run that before each model call. The first spike usually points to the real culprit.
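For example, wired into the assistant node from the first example:

def assistant_node(state: MessagesState):
    log_state(state, llm)  # spot growth before it hits the model
    response = llm.invoke(state["messages"])
    return {"messages": [response]}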
Prevention
- Trim messages at every LLM boundary using `trim_messages()` or a custom policy.
- Keep tool outputs small; store raw payloads outside graph state if needed.
- Put hard caps on retries, recursion depth, and retrieved document count (see the recursion-limit sketch after this list).
- Test with long-running threads in staging, not just single-turn prompts.
- Add token budget logging per node so you catch growth before prod does.
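For the recursion cap, LangGraph has a built-in per-run limit: set `recursion_limit` in the invocation config and the run raises `GraphRecursionError` once exceeded, rather than looping until the context overflows:

# Cap graph steps per run; exceeding it raises GraphRecursionError
result = graph.invoke(
    {"messages": [{"role": "user", "content": "hi"}]},
    config={"recursion_limit": 15},  # tighter than the default of 25
)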
Keep learning
- The complete AI Agents Roadmap - my full 8-step breakdown
- Free: The AI Agent Starter Kit - PDF checklist + starter code
- Work with me - I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.