How to Fix 'token limit exceeded in production' in LangGraph (Python)

By Cyprian Aarons · Updated 2026-04-21

You’re hitting this because the graph is sending more messages to the LLM than the model’s context window can accept. In LangGraph, this usually shows up after a few tool calls, retries, or long conversation threads when state keeps growing and nothing trims it.

The failure is typically not in the model itself. It’s in how you’re carrying messages through the graph, especially when you keep appending without pruning or summarizing.

The Most Common Cause

The #1 cause is unbounded message accumulation in graph state.

In LangGraph, people often use MessagesState or a custom state with a messages list, then keep appending every turn. That works for a while, then production traffic pushes the thread over the model limit and you get errors like:

  • openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 128000 tokens..."}}
  • langchain_core.messages.ai.InvalidToolCall
  • langgraph.errors.GraphRecursionError in loops that keep adding context
  • ValueError: token limit exceeded

Broken vs fixed pattern

Broken pattern                         Fixed pattern
Keeps every message forever            Trims/summarizes old messages
Passes full history into every node    Passes only the relevant window
No token budget check                  Enforces a max context before each model call
# BROKEN: message history grows without bound

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import MessagesState
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def assistant_node(state: MessagesState):
    # Every turn includes the full history
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

builder = StateGraph(MessagesState)
builder.add_node("assistant", assistant_node)
builder.add_edge(START, "assistant")
builder.add_edge("assistant", END)

graph = builder.compile()

# FIXED: trim messages before calling the model

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import MessagesState
from langchain_core.messages import trim_messages
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def assistant_node(state: MessagesState):
    trimmed = trim_messages(
        state["messages"],
        max_tokens=6000,
        strategy="last",
        token_counter=llm.get_num_tokens_from_messages,
        include_system=True,  # keep the system prompt when trimming
        start_on="human",     # don't start on a dangling AI/tool message
    )
    response = llm.invoke(trimmed)
    return {"messages": [response]}

builder = StateGraph(MessagesState)
builder.add_node("assistant", assistant_node)
builder.add_edge(START, "assistant")
builder.add_edge("assistant", END)

graph = builder.compile()

If your app has tool usage, this gets worse. Tool outputs are often verbose JSON blobs, and those get appended right back into state.

Other Possible Causes

1) Tool output is too large

A single tool call returning a huge payload can blow up your context faster than chat history.

# Bad: dump full API response into messages
return {
    "messages": [
        {
            "role": "tool",
            "content": str(big_api_response),
        }
    ]
}

Fix it by storing only what the LLM needs.

# Better: summarize or extract fields
summary = {
    "status": big_api_response["status"],
    "top_results": big_api_response["results"][:5],
}
return {"messages": [{"role": "tool", "content": str(summary)}]}
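If you can't predict the shape of each tool's payload, a defensive variant is to hard-cap its serialized size before it ever enters state. This is a plain-Python sketch, not a LangGraph feature; `MAX_TOOL_CHARS` is an arbitrary budget you should tune to your model's window.

```python
import json

MAX_TOOL_CHARS = 2000  # arbitrary budget; tune to your model's context window

def clamp_tool_output(payload) -> str:
    """Serialize a tool result and hard-cap its size before it enters state."""
    text = payload if isinstance(payload, str) else json.dumps(payload, default=str)
    if len(text) <= MAX_TOOL_CHARS:
        return text
    # Keep the head and mark the cut so the model knows data was dropped
    return text[:MAX_TOOL_CHARS] + "... [truncated]"
```

Truncation loses information, so prefer field extraction when you know the schema; use the clamp as a backstop for tools you don't control.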

2) Recursive loops with no stop condition

If a node keeps routing back to itself or a tool loop keeps retrying, state grows until the model fails.

# Example of an unsafe loop route
def route(state):
    return "tool"  # never stops

Add explicit exit conditions and cap retries.

def route(state):
    if state.get("attempts", 0) >= 3:
        return END
    return "tool"
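The cap above only fires if something actually increments attempts. One way (a sketch: `call_tool` here is a stub standing in for your real tool logic) is to bump the counter in the tool node itself:

```python
def call_tool(state: dict) -> dict:
    # Stub standing in for a real tool invocation
    return {"role": "tool", "content": "ok"}

def tool_node(state: dict) -> dict:
    result = call_tool(state)
    # Returning the incremented counter lets the route's cap fire
    return {
        "messages": [result],
        "attempts": state.get("attempts", 0) + 1,
    }
```

Since `attempts` has no reducer, the returned value overwrites the old one, which is exactly what a counter needs.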

3) You’re using a model with a smaller context window than production traffic needs

This happens when dev uses one model and prod uses another deployment or provider alias.

llm = ChatOpenAI(model="gpt-4o-mini")  # fine in dev
# prod traffic + long threads => overflow

Check the actual deployed model limits and compare them to your average prompt size. Don’t assume aliases map to the same window everywhere.
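A cheap guard is to record the context window you believe each deployment has and fail fast before the call. The limits below are assumptions to verify against your provider's current docs, `my-prod-alias` is hypothetical, and `estimate_tokens` is a rough chars/4 heuristic, not a real tokenizer:

```python
# Assumed limits -- verify against your provider's current documentation
CONTEXT_LIMITS = {
    "gpt-4o-mini": 128_000,
    "my-prod-alias": 16_000,  # hypothetical smaller prod deployment
}

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text
    return len(text) // 4

def fits_context(model: str, prompt: str, reply_budget: int = 1_000) -> bool:
    """Pre-flight check: does prompt + expected reply fit the model's window?"""
    limit = CONTEXT_LIMITS.get(model, 8_000)  # pessimistic default for unknowns
    return estimate_tokens(prompt) + reply_budget <= limit
```

Wiring this check into CI against real prod traffic samples catches the dev/prod mismatch before users do.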

4) Memory/state includes raw documents or retrieval chunks

People often attach full RAG chunks to graph state instead of passing just citations or excerpts.

state["documents"] = retrieved_docs  # too much text retained across turns

Instead, pass only top-k short snippets and drop everything else after generation.
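A sketch of the snippet-only approach, in plain Python (the `page_content` field name mirrors LangChain's `Document`, but the objects here are plain dicts, and `k`/`max_chars` are arbitrary defaults):

```python
def to_snippets(docs: list[dict], k: int = 3, max_chars: int = 500) -> list[dict]:
    """Keep only the top-k docs, truncated, with a source tag instead of raw text."""
    snippets = []
    for doc in docs[:k]:
        snippets.append({
            "source": doc.get("source", "unknown"),
            "excerpt": doc["page_content"][:max_chars],
        })
    return snippets
```

After generation, drop even these from state; the answer plus source tags is usually all later turns need.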

How to Debug It

  1. Measure tokens at each node

    • Log input token count before every llm.invoke().
    • If you use LangChain models, call llm.get_num_tokens_from_messages(messages).
  2. Inspect graph state growth

    • Print len(state["messages"]) on every edge.
    • Look for tool outputs or retrieved docs being appended repeatedly.
  3. Reproduce with one thread ID

    • Run the same conversation thread locally until it fails.
    • If it only breaks after several turns, you have accumulation, not a single bad prompt.
  4. Check recursion and retry paths

    • Search for routes that can loop indefinitely.
    • Verify you’re not re-invoking nodes on validation errors without decrementing retries.

A practical debug hook looks like this:

def log_state(state, llm):
    msgs = state["messages"]
    print("message_count=", len(msgs))
    print("token_estimate=", llm.get_num_tokens_from_messages(msgs))

Run that before each model call. The first spike usually points to the real culprit.

Prevention

  • Trim messages at every LLM boundary using trim_messages() or a custom policy.
  • Keep tool outputs small; store raw payloads outside graph state if needed.
  • Put hard caps on retries, recursion depth, and retrieved document count.
  • Test with long-running threads in staging, not just single-turn prompts.
  • Add token budget logging per node so you catch growth before prod does.
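If `trim_messages()` doesn't fit your message shapes, a custom policy can be very small. This is a hedged sketch: it uses the same crude chars/4 token estimate instead of a real tokenizer, treats messages as plain strings, and assumes index 0 is the system prompt worth preserving.

```python
def enforce_budget(messages: list[str], max_tokens: int = 6_000) -> list[str]:
    """Drop oldest non-system messages until the estimated total fits the budget."""
    est = lambda m: len(m) // 4  # crude ~4 chars/token estimate
    kept = list(messages)
    # Always keep index 0 (assumed system prompt); drop from index 1 onward
    while len(kept) > 1 and sum(est(m) for m in kept) > max_tokens:
        kept.pop(1)
    return kept
```

The same shape works for message objects; swap the lambda for your model's real token counter in production.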


By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

