How to Fix 'context length exceeded when scaling' in LangGraph (Python)
What the error means
'context length exceeded when scaling' usually means your graph is sending too much conversation state to the LLM on each step. In LangGraph, this happens when messages keep accumulating across nodes, retries, or loops until a request exceeds the model's context window.
You’ll usually hit it in agent graphs, multi-step workflows, or recursive loops where every node appends to messages without trimming, summarizing, or checkpointing correctly.
The Most Common Cause
The #1 cause is unbounded message growth in StateGraph state. People wire nodes so each step returns the full messages list plus new content, and LangGraph keeps merging it until the prompt becomes too large.
Here’s the broken pattern versus the fixed one:
| Broken | Fixed |
|---|---|
| Returns full history every time | Returns only the delta or a trimmed state |
| Never prunes old messages | Uses trim_messages() or summary state |
| Lets loops re-send everything | Keeps a bounded working set |
# Broken: message list grows forever
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_core.messages import HumanMessage, AIMessage

# llm is an already-configured chat model (for example ChatOpenAI()), created elsewhere

class State(TypedDict):
    messages: Annotated[list, add_messages]

def chatbot(state: State):
    # BAD: sends entire accumulated history to the model every turn
    response = llm.invoke(state["messages"])
    return {"messages": [response]}  # still appends forever via add_messages

graph = StateGraph(State)
graph.add_node("chatbot", chatbot)
graph.add_edge(START, "chatbot")
graph.add_edge("chatbot", END)
app = graph.compile()
# Fixed: trim before invoking and keep only what you need
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_core.messages import trim_messages

# llm is the same pre-configured chat model as above

class State(TypedDict):
    messages: Annotated[list, add_messages]

def chatbot(state: State):
    # Keep only the most recent messages that fit inside the token budget
    trimmed = trim_messages(
        state["messages"],
        max_tokens=4000,
        strategy="last",
        token_counter=llm.get_num_tokens_from_messages,
    )
    response = llm.invoke(trimmed)
    return {"messages": [response]}

graph = StateGraph(State)
graph.add_node("chatbot", chatbot)
graph.add_edge(START, "chatbot")
graph.add_edge("chatbot", END)
app = graph.compile()
If you’re using an agent loop with tools, this gets worse because tool calls add more messages per iteration. The graph doesn’t know which messages are safe to drop unless you explicitly manage them.
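One way to manage that explicitly is a dedicated pruning step. The add_messages reducer understands RemoveMessage, so a node can delete older turns by id before the next model call. A minimal sketch, reusing State, graph, and the chatbot node from above; the keep-the-last-10 policy is just an example:

from langchain_core.messages import RemoveMessage

def prune_history(state: State):
    # Everything except the 10 most recent messages gets deleted from state;
    # add_messages treats RemoveMessage(id=...) as a deletion, not an append
    stale = state["messages"][:-10]
    return {"messages": [RemoveMessage(id=m.id) for m in stale]}

graph.add_node("prune", prune_history)
# Wire pruning in front of the model call (replacing the direct START -> chatbot edge)
graph.add_edge(START, "prune")
graph.add_edge("prune", "chatbot")

If tool calls are in play, prune an AIMessage that carries tool_calls and its matching ToolMessage together; leaving one without the other can make the next model call fail validation.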
Other Possible Causes
1) Recursive edges without a stop condition
A cycle like agent -> tool -> agent can keep resubmitting the same transcript.
# Bad: no termination guard
graph.add_edge("agent", "tool")
graph.add_edge("tool", "agent")
Fix it with a counter or conditional edge:
def should_continue(state):
    return "tool" if state["steps"] < 5 else END
graph.add_conditional_edges("agent", should_continue)
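For should_continue to work, something has to increment the counter. A minimal sketch, assuming the agent node owns it and llm is the same chat model as in the earlier examples; the field name steps and the limit of 5 are arbitrary:

from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    steps: int

def agent(state: AgentState):
    response = llm.invoke(state["messages"])
    # Bump the loop guard every time the agent runs; should_continue reads it
    return {"messages": [response], "steps": state.get("steps", 0) + 1}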
2) Passing raw documents into state instead of references
If you store full retrieval results in messages or shared state, every node sees them again.
# Bad: huge docs added to conversational state
return {"messages": [HumanMessage(content="\n".join(doc.page_content for doc in docs))]}
Better:
# Good: store compact references or summaries
return {
    "retrieved_doc_ids": [doc.metadata["id"] for doc in docs[:5]],
    "context_summary": summarize_docs(docs[:3]),
}
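Downstream nodes can then rebuild just the context they need from those references instead of dragging raw documents through every step. A sketch, where doc_store.get, summarize_docs, the state keys, and the excerpt limits are assumptions carried over from the snippet above:

from langchain_core.messages import HumanMessage

def generate_answer(state):
    # Rehydrate only what this step needs; full documents never enter messages
    docs = [doc_store.get(doc_id) for doc_id in state["retrieved_doc_ids"][:3]]
    prompt = (
        f"Context summary:\n{state['context_summary']}\n\n"
        "Relevant excerpts:\n"
        + "\n".join(d.page_content[:500] for d in docs)
    )
    response = llm.invoke([HumanMessage(content=prompt)])
    return {"messages": [response]}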
3) Tool outputs are too verbose
Some tools return HTML pages, logs, PDFs converted to text, or giant JSON blobs.
# Bad: raw tool payload goes straight back into the LLM context
tool_result = requests.get(url).text
return {"messages": [AIMessage(content=tool_result)]}
Trim at the source:
tool_result = requests.get(url).text[:2000]
return {"messages": [AIMessage(content=tool_result)]}
Or better, summarize before injecting into the next LLM call.
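A rough sketch of summarize-then-inject, assuming llm is the same chat model as earlier; the 2,000-character threshold, the input cap, and the prompt wording are all illustrative:

import requests
from langchain_core.messages import AIMessage, HumanMessage

def compress_tool_output(raw: str, max_chars: int = 2000) -> str:
    # Small payloads pass through untouched; big ones get one cheap summarization call
    if len(raw) <= max_chars:
        return raw
    summary = llm.invoke([
        HumanMessage(content=f"Summarize the key facts from this tool output:\n\n{raw[:20000]}")
    ])
    return summary.content

tool_result = compress_tool_output(requests.get(url).text)
return {"messages": [AIMessage(content=tool_result)]}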
4) Checkpointed state is replaying too much history
If you misuse a checkpointer like MemorySaver, SqliteSaver, or a custom persistence layer, you may be restoring the full message history on every run and then appending to it again.
# Example symptom: thread reloads full transcript every invocation
app.invoke(
    {"messages": [HumanMessage(content="continue")]},
    config={"configurable": {"thread_id": "abc123"}},
)
If your app resumes from a saved thread and also rehydrates prior messages manually, you’ll double-count context. Keep one source of truth for conversation state.
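The pattern that avoids double-counting: compile the graph with one checkpointer, pass only the new turn on each call, and let the saved thread supply everything else. A minimal sketch; MemorySaver and the thread id are just examples:

from langchain_core.messages import HumanMessage
from langgraph.checkpoint.memory import MemorySaver

checkpointer = MemorySaver()
app = graph.compile(checkpointer=checkpointer)
config = {"configurable": {"thread_id": "abc123"}}

# Send only the new message; the checkpointer already holds prior turns for this
# thread. Prepending the old transcript here would append it to state a second time.
app.invoke({"messages": [HumanMessage(content="continue")]}, config=config)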
How to Debug It
- Print token counts at each node
  - Log len(state["messages"])
  - Log estimated tokens before every llm.invoke(...)
  - If one node spikes hard, that's your culprit
- Inspect what each node returns
  - Look for nodes returning entire histories instead of deltas
  - Watch for accidental return state
  - In LangGraph reducers like add_messages, returning old messages again duplicates context
- Check for loops and retries
  - Review add_edge() and add_conditional_edges()
  - Confirm there's an exit path from cyclic flows
  - If using retries on failures, make sure failed attempts don't append full prompts again
- Trace tool payload size
  - Log raw tool outputs before they become messages
  - Watch for large JSON responses from APIs or database dumps
  - If output exceeds a few KB and isn't needed verbatim, summarize it first
A useful runtime check looks like this:
def log_state(state):
    total_chars = sum(len(m.content) for m in state["messages"])
    print(f"messages={len(state['messages'])} chars={total_chars}")
If that number climbs every step with no ceiling, you’ve found the scaling problem.
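Character counts are a decent proxy, but if you want actual token estimates, most LangChain chat models expose get_num_tokens_from_messages. A variant of the same check, assuming llm from the earlier examples:

def log_tokens(state):
    # Approximate tokens for the current history as the model would count them
    tokens = llm.get_num_tokens_from_messages(state["messages"])
    print(f"messages={len(state['messages'])} approx_tokens={tokens}")
    return {}  # no state update; can be wired in as a passthrough node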
Prevention
- Use bounded memory by default:
  - Keep only recent turns in active context.
  - Store long-term history outside the prompt and retrieve selectively.
- Separate working state from audit state:
  - Active LLM context should stay small.
  - Persist full transcripts in storage, not in every node input.
- Put token checks in CI (see the test sketch after this list):
  - Add tests that simulate long conversations.
  - Fail builds if any node crosses your target context budget.
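A rough pytest sketch of that budget check, assuming the chatbot node, llm, and trim settings from the fixed example above; in real CI you would usually point llm at a cheap or fake model, and the turn count and budget here are arbitrary:

from langchain_core.messages import AIMessage, HumanMessage, trim_messages

TOKEN_BUDGET = 4000

def test_chatbot_stays_within_context_budget():
    # Simulate a long prior conversation without running the whole graph
    history = []
    for i in range(100):
        history.append(HumanMessage(content=f"user turn {i}: " + "details " * 40))
        history.append(AIMessage(content=f"assistant turn {i}: " + "details " * 40))

    # 1) The node should return only a delta, never echo the transcript back
    update = chatbot({"messages": history})
    assert len(update["messages"]) < len(history)

    # 2) The window it builds for the model should respect the token budget
    window = trim_messages(
        history,
        max_tokens=TOKEN_BUDGET,
        strategy="last",
        token_counter=llm.get_num_tokens_from_messages,
    )
    assert llm.get_num_tokens_from_messages(window) <= TOKEN_BUDGET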
If you’re building agentic workflows in LangGraph Python, treat prompt size as a hard resource limit. Once you control message growth at the graph level, this error stops being mysterious and becomes just another capacity bug.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.