How to Fix 'context length exceeded' in LangGraph (Python)
What the error means
A `context length exceeded` error usually means your graph is sending too much text to the LLM in a single call. In LangGraph, this typically happens after several node executions when state keeps accumulating messages, tool outputs, or documents without trimming.
You’ll usually see an error shaped like this:
```
openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 128000 tokens. However, your messages resulted in 131245 tokens.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
```
The Most Common Cause
The #1 cause is appending the full conversation state on every node run instead of keeping only the latest relevant messages. In LangGraph, this is common when you use `MessagesState` or a custom state object and let `messages` grow without bound.
Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Keep all messages forever | Trim messages before model calls |
| Pass raw tool output back into state | Summarize or store only needed fields |
| Rebuild prompts from full history every time | Use a bounded window |
```python
# BROKEN: unbounded message growth
from langgraph.graph import StateGraph, MessagesState
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def assistant_node(state: MessagesState):
    # state["messages"] keeps growing forever
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

graph = StateGraph(MessagesState)
graph.add_node("assistant", assistant_node)
```
```python
# FIXED: trim before invoking the model
from langgraph.graph import StateGraph, MessagesState
from langchain_openai import ChatOpenAI
from langchain_core.messages import trim_messages

llm = ChatOpenAI(model="gpt-4o-mini")

def assistant_node(state: MessagesState):
    trimmed = trim_messages(
        state["messages"],
        max_tokens=6000,       # history budget, well under the model's limit
        strategy="last",       # keep the most recent messages
        token_counter=llm,     # count tokens with the model's own tokenizer
        include_system=True,   # don't trim away the system prompt
    )
    response = llm.invoke(trimmed)
    return {"messages": [response]}

graph = StateGraph(MessagesState)
graph.add_node("assistant", assistant_node)
```
If you are using tools, the same issue appears when tool responses are large. A single PDF extraction or database dump can blow up the prompt fast.
Other Possible Causes
1) Tool output is being stored in chat history
If a tool returns a huge blob and you append it to messages, every later call pays for it.
```python
# BAD: storing raw tool payload in messages
return {
    "messages": [
        ToolMessage(content=large_json_blob, tool_call_id=tool_call_id)
    ]
}
```
Use a summary or extract only the fields you need.
```python
# BETTER: store a compact result
return {
    "messages": [
        ToolMessage(
            content=f"Found {len(rows)} rows. Top match: {rows[0]['name']}",
            tool_call_id=tool_call_id,
        )
    ]
}
```
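If a later node genuinely needs the full payload, one option is to park it in a separate state field and keep only the summary in `messages`. A minimal sketch, assuming a custom state and a hypothetical `run_search` tool (the field names `tool_results` and the hard-coded `tool_call_id` are illustrative):

```python
from typing import Annotated, Any

from langchain_core.messages import ToolMessage
from langgraph.graph.message import add_messages
from typing_extensions import TypedDict

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]  # chat history stays small
    tool_results: dict[str, Any]             # large payloads live here instead

def search_node(state: AgentState):
    rows = run_search(state)  # hypothetical tool call returning many records
    return {
        # only a compact summary enters the prompt-visible history
        "messages": [ToolMessage(
            content=f"Found {len(rows)} rows. Top match: {rows[0]['name']}",
            tool_call_id="search_1",  # in a real tool node this comes from the model's tool call
        )],
        # the full result is still reachable from other nodes via state
        "tool_results": {"search_1": rows},
    }
```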
2) You are re-injecting retrieved documents every turn
A common RAG bug is adding all retrieved chunks into every prompt, then also keeping them in graph state.
```python
# BAD: stuffing all docs into prompt repeatedly
docs_text = "\n\n".join(doc.page_content for doc in docs)
prompt = f"Context:\n{docs_text}\n\nQuestion: {user_input}"
```
Instead, limit retrieval and compress long passages.
```python
# BETTER: cap retrieval and summarize long docs first
docs = retriever.invoke(user_input)[:3]
docs_text = "\n\n".join(doc.page_content[:1500] for doc in docs)
```
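If naive truncation loses too much, you can summarize oversized chunks before they reach the prompt. A rough sketch, assuming the `llm` and `docs` objects from the snippets above (the 2,000-character threshold is arbitrary):

```python
def compress_doc(text: str, llm, max_chars: int = 2000) -> str:
    # Short chunks pass through untouched; long ones get a one-paragraph summary.
    if len(text) <= max_chars:
        return text
    summary = llm.invoke(
        "Summarize the following passage in one short paragraph, "
        f"keeping names, numbers, and dates:\n\n{text}"
    )
    return summary.content

docs_text = "\n\n".join(compress_doc(doc.page_content, llm) for doc in docs)
```

This trades one extra model call per oversized chunk for a much smaller final prompt.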
3) Your reducer/merge logic duplicates messages
In LangGraph, bad state merging can duplicate history across branches. This often shows up after parallel nodes or conditional edges.
```python
# BAD: custom merge appends duplicates
def merge_state(left, right):
    return {"messages": left["messages"] + right["messages"]}
```

Use LangGraph’s built-in message handling (the `add_messages` reducer, shown below) instead of manual concatenation unless you really need custom logic.
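A minimal sketch of the built-in approach: annotate the `messages` field with `add_messages`, which merges updates by message ID instead of blindly concatenating lists.

```python
from typing import Annotated

from langgraph.graph.message import add_messages
from typing_extensions import TypedDict

class State(TypedDict):
    # add_messages merges updates by message ID, so parallel branches that
    # return overlapping histories are not naively concatenated
    messages: Annotated[list, add_messages]
```

`MessagesState` already wires in this reducer, so a custom merge function is only needed for non-message fields.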
4) System prompts and templates are too large
Sometimes the problem isn’t history. It’s a giant system prompt plus few-shot examples plus tool schemas.
```python
SYSTEM_PROMPT = open("huge_prompt.txt").read()
FEW_SHOT_EXAMPLES = open("examples.txt").read()
```
Keep prompts tight. Move policy text into smaller rules or external retrieval if it truly needs to be dynamic.
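It helps to know how many tokens the static scaffolding costs before any history is added. A quick check, assuming the `llm`, `SYSTEM_PROMPT`, and `FEW_SHOT_EXAMPLES` from the snippets above (the 4,000-token budget is an arbitrary example):

```python
static_tokens = llm.get_num_tokens(SYSTEM_PROMPT + FEW_SHOT_EXAMPLES)
print(f"Static prompt cost: {static_tokens} tokens")

# If the fixed overhead already eats a large share of the context window,
# trimming history alone won't save you.
assert static_tokens < 4000, "Static prompt is using too much of the context window"
```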
How to Debug It
- Print token growth per node
  - Log message count and approximate token count before each LLM call.
  - If it climbs every turn, you have an accumulation problem.

  ```python
  def debug_state(state, llm):
      print("messages:", len(state["messages"]))
      print("tokens:", llm.get_num_tokens_from_messages(state["messages"]))
  ```

- Identify which node triggers the spike
  - Add logging around each node invocation (see the sketch after this list).
  - The failing node is often not where the bad content originated; it’s where everything gets assembled.
- Inspect tool outputs and retrieved chunks
  - Look for huge JSON blobs, HTML pages, PDFs, or database dumps.
  - If one tool output is massive, truncate or summarize before storing it.
- Check whether state is duplicated across branches
  - Review reducers and conditional edges.
  - Parallel paths that both append the same history will double your context quickly.
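One way to find the offending node is to wrap each node function with a small logger before adding it to the graph. A sketch, assuming the `MessagesState`-style graph from earlier (the `with_token_logging` helper is made up for illustration):

```python
import functools

def with_token_logging(name, node_fn, llm):
    # Wraps a node so we can see how large the history is on entry.
    @functools.wraps(node_fn)
    def wrapped(state):
        msgs = state["messages"]
        print(f"[{name}] entering with {len(msgs)} messages, "
              f"~{llm.get_num_tokens_from_messages(msgs)} tokens")
        return node_fn(state)
    return wrapped

graph.add_node("assistant", with_token_logging("assistant", assistant_node, llm))
```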
Prevention
- Use bounded memory from day one:
  - `trim_messages(...)`
  - rolling windows
  - summaries for older turns
- Keep graph state small:
  - store IDs, not full payloads
  - persist large artifacts outside the conversation state
- Put token checks in CI or local tests:
  - simulate long conversations
  - fail builds when prompts exceed your target budget (see the sketch below)
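A token check can be as simple as a test that replays a long scripted conversation and asserts the trimmed prompt fits your budget. A rough sketch with pytest, assuming an OpenAI key is configured locally (the turn count and 6,000-token budget are arbitrary):

```python
from langchain_core.messages import AIMessage, HumanMessage, trim_messages
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def test_prompt_stays_under_budget():
    # Simulate a long back-and-forth conversation.
    history = []
    for i in range(200):
        history.append(HumanMessage(content=f"User turn {i}: " + "details " * 50))
        history.append(AIMessage(content=f"Assistant turn {i}: " + "details " * 50))

    trimmed = trim_messages(
        history,
        max_tokens=6000,
        strategy="last",
        token_counter=llm,
    )
    # The trimmed history should always fit the budget we designed for.
    assert llm.get_num_tokens_from_messages(trimmed) <= 6000
```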
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.