How to Fix 'context length exceeded' in LangGraph (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: context-length-exceeded, langgraph, python

What the error means

A context_length_exceeded error means your graph is sending more tokens to the LLM in a single call than the model's context window allows. In LangGraph, this typically happens after several node executions, when state keeps accumulating messages, tool outputs, or documents without trimming.

You’ll usually see an error shaped like this:

openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 128000 tokens. However, your messages resulted in 131245 tokens.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}

The Most Common Cause

The #1 cause is appending full conversation state on every node run instead of keeping only the latest relevant messages. In LangGraph, this happens a lot when you use MessagesState or a custom state object and keep extending messages forever.

Broken vs fixed pattern

Broken pattern → fixed pattern:

  • Keep all messages forever → trim messages before model calls
  • Pass raw tool output back into state → summarize or store only needed fields
  • Rebuild prompts from full history every time → use a bounded window
# BROKEN: unbounded message growth
from langgraph.graph import StateGraph, MessagesState
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def assistant_node(state: MessagesState):
    # state["messages"] keeps growing forever
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

graph = StateGraph(MessagesState)
graph.add_node("assistant", assistant_node)
# FIXED: trim before invoking the model
from langgraph.graph import StateGraph, MessagesState
from langchain_openai import ChatOpenAI
from langchain_core.messages import trim_messages

llm = ChatOpenAI(model="gpt-4o-mini")

def assistant_node(state: MessagesState):
    trimmed = trim_messages(
        state["messages"],
        max_tokens=6000,
        strategy="last",
        token_counter=llm,
    )
    response = llm.invoke(trimmed)
    return {"messages": [response]}

graph = StateGraph(MessagesState)
graph.add_node("assistant", assistant_node)

If you are using tools, the same issue appears when tool responses are large. A single PDF extraction or database dump can blow up the prompt fast.

Other Possible Causes

1) Tool output is being stored in chat history

If a tool returns a huge blob and you append it to messages, every later call pays for it.

# BAD: storing raw tool payload in messages
return {
    "messages": [
        ToolMessage(content=large_json_blob, tool_call_id=tool_call_id)
    ]
}

Use a summary or extract only the fields you need.

# BETTER: store compact result
return {
    "messages": [
        ToolMessage(content=f"Found {len(rows)} rows. Top match: {rows[0]['name']}", tool_call_id=tool_call_id)
    ]
}
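
If a later node genuinely needs the full payload, keep it in graph state but outside the message list. A minimal sketch, where the artifacts key is an illustrative name (not a LangGraph convention) and run_query stands in for your real tool:

# Sketch: full payload lives in state, only a summary enters chat history.
# "artifacts" and run_query are illustrative placeholders.
from langgraph.graph import MessagesState
from langchain_core.messages import ToolMessage

class AgentState(MessagesState):
    artifacts: dict  # large tool payloads, keyed by tool_call_id

def tool_node(state: AgentState):
    tool_call = state["messages"][-1].tool_calls[0]  # the AI message that requested the tool
    rows = run_query(tool_call["args"])              # placeholder for your real tool
    return {
        "messages": [ToolMessage(
            content=f"Found {len(rows)} rows. Top match: {rows[0]['name']}",
            tool_call_id=tool_call["id"],
        )],
        # Full result stays out of the prompt. Note: without a custom
        # reducer, each update replaces the artifacts dict wholesale.
        "artifacts": {tool_call["id"]: rows},
    }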

2) You are re-injecting retrieved documents every turn

A common RAG bug is adding all retrieved chunks into every prompt, then also keeping them in graph state.

# BAD: stuffing all docs into prompt repeatedly
docs_text = "\n\n".join(doc.page_content for doc in docs)
prompt = f"Context:\n{docs_text}\n\nQuestion: {user_input}"

Instead, limit retrieval and compress long passages.

# BETTER: cap retrieval and summarize long docs first
docs = retriever.invoke(user_input)[:3]
docs_text = "\n\n".join(doc.page_content[:1500] for doc in docs)

3) Your reducer/merge logic duplicates messages

In LangGraph, bad state merging can duplicate history across branches. This often shows up after parallel nodes or conditional edges.

# BAD: custom merge appends duplicates
def merge_state(left, right):
    return {"messages": left["messages"] + right["messages"]}

Use LangGraph’s message handling patterns instead of manual concatenation unless you really need custom logic.
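
For standard chat history, the built-in add_messages reducer (what MessagesState uses under the hood) merges updates by message ID instead of blindly concatenating, so parallel branches don't double your history:

# Declare the reducer once on the state; nodes just return new messages.
from typing import Annotated
from typing_extensions import TypedDict
from langgraph.graph.message import add_messages

class State(TypedDict):
    messages: Annotated[list, add_messages]

# add_messages upserts by message ID, so the same message applied by
# two branches lands in state once instead of twice.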

4) System prompts and templates are too large

Sometimes the problem isn’t history. It’s a giant system prompt plus few-shot examples plus tool schemas.

SYSTEM_PROMPT = open("huge_prompt.txt").read()
FEW_SHOT_EXAMPLES = open("examples.txt").read()

Keep prompts tight. Move policy text into smaller rules or external retrieval if it truly needs to be dynamic.
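
A cheap guard is to measure the static pieces once at startup and fail fast. A minimal sketch, reusing SYSTEM_PROMPT and FEW_SHOT_EXAMPLES from above with an illustrative 4,000-token budget:

# Fail fast if the static prompt pieces already eat the budget.
# The 4000-token budget is an illustrative assumption.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
STATIC_BUDGET = 4000

static_tokens = llm.get_num_tokens(SYSTEM_PROMPT + FEW_SHOT_EXAMPLES)
if static_tokens > STATIC_BUDGET:
    raise RuntimeError(
        f"Static prompt is {static_tokens} tokens; budget is {STATIC_BUDGET}"
    )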

How to Debug It

  1. Print token growth per node

    • Log message count and approximate token count before each LLM call.
    • If it climbs every turn, you have an accumulation problem.

def debug_state(state, llm):
    print("messages:", len(state["messages"]))
    print("tokens:", llm.get_num_tokens_from_messages(state["messages"]))

  2. Identify which node triggers the spike

    • Add logging around each node invocation (see the sketch after this list).
    • The failing node is often not where the bad content originated; it’s where everything gets assembled.
  3. Inspect tool outputs and retrieved chunks

    • Look for huge JSON blobs, HTML pages, PDFs, or database dumps.
    • If one tool output is massive, truncate or summarize before storing it.
  4. Check whether state is duplicated across branches

    • Review reducers and conditional edges.
    • Parallel paths that both append the same history will double your context quickly.
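
A minimal sketch combining steps 1 and 2, reusing graph, assistant_node, and llm from the earlier examples. with_token_logging is a hypothetical helper, not a LangGraph API:

def with_token_logging(name, node_fn, llm):
    # Hypothetical helper: logs message/token counts before each node runs.
    def wrapped(state):
        msgs = state["messages"]
        print(f"[{name}] messages={len(msgs)} "
              f"tokens={llm.get_num_tokens_from_messages(msgs)}")
        return node_fn(state)
    return wrapped

graph.add_node("assistant", with_token_logging("assistant", assistant_node, llm))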

Prevention

  • Use bounded memory from day one:

    • trim_messages(...)
    • rolling windows
    • summaries for older turns
  • Keep graph state small:

    • store IDs, not full payloads
    • persist large artifacts outside the conversation state
  • Put token checks in CI or local tests (see the sketch after this list):

    • simulate long conversations
    • fail builds when prompts exceed your target budget
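
As a sketch of that last check, here is a pytest-style test that simulates a long conversation and asserts the trimmed prompt stays inside the same 6,000-token budget used earlier. The 40-turn simulation is an illustrative assumption:

# Pytest-style guard; assumes OPENAI_API_KEY is set, though token
# counting itself runs locally via tiktoken.
from langchain_core.messages import AIMessage, HumanMessage, trim_messages
from langchain_openai import ChatOpenAI

def test_long_conversation_stays_under_budget():
    llm = ChatOpenAI(model="gpt-4o-mini")
    history = []
    for i in range(40):  # simulate 40 user/assistant turns
        history.append(HumanMessage(content=f"Question {i}: " + "details " * 200))
        history.append(AIMessage(content="Answer " * 200))
    trimmed = trim_messages(
        history,
        max_tokens=6000,
        strategy="last",
        token_counter=llm,
    )
    assert llm.get_num_tokens_from_messages(trimmed) <= 6000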

By Cyprian Aarons, AI Consultant at Topiax.