How to Fix 'token limit exceeded when scaling' in LangChain (Python)

By Cyprian Aarons · Updated 2026-04-21

If you’re seeing a "token limit exceeded" error when scaling a LangChain Python app, you’re usually hitting a context-window problem, not a LangChain bug. It shows up when your chain, retriever, memory, or tool outputs keep adding tokens until the model rejects the request.

In practice, this happens during long chat sessions, retrieval-heavy RAG flows, or when you scale from a small test prompt to real production payloads. The fix is almost always to control what gets sent to the LLM before it crosses the model’s token limit.

The Most Common Cause

The #1 cause is unbounded chat history or document stuffing.

A lot of teams start with ConversationBufferMemory or pass every retrieved chunk into the prompt. That works for a few turns, then fails once the prompt grows beyond the model’s context window.

Here’s the broken pattern and the fixed pattern side by side:

  • Broken: keeps appending all messages forever. Fixed: uses windowed or token-limited memory.
  • Broken: sends every retrieved chunk into the prompt. Fixed: truncates/filters retrieved context.
  • Broken: fails with InvalidRequestError: This model's maximum context length is... (or similar). Fixed: stays under the token budget.
# BROKEN
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

llm = ChatOpenAI(model="gpt-3.5-turbo")
memory = ConversationBufferMemory(return_messages=True)

chain = ConversationChain(llm=llm, memory=memory)

for msg in long_user_session:
    print(chain.invoke({"input": msg}))

# FIXED
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationTokenBufferMemory
from langchain.chains import ConversationChain

llm = ChatOpenAI(model="gpt-3.5-turbo")

# Keeps memory bounded by token count instead of message count
memory = ConversationTokenBufferMemory(
    llm=llm,
    max_token_limit=2000,
    return_messages=True,
)

chain = ConversationChain(llm=llm, memory=memory)

for msg in long_user_session:
    print(chain.invoke({"input": msg}))

If this is a RAG pipeline, the same issue appears when you stuff too many documents into context:

# BROKEN: dumps every retrieved chunk into the prompt
docs = retriever.get_relevant_documents(query)
context = "\n\n".join(doc.page_content for doc in docs)
response = llm.invoke(f"Answer using this context:\n{context}\n\nQuestion: {query}")

# FIXED: cap docs and trim content before prompting
docs = retriever.get_relevant_documents(query)[:4]
context = "\n\n".join(doc.page_content[:1500] for doc in docs)
response = llm.invoke(f"Answer using this context:\n{context}\n\nQuestion: {query}")
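The 1,500-character slice is a rough proxy for tokens. For a token-accurate cap, count tokens per chunk and stop adding documents once a budget is hit. A minimal sketch, assuming the llm and retriever objects from above; the 3,000-token budget is illustrative:

# Accumulate retrieved chunks until a token budget is reached
MAX_CONTEXT_TOKENS = 3000  # illustrative, tune for your model

docs = retriever.get_relevant_documents(query)
selected, used = [], 0
for doc in docs:
    doc_tokens = llm.get_num_tokens(doc.page_content)
    if used + doc_tokens > MAX_CONTEXT_TOKENS:
        break
    selected.append(doc.page_content)
    used += doc_tokens

context = "\n\n".join(selected)
response = llm.invoke(f"Answer using this context:\n{context}\n\nQuestion: {query}")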

Other Possible Causes

1) You are using a smaller model than in your test setup

A prompt that works on gpt-4o can fail on a smaller deployment model.

llm = ChatOpenAI(model="gpt-3.5-turbo")  # smaller context than many newer models

Fix by checking the actual model context size and aligning your prompt budget to that limit.
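One way to make that concrete is to keep a per-model budget and check the prompt against it before each call. A minimal sketch; the context-window numbers and the check_prompt_budget helper are illustrative, so verify the real limits in your provider's docs:

# Illustrative context windows -- confirm against your provider's documentation
CONTEXT_WINDOWS = {
    "gpt-3.5-turbo": 16_385,
    "gpt-4o": 128_000,
}

def check_prompt_budget(llm, model_name: str, prompt: str, reserve_for_output: int = 1024):
    # Count prompt tokens and leave headroom for the completion
    prompt_tokens = llm.get_num_tokens(prompt)
    budget = CONTEXT_WINDOWS[model_name] - reserve_for_output
    if prompt_tokens > budget:
        raise ValueError(f"Prompt is {prompt_tokens} tokens; budget is {budget}")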

2) Tool outputs are too large

Agent tool calls can return huge blobs: PDFs, SQL results, JSON payloads, logs.

# BROKEN: tool returns raw large data
def fetch_customer_history(customer_id: str):
    return big_json_payload  # thousands of tokens

# FIXED: summarize or slice before returning to agent
def fetch_customer_history(customer_id: str):
    data = big_json_payload[:10]  # keep only the first few records (assumes a list)
    return {"summary": summarize(data), "count": len(data)}

If you use AgentExecutor, remember that tool output gets fed back into the agent loop.
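You can also cap how much that loop is allowed to accumulate. A minimal sketch, assuming you already have an agent and tools defined; max_iterations bounds how many tool results get appended back into the prompt:

from langchain.agents import AgentExecutor

executor = AgentExecutor(
    agent=agent,
    tools=tools,
    max_iterations=5,               # fewer loops = fewer tool outputs in the prompt
    early_stopping_method="force",  # stop cleanly instead of erroring when the cap is hit
)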

3) Retriever returns too many chunks

Default retriever settings often return more than you need.

retriever.search_kwargs = {"k": 10}  # may be too high for long chunks

Reduce k, add metadata filters, or use MMR with tighter limits:

retriever.search_kwargs = {"k": 4}
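If your retriever comes from a vector store, you can set MMR and the limits directly in as_retriever. A sketch, assuming a vectorstore object; fetch_k is how many candidates MMR considers before returning k diverse chunks:

retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20},  # return 4 diverse chunks from 20 candidates
)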

4) Prompt templates are too verbose

Sometimes the issue is not memory or retrieval. It’s your own system prompt plus instructions plus examples.

prompt = """
You are an expert assistant.
Follow these 25 rules...
Here are 8 examples...
"""

Trim repeated instructions and move stable policy text into a shorter system message.
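In practice that usually means a short system message plus the dynamic input, instead of resending rules and examples on every call. A minimal sketch with placeholder wording, assuming the llm from earlier:

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a support assistant. Answer only from the provided context."),
    ("human", "{input}"),
])

chain = prompt | llm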

How to Debug It

  1. Measure token usage before calling the model

    Use LangChain callbacks or inspect serialized inputs. If you’re using OpenAI models through LangChain, check prompt size right before invoke().
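    For OpenAI models, a quick option is get_num_tokens for a pre-call estimate plus the OpenAI callback for the usage the API actually reports. A minimal sketch, assuming the llm and chain from earlier; prompt_text and query are placeholders:

    from langchain_community.callbacks import get_openai_callback

    # Rough pre-call estimate of prompt size
    print(llm.get_num_tokens(prompt_text))

    # Token usage reported by the API for calls made inside the block
    with get_openai_callback() as cb:
        chain.invoke({"input": query})
    print(cb.prompt_tokens, cb.completion_tokens)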

  2. Print the final prompt payload

    Don’t guess. Log what the chain actually feeds the model:

    inputs = chain.prep_inputs({"input": query})  # includes loaded memory/history
    print(inputs)

    If you see huge history blocks or giant document dumps, you found the problem.

  3. Disable components one by one

    Remove memory first. Then remove retrieval. Then remove tools. Then test with a minimal prompt.

    The component that makes it fail is your culprit.

  4. Check model limits against actual input size

    Common errors surfaced through LangChain look like:

    • openai.BadRequestError: Error code: 400 - This model's maximum context length is ...
    • InvalidRequestError: This model's maximum context length has been exceeded
    • Token indices sequence length is longer than the specified maximum sequence length

    If your input exceeds the model window, no chain configuration will save it without trimming.
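    If you do need to trim before calling, recent langchain_core versions ship a trim_messages helper. A minimal sketch, assuming a list of chat messages and the llm as token counter; the 3,000-token budget is illustrative:

    from langchain_core.messages import trim_messages

    trimmed = trim_messages(
        messages,
        max_tokens=3000,        # keep the most recent messages under this budget
        strategy="last",
        token_counter=llm,
        include_system=True,    # keep the system message even when trimming
    )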

Prevention

  • Use token-aware memory classes like ConversationTokenBufferMemory, not unbounded buffers.
  • Cap retrieval aggressively:
    • lower k
    • truncate chunk text
    • filter by metadata before prompting
  • Add a preflight token check before every LLM call in production pipelines.

A simple rule works well: if a request can grow over time, make it token-bounded from day one. That includes chat history, retrieved docs, tool outputs, and inline examples.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
