How to Fix 'token limit exceeded when scaling' in LangChain (Python)
If you’re seeing token limit exceeded when scaling in a LangChain Python app, you’re usually hitting a context-window problem, not a LangChain bug. It shows up when your chain, retriever, memory, or tool outputs keep adding tokens until the model refuses the request.
In practice, this happens during long chat sessions, retrieval-heavy RAG flows, or when you scale from a small test prompt to real production payloads. The fix is almost always to control what gets sent to the LLM before it crosses the model’s token limit.
The Most Common Cause
The #1 cause is unbounded chat history or document stuffing.
A lot of teams start with `ConversationBufferMemory` or pass every retrieved chunk into the prompt. That works for a few turns, then fails once the prompt grows beyond the model's context window.
Here’s the broken pattern and the fixed pattern side by side:
| Broken | Fixed |
|---|---|
| Keeps appending all messages forever | Uses windowed or token-limited memory |
| Sends every retrieved chunk into the prompt | Truncates/filters retrieved context |
| Fails with `InvalidRequestError: This model's maximum context length is...` or similar | Stays under token budget |
```python
# BROKEN
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

llm = ChatOpenAI(model="gpt-3.5-turbo")
memory = ConversationBufferMemory(return_messages=True)  # grows without bound
chain = ConversationChain(llm=llm, memory=memory)

for msg in long_user_session:
    print(chain.invoke({"input": msg}))
```
```python
# FIXED
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationTokenBufferMemory
from langchain.chains import ConversationChain

llm = ChatOpenAI(model="gpt-3.5-turbo")

# Keeps memory bounded by token count instead of message count
memory = ConversationTokenBufferMemory(
    llm=llm,
    max_token_limit=2000,
    return_messages=True,
)
chain = ConversationChain(llm=llm, memory=memory)

for msg in long_user_session:
    print(chain.invoke({"input": msg}))
```
If this is a RAG pipeline, the same issue appears when you stuff too many documents into context:
```python
# BROKEN: dumps every retrieved chunk into the prompt
docs = retriever.get_relevant_documents(query)
context = "\n\n".join(doc.page_content for doc in docs)
response = llm.invoke(f"Answer using this context:\n{context}\n\nQuestion: {query}")
```

```python
# FIXED: cap docs and trim content before prompting
docs = retriever.get_relevant_documents(query)[:4]
context = "\n\n".join(doc.page_content[:1500] for doc in docs)
response = llm.invoke(f"Answer using this context:\n{context}\n\nQuestion: {query}")
```
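Character slicing is a blunt instrument; what the model actually enforces is tokens. Here is a dependency-free sketch of token-budgeted packing, using a rough 4-characters-per-token estimate (swap in tiktoken for exact counts; `pack_context` and the budget numbers are illustrative, not LangChain APIs):

```python
# Rough estimate: ~4 characters per token for English text.
# Use tiktoken's encoding for your model when you need exact counts.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def pack_context(chunks: list[str], budget: int = 3000) -> str:
    """Add whole chunks in relevance order until the token budget runs out."""
    selected, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    return "\n\n".join(selected)

# Usage: retrieved docs come back ranked, so the most relevant
# chunks survive the cut:
# context = pack_context([d.page_content for d in docs], budget=3000)
```

Because chunks are added in relevance order, hitting the budget drops the least relevant material first, which usually hurts answer quality far less than a blind character cut.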
Other Possible Causes
1) You are using a smaller model than your test setup
A prompt that works on gpt-4o can fail on a smaller deployment model.
```python
llm = ChatOpenAI(model="gpt-3.5-turbo")  # smaller context than many newer models
```
Fix by checking the actual model context size and aligning your prompt budget to that limit.
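One way to align the budget is a small lookup keyed by model name. The window sizes below are illustrative values, so confirm them against your provider's current documentation; `prompt_budget` is a hypothetical helper, not a LangChain function:

```python
# Context-window sizes (tokens) for a few OpenAI chat models.
# Illustrative values: always verify against current provider docs.
CONTEXT_WINDOWS = {
    "gpt-3.5-turbo": 16_385,
    "gpt-4o": 128_000,
}

def prompt_budget(model: str, reserved_for_output: int = 1_000) -> int:
    """Tokens you can safely spend on the prompt for a given model."""
    window = CONTEXT_WINDOWS.get(model, 8_192)  # conservative fallback
    return window - reserved_for_output
```

Reserving tokens for the completion matters: the limit applies to input plus output, so a prompt that exactly fills the window leaves the model no room to answer.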
2) Tool outputs are too large
Agent tool calls can return huge blobs: PDFs, SQL results, JSON payloads, logs.
```python
# BROKEN: tool returns raw large data
def fetch_customer_history(customer_id: str):
    return big_json_payload  # thousands of tokens

# FIXED: summarize or slice before returning to agent
def fetch_customer_history(customer_id: str):
    data = big_json_payload[:10]
    return {"summary": summarize(data), "count": len(data)}
```
If you use AgentExecutor, remember that tool output gets fed back into the agent loop.
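One way to enforce a cap across every tool is a small truncation helper applied to whatever a tool returns before it re-enters the loop; `cap_output` and its 4000-character default are illustrative names, not LangChain APIs:

```python
# Cap any tool's raw output before it is fed back to the agent.
# The 4000-character default is an illustrative choice, not a LangChain setting.
def cap_output(text: str, max_chars: int = 4000) -> str:
    if len(text) <= max_chars:
        return text
    dropped = len(text) - max_chars
    return text[:max_chars] + f"\n...[truncated {dropped} chars]"
```

Wrapping every tool's return value this way gives you one token-safety knob instead of per-tool fixes, at the cost of occasionally cutting off useful tail content.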
3) Retriever returns too many chunks
Default retriever settings often return more than you need.
```python
retriever.search_kwargs = {"k": 10}  # may be too high for long chunks
```

Reduce k, add metadata filters, or use MMR with tighter limits:

```python
retriever.search_kwargs = {"k": 4}
```
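For the MMR option, the usual pattern is to build the retriever from your vector store with `search_type="mmr"`; this is a sketch assuming a `vectorstore` created elsewhere:

```python
# MMR fetches fetch_k candidates, then returns k diverse results,
# which reduces near-duplicate chunks eating your token budget.
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20},
)
```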
4) Prompt templates are too verbose
Sometimes the issue is not memory or retrieval. It’s your own system prompt plus instructions plus examples.
```python
prompt = """
You are an expert assistant.
Follow these 25 rules...
Here are 8 examples...
"""
```
Trim repeated instructions and move stable policy text into a shorter system message.
How to Debug It
- Measure token usage before calling the model. Use LangChain callbacks or inspect serialized inputs. If you're using OpenAI models through LangChain, check prompt size right before `invoke()`.
- Print the final prompt payload. Don't guess. Log the exact messages being sent:

```python
messages = chain.prep_inputs({"input": query})
print(messages)
```

If you see huge history blocks or giant document dumps, you found the problem.
- Disable components one by one. Remove memory first. Then remove retrieval. Then remove tools. Then test with a minimal prompt. The component that makes it fail is your culprit.
- Check model limits against actual input size. A common LangChain-side error looks like:
  - `openai.BadRequestError: Error code: 400 - This model's maximum context length is ...`
  - `InvalidRequestError: This model's maximum context length has been exceeded`
  - `Token indices sequence length is longer than the specified maximum sequence length`

If your input exceeds the model window, no chain configuration will save it without trimming.
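To make the measurement step concrete, here is a quick dependency-free breakdown of where prompt tokens go, using a rough 4-characters-per-token estimate; `token_report` and the part names are illustrative:

```python
# Estimate token spend per prompt component; the biggest number
# is usually the culprit. Rough heuristic: ~4 characters per token.
def token_report(parts: dict[str, str]) -> dict[str, int]:
    return {name: max(1, len(text) // 4) for name, text in parts.items()}

report = token_report({
    "system": "You are an expert assistant.",
    "history": "user: hi\nassistant: hello\n" * 200,
    "context": "retrieved chunk text " * 500,
    "question": "What is our refund policy?",
})
print(report)
```

Logging a report like this on every request also gives you a trend line, so you can see history or context creeping up before it hits the limit.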
Prevention
- Use token-aware memory classes like `ConversationTokenBufferMemory`, not unbounded buffers.
- Cap retrieval aggressively:
  - lower `k`
  - truncate chunk text
  - filter by metadata before prompting
- Add a preflight token check before every LLM call in production pipelines.
A simple rule works well: if a request can grow over time, make it token-bounded from day one. That includes chat history, retrieved docs, tool outputs, and inline examples.
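A preflight check can be a few lines. This sketch uses a rough 4-characters-per-token estimate, and `TokenBudgetError` is an illustrative name; in production, count with tiktoken against your model's real window:

```python
# Refuse to call the model at all when the prompt exceeds its budget.
class TokenBudgetError(RuntimeError):
    pass

def preflight(prompt: str, max_tokens: int = 15_000) -> str:
    estimated = max(1, len(prompt) // 4)  # rough heuristic
    if estimated > max_tokens:
        raise TokenBudgetError(
            f"prompt is ~{estimated} tokens; budget is {max_tokens}"
        )
    return prompt

# Usage: llm.invoke(preflight(final_prompt)) fails fast
# instead of sending a doomed request to the API.
```

Failing fast like this turns a confusing provider-side 400 error into a clear, catchable exception in your own code.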
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.