How to Fix "Context Length Exceeded" Errors When Scaling LangChain (Python)
When a LangChain app starts throwing a context-length-exceeded error as it scales, it means the prompt you're sending is bigger than the model's context window. In practice, this usually shows up after you add more chat history, retrieve more documents, or keep appending tool output into the same chain.
This is almost always a prompt growth problem, not an LLM problem. The fix is usually to stop stuffing everything into one call and start controlling what gets passed into ChatOpenAI, ConversationalRetrievalChain, or your custom Runnable pipeline.
The Most Common Cause
The #1 cause is unbounded memory or history accumulation.
A lot of LangChain apps keep appending messages into ConversationBufferMemory, then pass the full transcript into every new call. That works for a few turns, then fails once the prompt crosses the model limit.
| Broken pattern | Fixed pattern |
|---|---|
| ConversationBufferMemory keeps growing forever | Use ConversationBufferWindowMemory or summary memory |
| Full chat history passed on every turn | Keep only the last N messages |
| No token budgeting before retrieval | Cap retrieved docs and trim context |
Broken code

```python
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationBufferMemory(return_messages=True)

chain = ConversationChain(
    llm=llm,
    memory=memory,
)

# After enough turns, this can trigger:
# openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is ..."}}
for i in range(100):
    chain.predict(input=f"Turn {i}: summarize the last update and include all prior details.")
```
Fixed code

```python
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationBufferWindowMemory(
    k=6,  # keep only the last 6 turns
    return_messages=True,
)

chain = ConversationChain(
    llm=llm,
    memory=memory,
)

for i in range(100):
    chain.predict(input=f"Turn {i}: summarize the last update and include only recent details.")
```
If you need longer-lived context, use ConversationSummaryMemory or a hybrid approach: summary + last N turns. Don’t let raw transcripts grow without bounds.
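The hybrid pattern can be sketched in plain Python, independent of any LangChain memory class. Here `summarize` is a hypothetical stand-in for an LLM summarization call, and the class name is made up for illustration:

```python
# Sketch of "summary + last N turns" history management.
# `summarize` is a placeholder: a real implementation would call an LLM
# to condense old messages into a short running summary.

def summarize(summary: str, messages: list[str]) -> str:
    # Placeholder summarizer; truncates each folded message.
    return summary + " | " + "; ".join(m[:40] for m in messages)

class HybridHistory:
    def __init__(self, window: int = 6):
        self.window = window          # raw turns kept verbatim
        self.summary = ""             # rolling summary of older turns
        self.messages: list[str] = []

    def add(self, message: str) -> None:
        self.messages.append(message)
        # Fold overflow into the summary instead of growing forever.
        if len(self.messages) > self.window:
            overflow = self.messages[: -self.window]
            self.summary = summarize(self.summary, overflow)
            self.messages = self.messages[-self.window :]

    def prompt_context(self) -> str:
        return f"Summary: {self.summary}\nRecent: {self.messages}"

history = HybridHistory(window=3)
for i in range(10):
    history.add(f"turn {i}")
# Only the last 3 raw turns survive; everything older lives in the summary.
```

The prompt size now stays roughly constant per turn, no matter how long the conversation runs.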
Other Possible Causes
1) Retriever returning too many documents
If you use RetrievalQA, ConversationalRetrievalChain, or your own retriever pipeline, a high k can flood the prompt with chunks.
```python
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})  # risky if chunks are large
```
Fix it by lowering k and chunk size:
```python
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
```
Also check your splitter:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
```
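Even with a lower k, it can help to enforce a hard token budget on the retrieved chunks before they reach the prompt. A minimal sketch, using a rough 4-characters-per-token approximation (an assumption; use a real tokenizer for accurate counts):

```python
# Keep only as many retrieved chunks as fit a token budget.
# Rough heuristic: ~4 characters per token for English text.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def cap_docs(docs: list[str], budget_tokens: int) -> list[str]:
    kept, used = [], 0
    for doc in docs:  # docs assumed ordered by relevance
        cost = approx_tokens(doc)
        if used + cost > budget_tokens:
            break
        kept.append(doc)
        used += cost
    return kept

docs = ["a" * 400, "b" * 400, "c" * 400]  # ~100 tokens each
print(len(cap_docs(docs, budget_tokens=250)))  # only the first two fit
```

Because the docs arrive sorted by relevance, truncating from the tail drops the least useful chunks first.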
2) Tool outputs are being injected raw into the prompt
Agent loops often append verbose tool responses directly into message history. One large JSON response from an API can blow up context immediately.
```python
# Bad: storing full tool output verbatim
messages.append({"role": "tool", "content": huge_json_response})
```
Trim it before passing it back:
```python
messages.append({
    "role": "tool",
    "content": huge_json_response[:2000]  # or summarize it first
})
```
For production agents, summarize tool results into structured fields instead of dumping raw payloads.
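One way to do that, sketched with hypothetical field names, is to extract only the fields the agent actually needs before the response re-enters the prompt:

```python
import json

# Keep only the fields the agent needs from a verbose API payload.
# The field names ("status", "result", "debug_trace") are made up
# for illustration.

def compact_tool_result(raw_json: str, keep: list[str]) -> str:
    payload = json.loads(raw_json)
    compact = {k: payload[k] for k in keep if k in payload}
    return json.dumps(compact)

raw = json.dumps({
    "status": "ok",
    "result": 42,
    "debug_trace": "x" * 5000,  # noise that would bloat the prompt
})
print(compact_tool_result(raw, keep=["status", "result"]))
# → {"status": "ok", "result": 42}
```

Unlike a blind `[:2000]` slice, this never cuts a JSON document in half, so the model always sees well-formed data.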
3) You’re using a smaller-context model than you think
A lot of failures happen after switching models. For example, code that worked on a larger-context deployment starts failing on a smaller one like gpt-4o-mini.
```python
llm = ChatOpenAI(model="gpt-4o-mini")  # smaller effective budget than you may expect in practice
```
Check your actual model limits and budget for:
- system prompt
- user input
- chat history
- retrieved docs
- tool outputs
- function call metadata
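A rough way to sanity-check that budget before calling the model, sketched with the same ~4-chars-per-token approximation (an assumption; a real tokenizer such as tiktoken gives exact counts):

```python
# Rough prompt-budget check across the components listed above.
# ~4 characters per token is only an approximation for English text.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def check_budget(parts: dict[str, str], context_limit: int, reserve_output: int) -> dict:
    counts = {name: approx_tokens(text) for name, text in parts.items()}
    total = sum(counts.values())
    return {
        "counts": counts,                              # per-component sizes
        "total": total,                                # total input tokens
        "fits": total + reserve_output <= context_limit,
    }

report = check_budget(
    parts={
        "system": "You are a helpful assistant.",
        "history": "user: hi\nassistant: hello" * 50,
        "docs": "retrieved chunk " * 200,
    },
    context_limit=4096,
    reserve_output=512,  # leave room for the model's answer
)
print(report["total"], report["fits"])
```

Running a check like this on every request turns a hard 400 error into a log line you can alert on.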
4) Prompt templates are duplicating content
Sometimes the same text gets inserted twice: once in memory and once in the template. This happens with custom chains that manually concatenate history plus use memory again.
```python
prompt = f"""
History:
{history}
History again:
{history}
Question: {question}
"""
```
Fix by making one component responsible for history:
```python
prompt = f"""
History:
{history}
Question: {question}
"""
```
If you’re using LangChain message objects, keep them as messages instead of flattening everything into one giant string.
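A minimal, library-agnostic sketch of windowing a message list while keeping the system message intact (plain dicts here; recent LangChain versions also ship a trim_messages helper in langchain_core.messages that does this with token awareness):

```python
# Keep the system message plus only the last N conversation messages.
# Messages are plain dicts so the idea works with any framework.

def window_messages(messages: list[dict], last_n: int) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-last_n:]

messages = [{"role": "system", "content": "Be concise."}]
for i in range(20):
    messages.append({"role": "user", "content": f"question {i}"})
    messages.append({"role": "assistant", "content": f"answer {i}"})

trimmed = window_messages(messages, last_n=6)
print(len(trimmed))  # system message + last 6 conversation messages = 7
```

Trimming at the message boundary like this keeps role structure intact, which flattening into one string would destroy.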
How to Debug It
1) Print token counts before calling the model
- Use a tokenizer or LangChain's token utilities to measure prompt size.
- If your prompt is already near the limit before retrieval, the issue is memory/history.
2) Log each prompt component separately
- Split out the system message, chat history, retrieved docs, and tool output.
- One of these will usually dominate total size.
3) Reduce each variable one at a time
- Set retriever k=1.
- Disable memory.
- Remove tool messages.
- If the error disappears, you found the source.
4) Inspect the exact exception
- Common real errors look like:
  - openai.BadRequestError: Error code: 400
  - "This model's maximum context length is ..."
  - "Please reduce your input length"
- If LangChain wraps it, trace back to the underlying provider error.
Prevention
- Use bounded memory by default. Prefer ConversationBufferWindowMemory or summary-based memory over raw buffers.
- Budget tokens explicitly. Reserve space for output before adding retrieval results or tools, keep chunks small, and cap retriever results.
- Treat long tool output as data, not prompt text. Summarize API responses before reinserting them into agent state.
If you’re hitting this error in LangChain Python, start by looking at memory first. In most cases, fixing unbounded history solves it immediately.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.