How to Fix "Context Length Exceeded" Errors When Scaling LangChain (Python)
When a LangChain app starts throwing a context-length-exceeded error as it scales, it means the prompt you're sending is bigger than the model's context window. In practice, this usually shows up after you add more chat history, retrieve more documents, or keep appending tool output into the same chain.
This is almost always a prompt growth problem, not an LLM problem. The fix is usually to stop stuffing everything into one call and start controlling what gets passed into ChatOpenAI, ConversationalRetrievalChain, or your custom Runnable pipeline.
The Most Common Cause
The #1 cause is unbounded memory or history accumulation.
A lot of LangChain apps keep appending messages into ConversationBufferMemory, then pass the full transcript into every new call. That works for a few turns, then fails once the prompt crosses the model limit.
| Broken pattern | Fixed pattern |
|---|---|
| ConversationBufferMemory keeps growing forever | Use ConversationBufferWindowMemory or summary memory |
| Full chat history passed on every turn | Keep only the last N messages |
| No token budgeting before retrieval | Cap retrieved docs and trim context |
Broken code

```python
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationBufferMemory(return_messages=True)

chain = ConversationChain(
    llm=llm,
    memory=memory,
)

# After enough turns, this can trigger:
# openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is ..."}}
for i in range(100):
    chain.predict(input=f"Turn {i}: summarize the last update and include all prior details.")
```
Fixed code

```python
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationBufferWindowMemory(
    k=6,  # keep only the last 6 turns
    return_messages=True,
)

chain = ConversationChain(
    llm=llm,
    memory=memory,
)

for i in range(100):
    chain.predict(input=f"Turn {i}: summarize the last update and include only recent details.")
```
If you need longer-lived context, use ConversationSummaryMemory or a hybrid approach: summary + last N turns. Don’t let raw transcripts grow without bounds.
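The hybrid pattern can be sketched in plain Python, independent of any LangChain memory class. Here `summarize` is a hypothetical stand-in for an LLM summarization call, and the class name is made up for illustration:

```python
# Sketch of "summary + last N turns" history management.
# `summarize` is a placeholder: a real implementation would call an LLM
# to condense old messages into a short running summary.

def summarize(summary: str, messages: list[str]) -> str:
    # Placeholder summarizer; truncates each folded message.
    return summary + " | " + "; ".join(m[:40] for m in messages)

class HybridHistory:
    def __init__(self, window: int = 6):
        self.window = window          # raw turns kept verbatim
        self.summary = ""             # rolling summary of older turns
        self.messages: list[str] = []

    def add(self, message: str) -> None:
        self.messages.append(message)
        # Fold overflow into the summary instead of growing forever.
        if len(self.messages) > self.window:
            overflow = self.messages[: -self.window]
            self.summary = summarize(self.summary, overflow)
            self.messages = self.messages[-self.window :]

    def prompt_context(self) -> str:
        return f"Summary: {self.summary}\nRecent: {self.messages}"

history = HybridHistory(window=3)
for i in range(10):
    history.add(f"turn {i}")
# Only the last 3 raw turns survive; everything older lives in the summary.
```

The prompt size now stays roughly constant per turn, no matter how long the conversation runs.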
Other Possible Causes
1) Retriever returning too many documents
If you use RetrievalQA, ConversationalRetrievalChain, or your own retriever pipeline, a high k can flood the prompt with chunks.
```python
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})  # risky if chunks are large
```
Fix it by lowering k and chunk size:
```python
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
```
Also check your splitter:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
```
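Even with a lower k, it can help to enforce a hard token budget on the retrieved chunks before they reach the prompt. A minimal sketch, using a rough 4-characters-per-token approximation (an assumption; use a real tokenizer for accurate counts):

```python
# Keep only as many retrieved chunks as fit a token budget.
# Rough heuristic: ~4 characters per token for English text.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def cap_docs(docs: list[str], budget_tokens: int) -> list[str]:
    kept, used = [], 0
    for doc in docs:  # docs assumed ordered by relevance
        cost = approx_tokens(doc)
        if used + cost > budget_tokens:
            break
        kept.append(doc)
        used += cost
    return kept

docs = ["a" * 400, "b" * 400, "c" * 400]  # ~100 tokens each
print(len(cap_docs(docs, budget_tokens=250)))  # only the first two fit
```

Because the docs arrive sorted by relevance, truncating from the tail drops the least useful chunks first.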
2) Tool outputs are being injected raw into the prompt
Agent loops often append verbose tool responses directly into message history. One large JSON response from an API can blow up context immediately.
```python
# Bad: storing full tool output verbatim
messages.append({"role": "tool", "content": huge_json_response})
```
Trim it before passing it back:
```python
messages.append({
    "role": "tool",
    "content": huge_json_response[:2000]  # or summarize it first
})
```
For production agents, summarize tool results into structured fields instead of dumping raw payloads.
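One way to do that, sketched with hypothetical field names, is to extract only the fields the agent actually needs before the response re-enters the prompt:

```python
import json

# Keep only the fields the agent needs from a verbose API payload.
# The field names ("status", "result", "debug_trace") are made up
# for illustration.

def compact_tool_result(raw_json: str, keep: list[str]) -> str:
    payload = json.loads(raw_json)
    compact = {k: payload[k] for k in keep if k in payload}
    return json.dumps(compact)

raw = json.dumps({
    "status": "ok",
    "result": 42,
    "debug_trace": "x" * 5000,  # noise that would bloat the prompt
})
print(compact_tool_result(raw, keep=["status", "result"]))
# → {"status": "ok", "result": 42}
```

Unlike a blind `[:2000]` slice, this never cuts a JSON document in half, so the model always sees well-formed data.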
3) You’re using a smaller-context model than you think
A lot of failures happen after switching models. For example, code that worked on a larger-context deployment starts failing on a smaller one like gpt-4o-mini.
```python
llm = ChatOpenAI(model="gpt-4o-mini")  # smaller effective budget than you may expect in practice
```
Check your actual model limits and budget for:
- system prompt
- user input
- chat history
- retrieved docs
- tool outputs
- function call metadata
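A rough way to sanity-check that budget before calling the model, sketched with the same ~4-chars-per-token approximation (an assumption; a real tokenizer such as tiktoken gives exact counts):

```python
# Rough prompt-budget check across the components listed above.
# ~4 characters per token is only an approximation for English text.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def check_budget(parts: dict[str, str], context_limit: int, reserve_output: int) -> dict:
    counts = {name: approx_tokens(text) for name, text in parts.items()}
    total = sum(counts.values())
    return {
        "counts": counts,                              # per-component sizes
        "total": total,                                # total input tokens
        "fits": total + reserve_output <= context_limit,
    }

report = check_budget(
    parts={
        "system": "You are a helpful assistant.",
        "history": "user: hi\nassistant: hello" * 50,
        "docs": "retrieved chunk " * 200,
    },
    context_limit=4096,
    reserve_output=512,  # leave room for the model's answer
)
print(report["total"], report["fits"])
```

Running a check like this on every request turns a hard 400 error into a log line you can alert on.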
4) Prompt templates are duplicating content
Sometimes the same text gets inserted twice: once in memory and once in the template. This happens with custom chains that manually concatenate history plus use memory again.
```python
prompt = f"""
History:
{history}
History again:
{history}
Question: {question}
"""
```
Fix by making one component responsible for history:
```python
prompt = f"""
History:
{history}
Question: {question}
"""
```
If you’re using LangChain message objects, keep them as messages instead of flattening everything into one giant string.
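A minimal, library-agnostic sketch of windowing a message list while keeping the system message intact (plain dicts here; recent LangChain versions also ship a trim_messages helper in langchain_core.messages that does this with token awareness):

```python
# Keep the system message plus only the last N conversation messages.
# Messages are plain dicts so the idea works with any framework.

def window_messages(messages: list[dict], last_n: int) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-last_n:]

messages = [{"role": "system", "content": "Be concise."}]
for i in range(20):
    messages.append({"role": "user", "content": f"question {i}"})
    messages.append({"role": "assistant", "content": f"answer {i}"})

trimmed = window_messages(messages, last_n=6)
print(len(trimmed))  # system message + last 6 conversation messages = 7
```

Trimming at the message boundary like this keeps role structure intact, which flattening into one string would destroy.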
How to Debug It
1) Print token counts before calling the model
- Use a tokenizer or LangChain's token utilities to measure prompt size.
- If your prompt is already near the limit before retrieval, the issue is memory/history.
2) Log each prompt component separately
- Split out the system message, chat history, retrieved docs, and tool output.
- One of these will usually dominate total size.
3) Reduce each variable one at a time
- Set retriever k=1.
- Disable memory.
- Remove tool messages.
- If the error disappears, you found the source.
4) Inspect the exact exception
- Common real errors look like:
  - openai.BadRequestError: Error code: 400
  - "This model's maximum context length is ..."
  - "Please reduce your input length"
- If LangChain wraps it, trace back to the underlying provider error.
Prevention
- Use bounded memory by default. Prefer ConversationBufferWindowMemory or summary-based memory over raw buffers.
- Budget tokens explicitly. Reserve space for output before adding retrieval results or tools, keep chunks small, and cap retriever results.
- Treat long tool output as data, not prompt text. Summarize API responses before reinserting them into agent state.
If you’re hitting this error in LangChain Python, start by looking at memory first. In most cases, fixing unbounded history solves it immediately.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.