How to Fix 'context length exceeded' in LangChain (Python)
When you see context length exceeded in LangChain, the model is telling you that the total tokens in your prompt, chat history, retrieved documents, and tool output exceed the model's context window. It usually shows up in chat apps, retrieval chains, or agent loops where memory keeps growing until the next LLM call fails.
In LangChain Python, this often surfaces as a provider error like:
- openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 16385 tokens..."}}
- litellm.BadRequestError: context_length_exceeded
- anthropic.BadRequestError: prompt is too long
The Most Common Cause
The #1 cause is unbounded conversation history or chain input growth.
A classic example is a ConversationBufferMemory or agent loop that keeps appending every message forever. Eventually, the next call to ChatOpenAI, ChatAnthropic, or another chat model blows past the context window.
| Broken pattern | Fixed pattern |
|---|---|
| Keep all messages forever | Trim, summarize, or window the memory |
| Pass full docs + full chat history every turn | Limit retrieved chunks and memory size |
```python
# BROKEN: memory grows without bound
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

llm = ChatOpenAI(model="gpt-4o-mini")
memory = ConversationBufferMemory(return_messages=True)
chain = ConversationChain(llm=llm, memory=memory)

for msg in user_messages:
    print(chain.invoke({"input": msg}))
```
```python
# FIXED: use a bounded memory strategy
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationSummaryBufferMemory
from langchain.chains import ConversationChain

llm = ChatOpenAI(model="gpt-4o-mini")
memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=2000,
    return_messages=True,
)
chain = ConversationChain(llm=llm, memory=memory)

for msg in user_messages:
    print(chain.invoke({"input": msg}))
```
If you want stricter control, use a windowed memory instead:
```python
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=6, return_messages=True)
```
That keeps only the last k turns instead of the entire transcript.
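Under the hood, windowing is just slicing off the oldest messages. If you ever need to do it manually (for a custom chain or for testing), it reduces to a few lines of plain Python. This is a standalone sketch, not a LangChain API, and `window_messages` is a hypothetical helper name:

```python
def window_messages(messages: list, k: int = 6) -> list:
    """Keep only the last k conversational turns.

    Assumes messages alternate (human, ai), so k turns = 2*k messages.
    """
    return messages[-2 * k:]

history = [f"msg{i}" for i in range(20)]
print(window_messages(history, k=3))  # last 6 messages: msg14..msg19
```

The same slicing idea is what `ConversationBufferWindowMemory` applies for you on every call.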
Other Possible Causes
1) Retrieval returns too many chunks
If you're using RetrievalQA, ConversationalRetrievalChain, or a custom RAG pipeline, your retriever may be stuffing too many documents into the prompt.
```python
# BROKEN: stuffs 12 chunks into every prompt
retriever = vectorstore.as_retriever(search_kwargs={"k": 12})

# FIXED: fewer, higher-signal chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
```
If chunks are large, reduce chunk size too:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
)
```
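You can sanity-check how much of the context window your retrieval settings consume with rough arithmetic. The ~4 characters per token figure is a common English-text approximation (not exact), and `estimate_retrieval_tokens` is a hypothetical helper:

```python
def estimate_retrieval_tokens(k: int, chunk_size_chars: int,
                              chars_per_token: int = 4) -> int:
    """Rough upper bound on tokens contributed by retrieved chunks."""
    return k * chunk_size_chars // chars_per_token

# 12 chunks of 1000 chars vs 4 chunks of 800 chars
print(estimate_retrieval_tokens(12, 1000))  # ~3000 tokens
print(estimate_retrieval_tokens(4, 800))    # ~800 tokens
```

On a 16k-token model, the difference between those two settings is a large fraction of the budget once you add history and a system prompt.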
2) Tool output is being fed back raw
Agents can explode token usage when tool outputs are long and get appended to every loop iteration.
```python
# BROKEN: huge tool output goes straight into the scratchpad/history
result = agent_executor.invoke({"input": "Summarize this PDF"})
```
Fix by truncating tool output or summarizing before re-inserting it:
```python
def truncate(text: str, limit: int = 3000) -> str:
    return text[:limit]

safe_output = truncate(tool_result)
```
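Plain head truncation can drop the conclusion of a document, which is often the part the agent actually needs. One hedged alternative (a sketch, not a LangChain API) keeps both ends and cuts the middle:

```python
def truncate_middle(text: str, limit: int = 3000,
                    marker: str = "\n...[truncated]...\n") -> str:
    """Keep the head and tail of long text, cutting the middle."""
    if len(text) <= limit:
        return text
    keep = (limit - len(marker)) // 2
    return text[:keep] + marker + text[-keep:]

long_output = "x" * 10_000
short = truncate_middle(long_output, limit=3000)
print(len(short))  # at most 3000 characters
```

For tool outputs where neither end is reliably the important part, summarizing with a cheap model before re-inserting is the more robust option.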
3) Prompt template is too verbose
Sometimes the issue is not memory. It’s just an oversized system prompt plus examples plus user input.
```python
# BROKEN: giant static prompt
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", VERY_LONG_POLICY_TEXT),
    ("human", "{question}"),
])
```
Trim policy text and move stable instructions into shorter rules. If you need examples, keep one or two high-signal examples instead of dumping an entire spec.
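One way to keep the system prompt lean is to build it from a short list of rules instead of pasting a whole policy document. A minimal sketch (the rule text here is made up for illustration):

```python
# Hypothetical: a handful of high-signal rules instead of a full policy spec
RULES = [
    "Answer in under 150 words.",
    "If the retrieved context does not contain the answer, say so.",
]

system_prompt = "Follow these rules:\n" + "\n".join(f"- {r}" for r in RULES)
print(system_prompt)
```

Keeping rules in a list also makes it easy to measure and trim the prompt per rule rather than guessing at a wall of text.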
4) Wrong model context window assumption
A lot of people assume all GPT-style models have the same limit. They don’t.
| Model family | Typical issue |
|---|---|
| Smaller chat models | Easy to exceed with RAG + history |
| Older OpenAI models | Lower context windows than expected |
| Anthropic / Bedrock wrappers | Provider-specific prompt limits |
Check your actual model name:
```python
llm = ChatOpenAI(model="gpt-4o-mini")  # not "whatever default"
```
Then verify its max context window from the provider docs before designing your chain.
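A small preflight check makes that assumption explicit in code. The window sizes below are from provider docs at the time of writing (verify them for your models), and `fits_in_context` is a hypothetical helper:

```python
# Context windows in tokens; verify against current provider docs.
CONTEXT_WINDOWS = {
    "gpt-4o-mini": 128_000,
    "gpt-3.5-turbo": 16_385,
}

def fits_in_context(model: str, prompt_tokens: int,
                    max_output_tokens: int = 1024) -> bool:
    """True if the prompt plus reserved output fits the model's window."""
    window = CONTEXT_WINDOWS.get(model)
    if window is None:
        raise ValueError(f"Unknown context window for model {model!r}")
    return prompt_tokens + max_output_tokens <= window

print(fits_in_context("gpt-3.5-turbo", 20_000))  # False: over the 16,385 window
print(fits_in_context("gpt-4o-mini", 20_000))    # True
```

Reserving room for `max_output_tokens` matters because the window covers input and output combined on most providers.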
How to Debug It
1) Measure tokens before calling the model

Use a tokenizer or LangChain's message formatting path to estimate prompt size.

```python
from tiktoken import encoding_for_model

enc = encoding_for_model("gpt-4o-mini")
token_count = len(enc.encode(full_prompt_text))
print(token_count)
```

2) Log each component separately

Break down:

- system prompt
- chat history
- retrieved documents
- tool output
- user input

One of these will usually dominate.

3) Reduce one variable at a time

Temporarily set:

- retriever k=2
- memory window to 2 turns
- tool output truncation to 1000 chars

If the error disappears, you found the source.

4) Inspect the exact provider error

Don't stop at LangChain's wrapper exception. Look for messages like:

- maximum context length
- prompt is too long
- context_length_exceeded
- This model's maximum context length is ...

That tells you whether it's input size, output size, or both.
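Logging each component separately takes only a few lines. This sketch uses a rough ~4 characters per token estimate so it runs without a tokenizer; swap in tiktoken for real counts. Both helper names are hypothetical:

```python
def rough_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def log_component_tokens(components: dict) -> dict:
    """Print per-component token estimates, largest first."""
    counts = {name: rough_tokens(text) for name, text in components.items()}
    for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{name:>10}: ~{n} tokens")
    return counts

counts = log_component_tokens({
    "system": "You are a helpful assistant." * 10,
    "history": "user: hi\nai: hello\n" * 200,
    "docs": "retrieved chunk text " * 500,
    "input": "What is our refund policy?",
})
```

In most real failures, the first line of that output (history or docs) points straight at the culprit.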
Prevention
- Use bounded memory by default:
  - ConversationBufferWindowMemory
  - ConversationSummaryBufferMemory
  - custom summarization after N turns
- Put hard limits on retrieval and tools:
  - smaller k
  - smaller chunk sizes
  - truncate long tool outputs before they hit the LLM
- Add token checks in CI or preflight logging for any production chain that combines:
  - chat history
  - RAG context
  - agent tools
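A preflight token check can be as simple as the sketch below. `TOKEN_BUDGET` is an assumed per-call limit you would set below your model's real window, and the ~4 chars/token estimate is an approximation:

```python
TOKEN_BUDGET = 15_000  # assumed budget, set below your model's real window

def preflight(system: str, history: str, docs: str, user_input: str,
              budget: int = TOKEN_BUDGET) -> int:
    """Fail fast before the provider call instead of mid-request."""
    total = sum(len(part) // 4 for part in (system, history, docs, user_input))
    if total > budget:
        raise ValueError(f"Estimated {total} tokens exceeds budget {budget}")
    return total

preflight("short system prompt", "short history", "one chunk", "question")
```

Raising locally gives you a clean stack trace with the offending component sizes, instead of a provider 400 buried in a retry loop.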
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.