How to Fix 'context length exceeded during development' in LangChain (Python)
What the error means
"context length exceeded" during development usually means you sent more tokens to the model than its context window allows. In LangChain, this shows up when your prompt, chat history, retrieved documents, or tool outputs get concatenated into one request and push the total past the model's limit.
You’ll typically hit it during iterative development with `ConversationBufferMemory`, large retriever results, or when you keep appending messages without trimming.
The Most Common Cause
The #1 cause is unbounded chat history. Developers use `ConversationBufferMemory` or manually append every turn, then pass the full transcript into every `LLMChain` or `ChatPromptTemplate` call.
Here’s the broken pattern:
| Broken | Fixed |
|---|---|
| Keeps every message forever | Trims or summarizes history |
| No token budgeting | Uses bounded memory |
| Easy to hit `InvalidRequestError` / 400 `context_length_exceeded` | Stays under model limits |
```python
# BROKEN
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

llm = ChatOpenAI(model="gpt-3.5-turbo")  # small context window
memory = ConversationBufferMemory(return_messages=True)  # keeps every message forever

chain = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True,
)

print(chain.predict(input="Explain our refund policy"))
print(chain.predict(input="Now summarize the exceptions"))
```
After a few turns, LangChain sends the entire message list back to the model. OpenAI will respond with errors like:
- `openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 4097 tokens..."}}`
- `InvalidRequestError: This model's maximum context length is ...`
- `context_length_exceeded`
Use bounded memory instead:
```python
# FIXED
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationTokenBufferMemory

llm = ChatOpenAI(model="gpt-4o-mini", max_tokens=500)

# Drops the oldest messages once history exceeds max_token_limit.
memory = ConversationTokenBufferMemory(
    llm=llm,
    max_token_limit=2000,
    return_messages=True,
)

chain = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True,
)

print(chain.predict(input="Explain our refund policy"))
print(chain.predict(input="Now summarize the exceptions"))
```
If you need long-running conversations, use summary memory instead of raw buffers:
```python
from langchain.memory import ConversationSummaryBufferMemory
```
That keeps recent turns plus a rolling summary.
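A minimal sketch of dropping it into the chain above (same `llm` and `ConversationChain` as before; the limit of 1000 tokens is just an example):

```python
# Recent turns stay verbatim; older turns are compressed into a rolling
# summary once the buffer exceeds max_token_limit.
memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=1000,
    return_messages=True,
)

chain = ConversationChain(llm=llm, memory=memory)
```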
Other Possible Causes
1. Retriever returns too many documents
A common RAG bug is setting `k` too high and stuffing every retrieved chunk into the prompt.

```python
# Too much context: assumes an existing `vectorstore` and `query`
retriever = vectorstore.as_retriever(search_kwargs={"k": 12})
docs = retriever.get_relevant_documents(query)
```

Fix it by lowering `k`, chunking better, and filtering aggressively.

```python
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
```

If your chunks are huge, reduce the chunk size too:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
```
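To make the budget explicit, you can measure retrieved chunks before they reach the prompt. Below is a sketch using `tiktoken`; `docs_within_budget` is a hypothetical helper, and the encoding fallback is an assumption for model names tiktoken doesn't recognize:

```python
import tiktoken

def docs_within_budget(docs, max_tokens=2000, model="gpt-4o-mini"):
    """Keep retrieved docs until the running token count hits max_tokens."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")  # assumed fallback encoding
    kept, used = [], 0
    for doc in docs:
        n = len(enc.encode(doc.page_content))
        if used + n > max_tokens:
            break  # stop before the next chunk would blow the budget
        kept.append(doc)
        used += n
    return kept

docs = docs_within_budget(retriever.get_relevant_documents(query))
```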
2. Tool output is being injected raw into the prompt
Agents can blow up context when tool responses are large JSON blobs, HTML pages, or database dumps.
```python
# Problematic: raw tool output gets appended directly to the prompt
tool_result = fetch_customer_record(customer_id)
prompt = f"Answer using this data:\n{tool_result}"
```

Fix it by summarizing before passing it downstream:

```python
summary = llm.invoke(f"Summarize this customer record in 8 bullets:\n{tool_result}")
```
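If a summarization call per tool invocation is too slow or costly, a hard truncation guard is a cheap first line of defense. A sketch, assuming a character cap you tune yourself (`MAX_TOOL_CHARS` and `clip` are hypothetical names):

```python
MAX_TOOL_CHARS = 8_000  # assumed cap; tune for your model's context window

def clip(text: str, limit: int = MAX_TOOL_CHARS) -> str:
    """Hard-truncate oversized tool output before it enters any prompt."""
    return text if len(text) <= limit else text[:limit] + "\n...[truncated]"

prompt = f"Answer using this data:\n{clip(str(tool_result))}"
```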
3. Prompt template includes static text that is too large
Sometimes the issue is not dynamic history at all. It’s a giant system prompt, policy dump, or copied knowledge-base text inside `ChatPromptTemplate`.

```python
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", open("policy_manual.txt").read()),  # entire manual, every call
    ("human", "{input}"),
])
```
If that file is long, every request pays for it again. Move static content to retrieval or compress it into a shorter system prompt.
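One way to do that, sketched below, is to index the manual once and retrieve only the relevant passages per request. This assumes `faiss-cpu` is installed and reuses the chunking settings from earlier:

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Index the policy manual once, outside the request path.
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
chunks = splitter.split_text(open("policy_manual.txt").read())
vectorstore = FAISS.from_texts(chunks, OpenAIEmbeddings())

# Per request: fetch only the few passages the question actually needs.
policy_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
```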
4. You are using the wrong model for the job
Some models have small context windows. If you’re testing with a short-context model and feeding it long transcripts, you’ll keep hitting errors even if your code is fine.
| Model type | Typical risk |
|---|---|
| Small-context chat models | Frequent overflow |
| Larger-context models | Better for long docs/conversations |
Switch to a larger-context model when your app needs long histories or document-heavy prompts.
How to Debug It
- **Print token usage before calling the model.** Log prompt size, retrieved-document length, and conversation turns: `print(len(messages))`, `print(len(str(docs)))`.
- **Check whether memory is growing without bounds.** If you use `ConversationBufferMemory`, inspect how many messages are being carried forward each turn.
- **Disable retrieval and tools temporarily.** Run the chain with only a short user message. If the error disappears, the overflow is coming from docs or tool output.
- **Binary search your prompt.** Remove half of the messages/docs/prompt text, retry, and narrow down which component pushes you over the limit.
A practical trick: log the exact payload sent to the model via LangChain callbacks or by printing formatted messages before invocation.
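For example, a small callback handler can report the size of each payload before it is sent. A sketch; `PromptLogger` is a hypothetical name, hooked into LangChain's standard `on_chat_model_start` event:

```python
from langchain_core.callbacks import BaseCallbackHandler

class PromptLogger(BaseCallbackHandler):
    """Print a size summary of every chat payload before it is sent."""

    def on_chat_model_start(self, serialized, messages, **kwargs):
        for batch in messages:
            for msg in batch:
                print(f"[{msg.type}] {len(msg.content)} chars")

# Attach per call:
# chain.predict(input="Explain our refund policy", callbacks=[PromptLogger()])
```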
Prevention
- Use bounded memory by default:
  - `ConversationTokenBufferMemory`
  - `ConversationSummaryBufferMemory`
- Cap retrieval:
  - keep `k` small
  - use smaller chunks
  - filter irrelevant docs early
- Budget tokens explicitly (a sketch follows this list):
  - reserve space for completion output
  - don’t fill 95% of the context with input text
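A minimal sketch of an explicit budget, with illustrative numbers rather than any particular model's real limits:

```python
# Illustrative numbers only; substitute your model's actual context window.
CONTEXT_WINDOW = 16_000      # total tokens the model accepts per request
COMPLETION_RESERVE = 1_000   # space reserved for the model's answer
SYSTEM_BUDGET = 500          # system prompt
HISTORY_BUDGET = 2_000       # trimmed chat history

# Whatever remains is the ceiling for retrieved docs and tool output.
INPUT_BUDGET = CONTEXT_WINDOW - COMPLETION_RESERVE - SYSTEM_BUDGET - HISTORY_BUDGET
assert INPUT_BUDGET > 0, "the reserves must leave room for input"
```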
If you build agents for production systems like banking or insurance workflows, treat context as a finite resource. Every new message, document chunk, and tool result needs a budget before it enters the prompt.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.