# How to Fix "context length exceeded" Errors in Production LangChain Apps (Python)
When you see a “context length exceeded” error in production, it means the prompt you sent to the model is larger than the model’s context window. In LangChain, this usually shows up after you’ve chained chat history, retrieved documents, tool outputs, and a fresh user message into one request.
This is not a LangChain bug in most cases. It’s almost always an application design issue: you’re feeding too much text into a model that has a hard token limit.
## The Most Common Cause
The #1 cause is unbounded conversation history being passed into ChatOpenAI through a chain or agent.
A common failure pattern is appending every prior message forever, then sending the whole list on every turn. Once the conversation gets long enough, OpenAI returns errors like:
- `BadRequestError: This model's maximum context length is 16385 tokens. However, your messages resulted in 18240 tokens.`
- `openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length has been exceeded"}}`
### Broken vs. fixed
| Broken pattern | Fixed pattern |
|---|---|
| Keep all messages in memory | Trim or summarize old messages |
| Send full chat history every turn | Keep only recent turns |
| No token accounting | Enforce a max token budget |
```python
# BROKEN: history grows without bound
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage

llm = ChatOpenAI(model="gpt-4o-mini")

history = [
    HumanMessage(content="Hi"),
    AIMessage(content="Hello"),
    # ... grows forever
]

def ask(question: str):
    messages = history + [HumanMessage(content=question)]
    return llm.invoke(messages)
```
```python
# FIXED: trim to a token budget before every call
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage
from langchain_core.messages.utils import trim_messages

llm = ChatOpenAI(model="gpt-4o-mini")

history = [
    HumanMessage(content="Hi"),
    AIMessage(content="Hello"),
]

def ask(question: str):
    messages = history + [HumanMessage(content=question)]
    trimmed = trim_messages(
        messages,
        max_tokens=6000,
        strategy="last",
        token_counter=llm,
        include_system=True,
        allow_partial=False,
    )
    return llm.invoke(trimmed)
```

If you’re using `RunnableWithMessageHistory`, the same problem applies. The wrapper does not magically solve token growth; it just stores messages for you.
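As a rough mental model of what "keep only the newest messages that fit" means, here is an illustrative plain-Python sketch. This is not LangChain's actual implementation: the `len(content) // 4` count is a crude stand-in for a real token counter, and the message shape is plain dicts.

```python
def trim_last(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the most recent messages that fit within max_tokens.

    Illustrative only: approximates tokens as len(content) // 4 and
    always keeps the system message, mirroring include_system=True.
    """
    def count(m: dict) -> int:
        return len(m["content"]) // 4

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    budget = max_tokens - sum(count(m) for m in system)
    kept: list[dict] = []
    for m in reversed(rest):  # walk newest-first
        if count(m) > budget:
            break             # like allow_partial=False: drop whole messages
        kept.insert(0, m)
        budget -= count(m)
    return system + kept

msgs = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "old " * 100},
    {"role": "assistant", "content": "ok"},
    {"role": "user", "content": "What is my balance?"},
]
print(trim_last(msgs, max_tokens=20))
```

The oldest (and largest) message is dropped while the system message and the recent turns survive, which is the behavior you want from any trimming strategy.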
## Other Possible Causes
### 1) Retriever returning too many documents
A RetrievalQA or custom RAG chain can blow up when k is too high or chunks are too large.
```python
# Too much context from retrieval
retriever = vectorstore.as_retriever(search_kwargs={"k": 12})
```
Fix it by lowering `k`:

```python
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
```
Also make sure your splitter isn’t producing giant chunks:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
)
```
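As a rough sanity check on these numbers, you can estimate how many tokens a retrieval step adds. The 4-characters-per-token figure below is a common heuristic, not an exact count:

```python
# Rough estimate of tokens added by retrieval, using the common
# ~4 characters-per-token heuristic (an approximation, not exact).
def estimated_retrieval_tokens(k: int, chunk_size_chars: int) -> int:
    return (k * chunk_size_chars) // 4

# 12 chunks of 2000 characters vs. 4 chunks of 800 characters
big = estimated_retrieval_tokens(12, 2000)   # → 6000
small = estimated_retrieval_tokens(4, 800)   # → 800
print(big, small)
```

Going from `k=12` with large chunks to `k=4` with 800-character chunks cuts the retrieval payload by roughly an order of magnitude.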
### 2) Tool output being injected into the prompt
Agents often pass raw tool output back into the model. If a tool returns an entire CSV, PDF extraction, or API payload, your prompt can explode.
```python
# Problematic: raw tool output goes straight into the prompt
tool_result = get_claims_export()  # huge JSON blob
messages.append({"role": "tool", "content": tool_result})
```
Trim or summarize tool output before returning it to the agent:
```python
summary = summarize_tool_output(tool_result)
messages.append({"role": "tool", "content": summary})
```
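If you don't want a second LLM call to summarize, even a hard character cap helps. Here is a minimal sketch; `truncate_tool_output` is a hypothetical helper written for this article, not a LangChain API:

```python
def truncate_tool_output(text: str, max_chars: int = 4000) -> str:
    """Hard-cap tool output before it re-enters the prompt."""
    if len(text) <= max_chars:
        return text
    # Keep the head, and note how much was dropped so the model knows.
    dropped = len(text) - max_chars
    return text[:max_chars] + f"\n...[truncated {dropped} characters]"

print(len(truncate_tool_output("x" * 10_000)))
```

Truncation is lossy, so prefer it for logs and exports where the tail rarely matters; summarize instead when every record could be relevant.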
### 3) System prompt is too large
This happens when teams paste policies, SOPs, product docs, or long compliance instructions directly into the system message.
```python
# Entire policy corpus stuffed into the system prompt
system_prompt = open("all_policies.txt").read()
```
Move static content into retrieval instead of stuffing it into the prompt. Keep the system message short and stable.
### 4) Memory plus retrieved context plus user input exceeds the budget
Even if each piece looks reasonable alone, the combined payload can exceed limits.
```python
# History + docs + user question all at once
messages = [system_prompt] + chat_history + retrieved_docs + [user_question]
```
Set a hard token budget for each component:

- system: small and fixed
- history: recent turns only
- retrieval: top 3–5 chunks
- user input: raw but bounded
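One way to enforce this is a fixed allocation per component, checked before you build the prompt. The numbers below are illustrative assumptions for a 16k-context model, not LangChain defaults:

```python
# Illustrative token budget for a 16k-context model, reserving room
# for the completion. Allocations are example values, not defaults.
CONTEXT_WINDOW = 16_385
COMPLETION_RESERVE = 2_000

BUDGET = {
    "system": 500,
    "history": 6_000,
    "retrieval": 5_000,
    "user": 2_000,
}

def check_budget(budget: dict, window: int, reserve: int) -> int:
    """Return remaining headroom; raise if the plan cannot fit."""
    total = sum(budget.values()) + reserve
    if total > window:
        raise ValueError(f"Budget {total} exceeds context window {window}")
    return window - total

headroom = check_budget(BUDGET, CONTEXT_WINDOW, COMPLETION_RESERVE)
print(headroom)
```

Failing fast at budget-planning time is much cheaper than discovering the overflow as a 400 error from the API.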
## How to Debug It
1. Print the final prompt size before calling the model.
   - Log message count and approximate token count.
   - In LangChain, inspect what actually gets passed to `llm.invoke()` or `chain.invoke()`.
2. Isolate each input source.
   - Remove chat history.
   - Remove retriever results.
   - Remove tools.
   - Add them back one at a time until the error returns.
3. Check which class is building the prompt. Common culprits:
   - `ConversationBufferMemory`
   - `RunnableWithMessageHistory`
   - `RetrievalQA`
   - `create_retrieval_chain`
   - custom agent/tool wrappers
4. Compare against model limits. For example, `gpt-4o-mini` has a much smaller practical budget than larger-context models, and on an 8k/16k-context model long histories will fail quickly. Look for errors like:
   - `This model's maximum context length is ...`
   - `requested ... tokens (prompt + completion)`
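For step 1, a rough logging helper is usually enough to spot the offending component. The `// 4` estimate below is the usual characters-per-token heuristic; for exact counts, use a real tokenizer such as `tiktoken`:

```python
def log_prompt_size(messages: list[dict]) -> dict:
    """Report per-request message count and approximate token usage.

    Uses the rough ~4 characters-per-token heuristic; swap in a real
    tokenizer for exact numbers.
    """
    sizes = [len(m.get("content", "")) // 4 for m in messages]
    report = {
        "message_count": len(messages),
        "approx_tokens": sum(sizes),
        "largest_message_tokens": max(sizes, default=0),
    }
    print(report)
    return report

log_prompt_size([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Summarize this claim." * 50},
])
```

Tracking `largest_message_tokens` separately is useful because a single oversized tool result or retrieved chunk is often the whole problem.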
## Prevention
- Use token-aware trimming from day one. Prefer `trim_messages()` or summary memory over raw append-only buffers.
- Cap retrieval aggressively. Start with `k=3` or `k=4`, then increase only if evaluation proves it helps.
- Add prompt-size logging in production. Track message count, estimated tokens, retriever chunk sizes, and tool payload sizes per request.
If you’re building LangChain apps for production, treat context as a budgeted resource. The fix is usually not “use a bigger model” — it’s controlling what enters the prompt in the first place.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.