How to Fix 'context length exceeded in production' in LangChain (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: context-length-exceeded-in-production, langchain, python

When you see context length exceeded in production, it means the prompt you sent to the model is larger than the model’s context window. In LangChain, this usually shows up after you’ve chained together chat history, retrieved documents, tool outputs, and a fresh user message into one request.

This is not a LangChain bug in most cases. It’s almost always an application design issue: you’re feeding too much text into a model that has a hard token limit.

The Most Common Cause

The #1 cause is unbounded conversation history being passed into ChatOpenAI through a chain or agent.

A common failure pattern is appending every prior message forever, then sending the whole list on every turn. Once the conversation gets long enough, OpenAI returns errors like:

  • BadRequestError: This model's maximum context length is 16385 tokens. However, your messages resulted in 18240 tokens.
  • openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length has been exceeded"}}

Broken vs fixed

Broken pattern                       Fixed pattern
Keep all messages in memory          Trim or summarize old messages
Send full chat history every turn    Keep only recent turns
No token accounting                  Enforce a max token budget

# BROKEN
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage

llm = ChatOpenAI(model="gpt-4o-mini")

history = [
    HumanMessage(content="Hi"),
    AIMessage(content="Hello"),
    # ... grows forever
]

def ask(question: str):
    messages = history + [HumanMessage(content=question)]
    return llm.invoke(messages)

# FIXED
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage
from langchain_core.messages import trim_messages

llm = ChatOpenAI(model="gpt-4o-mini")

history = [
    HumanMessage(content="Hi"),
    AIMessage(content="Hello"),
]

def ask(question: str):
    messages = history + [HumanMessage(content=question)]
    trimmed = trim_messages(
        messages,
        max_tokens=6000,
        strategy="last",
        token_counter=llm,
        include_system=True,
        allow_partial=False,
    )
    return llm.invoke(trimmed)

If you’re using RunnableWithMessageHistory, the same problem applies. The wrapper does not magically solve token growth; it just stores messages for you.
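One way to wire trimming into that setup is to run trim_messages over the stored history before the prompt is built. A minimal sketch, reusing the trimming parameters from above and assuming an in-memory session store (the store dict, get_history helper, and session id are illustrative, not part of any LangChain default):

from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.messages import trim_messages
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables.history import RunnableWithMessageHistory

llm = ChatOpenAI(model="gpt-4o-mini")

store = {}  # session_id -> chat history (illustrative in-memory store)

def get_history(session_id: str) -> InMemoryChatMessageHistory:
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    MessagesPlaceholder("history"),
    ("human", "{question}"),
])

# Called without messages, trim_messages returns a runnable that trims
# the stored history to the token budget before it reaches the prompt.
trimmer = trim_messages(
    max_tokens=6000, strategy="last", token_counter=llm, include_system=True
)

chain = (
    RunnablePassthrough.assign(history=itemgetter("history") | trimmer)
    | prompt
    | llm
)

chat = RunnableWithMessageHistory(
    chain,
    get_history,
    input_messages_key="question",
    history_messages_key="history",
)

chat.invoke({"question": "Hi"}, config={"configurable": {"session_id": "demo"}})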

Other Possible Causes

1) Retriever returning too many documents

A RetrievalQA or custom RAG chain can blow up when k is too high or chunks are too large.

# Too much context from retrieval
retriever = vectorstore.as_retriever(search_kwargs={"k": 12})

Fix it by lowering k and chunk size:

retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

Also make sure your splitter isn’t producing giant chunks:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
)

2) Tool output being injected into the prompt

Agents often pass raw tool output back into the model. If a tool returns an entire CSV, PDF extraction, or API payload, your prompt can explode.

# Problematic tool output
tool_result = get_claims_export()  # huge JSON blob
messages.append({"role": "tool", "content": tool_result})

Trim or summarize tool output before returning it to the agent:

summary = summarize_tool_output(tool_result)
messages.append({"role": "tool", "content": summary})
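Note that summarize_tool_output is not a LangChain built-in; you have to write it. A rough sketch, where the 4,000-character threshold and the use of gpt-4o-mini as the condenser are arbitrary assumptions:

from langchain_openai import ChatOpenAI

condenser = ChatOpenAI(model="gpt-4o-mini")

def summarize_tool_output(raw: str, max_chars: int = 4000) -> str:
    # Small payloads pass through untouched; large ones get condensed
    # before they are appended to the agent's messages.
    if len(raw) <= max_chars:
        return raw
    # Cap what we send to the condenser too, so the summary call itself
    # cannot blow the context window.
    response = condenser.invoke(
        "Condense this tool output. Keep identifiers, amounts, and field names "
        "exact; drop repetition and boilerplate:\n\n" + raw[:20000]
    )
    return response.content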

3) System prompt is too large

This happens when teams paste policies, SOPs, product docs, or long compliance instructions directly into the system message.

system_prompt = open("all_policies.txt").read()

Move static content into retrieval instead of stuffing it into the prompt. Keep the system message short and stable.
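A minimal sketch of that split, assuming an in-memory vector store and OpenAI embeddings (the file name and chunk sizes are carried over from the examples above, not requirements):

from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Index the policy document once instead of pasting it into every prompt.
policy_text = open("all_policies.txt").read()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=800, chunk_overlap=100
).split_text(policy_text)

policy_store = InMemoryVectorStore.from_texts(chunks, OpenAIEmbeddings())
policy_retriever = policy_store.as_retriever(search_kwargs={"k": 3})

# The system message stays short and stable; only the few relevant
# policy chunks are attached to each individual request.
system_prompt = "You are a claims assistant. Follow the policy excerpts provided with each question."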

4) Memory plus retrieved context plus user input exceeds budget

Even if each piece looks reasonable alone, the combined payload can exceed limits.

# History + docs + user question all at once
messages = [system_prompt] + chat_history + retrieved_docs + [user_question]

Set a hard token budget for each component (one way to enforce this is sketched after the list):

  • system: small and fixed
  • history: recent turns only
  • retrieval: top 3–5 chunks
  • user input: raw but bounded
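A sketch of enforcing those budgets, reusing the llm defined earlier and assuming retrieved_docs is a list of LangChain Document objects; the specific numbers are arbitrary placeholders:

from langchain_core.messages import HumanMessage, SystemMessage, trim_messages

HISTORY_BUDGET = 4000  # tokens reserved for prior turns
DOC_BUDGET = 3000      # tokens reserved for retrieved chunks

def build_messages(system_prompt, chat_history, retrieved_docs, user_question):
    # History: keep only the most recent turns that fit the budget.
    recent = trim_messages(
        chat_history, max_tokens=HISTORY_BUDGET, strategy="last", token_counter=llm
    )
    # Retrieval: stop adding chunks once the document budget is spent.
    kept, used = [], 0
    for doc in retrieved_docs:
        tokens = llm.get_num_tokens(doc.page_content)
        if used + tokens > DOC_BUDGET:
            break
        kept.append(doc.page_content)
        used += tokens
    context = "\n\n".join(kept)
    return (
        [SystemMessage(content=system_prompt)]
        + recent
        + [HumanMessage(content=f"Context:\n{context}\n\nQuestion: {user_question}")]
    )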

How to Debug It

  1. Print the final prompt size before calling the model

    • Log message count and approximate token count (a small logging helper is sketched after this list).
    • In LangChain, inspect what actually gets passed to llm.invoke() or chain.invoke().
  2. Isolate each input source

    • Remove chat history.
    • Remove retriever results.
    • Remove tools.
    • Add them back one at a time until the error returns.
  3. Check which class is building the prompt

    • Common culprits:
      • ConversationBufferMemory
      • RunnableWithMessageHistory
      • RetrievalQA
      • create_retrieval_chain
      • custom agent/tool wrappers
  4. Compare against model limits

    • Example:
      • An 8k/16k-context model (like the 16385-token limit in the error above) fills up quickly with long histories.
      • Even large-context models such as gpt-4o-mini (128k tokens) can be exceeded once history, retrieved documents, and tool output are combined.
    • Look for errors like:
      • This model's maximum context length is ...
      • requested ... tokens (prompt + completion)
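For step 1, a small helper along these lines usually suffices; get_num_tokens_from_messages is a standard method on LangChain chat models, and the logger setup here is just an example:

import logging

logger = logging.getLogger(__name__)

def log_prompt_size(llm, messages):
    # Log message count and an estimated token count right before the call.
    estimated_tokens = llm.get_num_tokens_from_messages(messages)
    logger.info("prompt: %d messages, ~%d tokens", len(messages), estimated_tokens)
    return messages

# Usage: wrap the messages immediately before the model call.
# response = llm.invoke(log_prompt_size(llm, messages))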

Prevention

  • Use token-aware trimming from day one.
    • Prefer trim_messages() or summary memory over raw append-only buffers (a summary-based sketch follows this list).
  • Cap retrieval aggressively.
    • Start with k=3 or k=4, then increase only if evaluation proves it helps.
  • Add prompt-size logging in production.
    • Track message count, estimated tokens, retriever chunk sizes, and tool payload sizes per request.
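For the first point, a summary-based variant keeps older context without keeping all of its tokens. A rough sketch, reusing trim_messages and a cheap model; the budget, the prompt wording, and the assumption that trim_messages returns a suffix of the history are all simplifications:

from langchain_openai import ChatOpenAI
from langchain_core.messages import AIMessage, trim_messages

llm = ChatOpenAI(model="gpt-4o-mini")

def compact_history(history, max_tokens: int = 4000):
    # Keep the most recent turns verbatim...
    recent = trim_messages(
        history, max_tokens=max_tokens, strategy="last", token_counter=llm
    )
    older = history[: len(history) - len(recent)]
    if not older:
        return recent
    # ...and fold everything older into a single short summary message.
    summary = llm.invoke(
        "Summarize this conversation in a few sentences, keeping names and decisions:\n\n"
        + "\n".join(f"{m.type}: {m.content}" for m in older)
    )
    return [AIMessage(content=f"Summary of earlier conversation: {summary.content}")] + recent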

If you’re building LangChain apps for production, treat context as a budgeted resource. The fix is usually not “use a bigger model” — it’s controlling what enters the prompt in the first place.



By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
