How to Fix 'token limit exceeded in production' in LangChain (Python)

By Cyprian Aarons · Updated 2026-04-21

A “token limit exceeded” error in a production LangChain app usually means your prompt, retrieved context, chat history, or tool output pushed the model past its context window. In practice, it shows up when a chain works fine in local tests but starts failing once real user conversations and larger documents hit the same path.

The fix is almost never “increase the limit.” You need to find which part of your input is growing uncontrollably and trim it before it reaches the model.

The Most Common Cause

The #1 cause is unbounded conversation history being stuffed into every request. This happens a lot with ConversationBufferMemory, MessagesPlaceholder, or manual prompt assembly where you keep appending messages forever.

Here’s the broken pattern:

from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationBufferMemory(return_messages=True)

chain = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True,
)

# Every turn gets appended forever
response = chain.predict(input="Summarize my policy claim.")

And here’s the fixed pattern using bounded memory:

from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationChain

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationBufferWindowMemory(k=6, return_messages=True)  # keep only the last 6 exchanges

chain = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True,
)

response = chain.predict(input="Summarize my policy claim.")

If you’re using LCEL or a custom prompt with MessagesPlaceholder, keep the same rule: only pass the last N turns or summarize older turns.
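
For example, here is a minimal LCEL sketch that only passes a bounded slice of history through MessagesPlaceholder. The last_n_messages helper and the inline sample history are illustrative, not LangChain APIs:

from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an insurance claims assistant."),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])
chain = prompt | llm

# In practice this comes from your session store; shown inline for the example
full_history = [
    HumanMessage(content="I want to file a claim for water damage."),
    AIMessage(content="Sure. What is your policy number?"),
]

def last_n_messages(history, n=12):
    # Keep only the most recent n messages (roughly the last n/2 turns)
    return history[-n:]

response = chain.invoke({
    "chat_history": last_n_messages(full_history),
    "input": "Summarize my policy claim.",
})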

Broken                          Fixed
ConversationBufferMemory()      ConversationBufferWindowMemory(k=6)
Append all messages forever     Keep the last 4–10 turns
Full transcript in every call   Summarize old context
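
If a hard window loses too much context, summarization memory (the last row of the table) is the other bounded option. Here is a sketch using ConversationSummaryBufferMemory; the 1,000-token limit is an arbitrary example value:

from langchain_openai import ChatOpenAI
from langchain.memory import ConversationSummaryBufferMemory
from langchain.chains import ConversationChain

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Recent turns stay verbatim; older turns get folded into a running summary
# once the buffer grows past max_token_limit.
memory = ConversationSummaryBufferMemory(llm=llm, max_token_limit=1000, return_messages=True)

chain = ConversationChain(llm=llm, memory=memory, verbose=True)
response = chain.predict(input="Summarize my policy claim.")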

Other Possible Causes

1) Retrieval returns too many chunks

If you’re using RetrievalQA, ConversationalRetrievalChain, or a custom retriever, you may be stuffing too many chunks into the prompt.

retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

Fix it by reducing k and chunk size:

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

Also check your splitter:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)

If your chunks are 2,000+ tokens each, even k=3 can blow the context window.
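
A quick way to confirm this is to measure what the retriever actually returns before it reaches the prompt. A rough sketch with tiktoken, reusing the retriever from above (the cl100k_base encoding is an assumption; use the encoding that matches your model):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumption: match this to your model

docs = retriever.invoke("What does my policy cover for water damage?")
for i, doc in enumerate(docs):
    print(f"chunk {i}: {len(enc.encode(doc.page_content))} tokens")

total = sum(len(enc.encode(d.page_content)) for d in docs)
print(f"total retrieved context: {total} tokens")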

2) Tool output is being injected raw into the prompt

Agent tool calls can return large JSON blobs, HTML pages, or long database rows. If you feed that raw into the next LLM call, token usage spikes fast.

Broken:

tool_result = run_report()
prompt = f"Use this data:\n{tool_result}"

Fixed:

tool_result = run_report()
prompt = f"Use this summary:\n{tool_result[:2000]}"

Better still: summarize tool output before passing it back to the agent.
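
A minimal sketch of that idea: a cheap second model call compresses the raw output before it re-enters the agent's context. Here run_report is the placeholder tool from the example above, and the word and character limits are arbitrary:

from langchain_openai import ChatOpenAI

summarizer = ChatOpenAI(model="gpt-4o-mini", temperature=0)

tool_result = run_report()  # placeholder tool from the example above

# Hard cap first so the summarization call itself stays within budget,
# then compress the output before it goes back into the agent's prompt.
summary = summarizer.invoke(
    "Summarize this report in under 200 words, keeping key figures and IDs:\n\n"
    + tool_result[:20000]
).content

prompt = f"Use this summary:\n{summary}"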

3) Your system prompt is too large

This happens in enterprise apps where people paste policies, compliance text, routing rules, and examples into one giant system message.

Broken:

system_prompt = open("all_policies.txt").read()

Fixed:

system_prompt = """
You are an insurance claims assistant.
Follow these rules:
- Ask for missing policy number
- Never invent coverage details
- Escalate fraud indicators to a human
"""

Keep long policy docs out of the system prompt. Store them in retrieval and fetch only relevant sections.
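
A sketch of that split: a short system prompt plus only the sections retrieved for the current question. Here policy_vectorstore is assumed to be a vector store you have already built over the policy documents:

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
policy_retriever = policy_vectorstore.as_retriever(search_kwargs={"k": 3})

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an insurance claims assistant. Answer using only the provided policy excerpts."),
    ("human", "Policy excerpts:\n{context}\n\nQuestion: {question}"),
])

question = "Is water damage from a burst pipe covered?"
docs = policy_retriever.invoke(question)
context = "\n\n".join(d.page_content for d in docs)

response = (prompt | llm).invoke({"context": context, "question": question})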

4) You are using a model with a smaller context window than you think

A lot of production failures come from switching models without updating assumptions. For example, code that worked on a larger-context model may fail when pointed at a smaller one through config drift.

Check what you actually deployed:

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

If your input size is large, move to a larger-context model or reduce input size. Don’t assume all chat models have the same window.
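
One cheap guardrail is to resolve the model name from config in exactly one place and log it at startup, so drift between environments is visible. The LLM_MODEL environment variable here is just an example; use whatever your config layer provides:

import os

from langchain_openai import ChatOpenAI

model_name = os.getenv("LLM_MODEL", "gpt-4o-mini")  # example config source
print(f"Deployed chat model: {model_name}")

llm = ChatOpenAI(model=model_name, temperature=0)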

How to Debug It

  1. Log token counts before every LLM call
    Use tiktoken or LangChain callbacks to measure prompt size. If you don’t know input size, you’re guessing. A tiktoken sketch follows this list.

  2. Print the exact payload going into the model
    Inspect:

    • system message
    • user message
    • chat history
    • retrieved documents
    • tool outputs
  3. Reduce one variable at a time
    Test with:

    • no memory
    • no retriever results
    • no tool outputs

    When the error disappears, you found the source.

  4. Catch model-side context errors explicitly
    Real errors often look like this:

try:
    response = chain.invoke({"input": user_input})
except Exception as e:
    # Log the provider's error type and message (e.g. BadRequestError for context overflow)
    print(type(e).__name__, str(e))

Common messages include:

  • BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is ..."}}
  • openai.BadRequestError: Request too large for gpt-4o-mini
  • ValueError: Prompt too long
  • provider-specific truncation errors from Anthropic or Azure OpenAI
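
A minimal sketch of step 1, counting tokens for each part of the payload with tiktoken before the call. The cl100k_base encoding and the placeholder strings are assumptions; substitute the exact strings your chain assembles (step 2):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumption: match this to your model

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

# Placeholders: substitute the exact strings listed in step 2
parts = {
    "system": "You are an insurance claims assistant. ...",
    "history": "Human: I want to file a claim.\nAI: What is your policy number?",
    "retrieved": "Section 4.2: Water damage from burst pipes is covered when ...",
    "user": "Summarize my policy claim.",
}

for name, text in parts.items():
    print(f"{name}: {count_tokens(text)} tokens")

print(f"total: {sum(count_tokens(t) for t in parts.values())} tokens")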

Prevention

  • Use bounded memory from day one: ConversationBufferWindowMemory, summarization memory, or manual truncation.
  • Cap retrieval aggressively:
    • smaller chunks
    • lower k
    • rerank before stuffing documents into prompts
  • Add token budget checks before calling the model (see the sketch after this list):
    • reject oversized requests early
    • summarize or truncate automatically
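
A sketch of such a budget check. The 8,000-token budget and the keep-the-first-tokens fallback are illustrative choices; in a real app you would summarize or drop the lowest-ranked context instead:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
MAX_PROMPT_TOKENS = 8000  # illustrative budget; leave headroom for the completion

def enforce_budget(prompt_text: str) -> str:
    tokens = enc.encode(prompt_text)
    if len(tokens) <= MAX_PROMPT_TOKENS:
        return prompt_text
    # Crude fallback: keep only the first MAX_PROMPT_TOKENS tokens
    return enc.decode(tokens[:MAX_PROMPT_TOKENS])

long_prompt = "Use this data:\n" + ("policy clause text " * 4000)
safe_prompt = enforce_budget(long_prompt)
print(f"{len(enc.encode(safe_prompt))} tokens after enforcement")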

If this failed only in production, treat it as an input-growth problem, not an LLM problem. LangChain will happily assemble a giant request for you; it’s on your code to keep that request inside the model’s context window.

