How to Fix 'token limit exceeded' in LangChain (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: token-limit-exceeded, langchain, python

A 'token limit exceeded' error in LangChain means the model input grew larger than the context window the LLM provider allows. In practice, this usually happens when you keep stuffing chat history, retrieved documents, or long prompts into a chain without trimming anything.

You’ll see it most often in ConversationalRetrievalChain, ConversationBufferMemory, StuffDocumentsChain, or any custom prompt that keeps growing on every turn.

The Most Common Cause

The #1 cause is unbounded conversation memory. ConversationBufferMemory keeps appending every user message and assistant reply, then LangChain sends the whole transcript back to the model on each call.

That works for short chats. It breaks fast once the conversation gets long.

| Broken pattern | Fixed pattern |
| --- | --- |
| ConversationBufferMemory() with no limit | ConversationTokenBufferMemory() or summary-based memory |
| Full chat history sent every turn | Old messages trimmed or summarized |
| Token growth on every request | Token budget stays bounded |

Wrong

from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

llm = ChatOpenAI(model="gpt-3.5-turbo")
memory = ConversationBufferMemory()

chain = ConversationChain(llm=llm, memory=memory)

print(chain.predict(input="Explain our claims workflow."))
print(chain.predict(input="Now add exceptions for fraud review."))
# Eventually:
# openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 4096 tokens..."}}

Right

from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationTokenBufferMemory

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

memory = ConversationTokenBufferMemory(
    llm=llm,
    max_token_limit=1500,
)

chain = ConversationChain(llm=llm, memory=memory)

print(chain.predict(input="Explain our claims workflow."))
print(chain.predict(input="Now add exceptions for fraud review."))

If you need better long-term retention, use summary memory instead of raw buffer memory:

from langchain.memory import ConversationSummaryBufferMemory

memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=1500,
)
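
Instead of discarding old turns outright, ConversationSummaryBufferMemory keeps recent messages verbatim and compresses older ones into a running LLM-generated summary once the buffer passes max_token_limit, trading a few extra LLM calls for longer retention.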

Other Possible Causes

1) You are stuffing too many retrieved documents into the prompt

This is common with RetrievalQA, ConversationalRetrievalChain, and StuffDocumentsChain. If your retriever returns 10 long policy PDFs, LangChain will concatenate them and blow past the limit.

from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",   # risky with large docs
    retriever=retriever,
)

Fix it by reducing k, chunk size, or switching away from "stuff":

retriever.search_kwargs["k"] = 3

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=retriever,
)
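
If you build the retriever from a vector store, you can also cap k at construction time; a one-liner sketch, assuming a vectorstore object (FAISS, Chroma, or similar):

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})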

2) Your prompt template is too verbose

A giant system prompt plus instructions plus examples can consume a surprising amount of context before the user even types anything.

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", """
You are an expert insurance assistant.
Follow these 40 rules...
Here are 12 examples...
"""),
    ("human", "{question}")
])

Trim it hard. Keep only what changes model behavior:

prompt = ChatPromptTemplate.from_messages([
    ("system", "You answer insurance questions using only provided policy context."),
    ("human", "{question}")
])

3) Your document chunks are too large

If you split PDFs into huge chunks, retrieval brings back oversized text blocks that eat the entire budget.

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,
    chunk_overlap=500,
)

Use smaller chunks for RAG:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
)
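
Note that chunk_size above is measured in characters, not tokens. If you'd rather budget in tokens directly, RecursiveCharacterTextSplitter has a tiktoken-backed constructor; a minimal sketch with illustrative sizes:

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-3.5-turbo",
    chunk_size=300,   # tokens, not characters
    chunk_overlap=30,
)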

4) You are using the wrong model for the context size

Sometimes the code is fine, but the model is not. A small-context model will fail where a larger-context one would work.

llm = ChatOpenAI(model="gpt-3.5-turbo")  # smaller context than newer long-context models

Switch to a larger-context model if your use case needs it (gpt-4o-mini, for example, has a 128k-token context window):

llm = ChatOpenAI(model="gpt-4o-mini")

How to Debug It

  1. Read the actual provider error

    • OpenAI-style errors usually look like:
      • BadRequestError: Error code: 400
      • "This model's maximum context length is X tokens"
    • Anthropic-style errors often mention:
      • "prompt is too long"
    • This tells you whether you hit a hard token ceiling or a malformed request.
  2. Check what LangChain is sending

    • Log prompt size before calling the chain.
    • Inspect memory content, retrieved docs, and system prompts.
    • If you’re using callbacks or tracing, print the final assembled messages.
messages = prompt.format_messages(question=user_question)  # user_question: your input string
for m in messages:
    print(type(m).__name__, len(m.content))
  3. Isolate the source

    • Disable memory first.
    • Then disable retrieval.
    • Then shrink the system prompt.
    • Re-enable pieces one at a time until it breaks again.
  4. Measure token usage directly

    • Use your tokenizer or provider usage metadata.
    • In OpenAI responses, check usage.prompt_tokens.
    • If prompt tokens are near the model's max before generation starts, you found the culprit (see the sketch after this list).
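
A quick pre-flight check with tiktoken (the tokenizer behind OpenAI chat models) catches this before the request goes out; a minimal sketch, keeping in mind that chat requests add a few tokens of per-message overhead:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(text: str) -> int:
    # Approximate: per-message chat overhead is not included
    return len(enc.encode(text))

print(count_tokens("Explain our claims workflow."))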

Prevention

  • Use bounded memory by default:
    • ConversationTokenBufferMemory
    • ConversationSummaryBufferMemory
  • Put hard limits on retrieval:
    • lower k
    • smaller chunks
    • avoid "stuff" for large corpora
  • Treat token budget as a first-class constraint (a minimal check is sketched below):
    • reserve space for output
    • leave headroom for tool calls and retries
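
To make that budget concrete, here is a minimal sketch of a pre-call check. The window size and reserve are illustrative assumptions, not LangChain defaults:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

MODEL_CONTEXT = 16_385    # gpt-3.5-turbo-0125's context window (assumed model)
RESERVED_OUTPUT = 1_000   # headroom reserved for the model's reply

def fits_budget(prompt_text: str) -> bool:
    # Fail fast locally instead of waiting for a 400 from the provider
    return len(enc.encode(prompt_text)) + RESERVED_OUTPUT <= MODEL_CONTEXT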


By Cyprian Aarons, AI Consultant at Topiax.
