How to Fix 'token limit exceeded during development' in LangChain (Python)

By Cyprian Aarons · Updated 2026-04-21

When you see token limit exceeded during development in a LangChain Python app, it usually means the prompt you are assembling grows faster than you expect. In practice, it shows up when you stuff too much chat history, retrieved context, or tool output into a model call and the final request crosses the model’s context window.

The failure is often not in the LLM call itself. It’s usually in how you build the input chain before the call reaches ChatOpenAI, ChatAnthropic, or whatever chat model wrapper you’re using.

The Most Common Cause

The #1 cause is unbounded conversation memory.

A lot of developers wire ConversationBufferMemory into an agent or chain and keep appending messages forever. That works for a few turns, then explodes once the accumulated messages plus system prompt plus user input exceed the model limit.

Here’s the broken pattern and the fixed pattern side by side:

Broken                                   Fixed
Keeps every message forever              Trims to recent turns or token budget
Uses ConversationBufferMemory blindly    Uses ConversationBufferWindowMemory or token-aware memory
No guard before sending to model         Checks prompt size before invoke
# BROKEN
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationBufferMemory(return_messages=True)

chain = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True,
)

for i in range(100):
    chain.invoke({"input": f"Turn {i}: explain the policy change again"})

# FIXED
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationChain

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationBufferWindowMemory(
    k=6,              # keep only the last 6 exchanges (human + AI pairs)
    return_messages=True,
)

chain = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True,
)

for i in range(100):
    chain.invoke({"input": f"Turn {i}: explain the policy change again"})

If you want stricter control, use token-aware trimming instead of turn-count trimming. For production systems, that’s usually better because some turns are much larger than others.

from langchain_core.messages import trim_messages

# history is your accumulated list of chat messages (BaseMessage objects)
trimmed_messages = trim_messages(
    history,
    max_tokens=3000,
    strategy="last",  # keep the most recent messages that fit the budget
    token_counter=llm.get_num_tokens_from_messages,
)
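
Called without a messages argument, trim_messages also returns a runnable, so you can compose the trimming step directly into a chain. A minimal sketch, assuming the llm from the examples above:

trimmer = trim_messages(
    max_tokens=3000,
    strategy="last",
    token_counter=llm.get_num_tokens_from_messages,
)
chain = trimmer | llm  # messages in, trimmed messages out, then the model call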

Other Possible Causes

1. Retriever returns too many chunks

If you use RAG and pull back 10 large chunks, your context can blow up even with short chat history.

# Too much context
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
docs = retriever.invoke("claims processing requirements")

Fix it by reducing k, chunk size, or both.

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
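
Chunk size is set at indexing time, not query time. Here is a minimal sketch using RecursiveCharacterTextSplitter; the chunk_size and chunk_overlap values are illustrative, and docs stands in for your already-loaded documents:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,     # smaller chunks mean less context per retrieved doc
    chunk_overlap=100,
)
chunks = splitter.split_documents(docs)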

2. Your prompt template duplicates content

This happens when you pass the same document text into multiple placeholders.

# BROKEN: same content injected twice
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "Use this context: {context}\n\nContext again: {context}"),
    ("human", "{question}"),
])

Fix it by keeping one source of truth in the template.

prompt = ChatPromptTemplate.from_messages([
    ("system", "Use this context:\n{context}"),
    ("human", "{question}")
])

3. Tool output is being appended raw

Agents can dump massive JSON blobs from tools into the next LLM call. That’s a common cause with AgentExecutor and function-calling workflows.

# Bad: raw tool output goes straight back into memory/history
result = agent_executor.invoke({"input": user_input})

Add output truncation or summarize tool responses before they re-enter the prompt.

def clip(text: str, max_chars: int = 4000) -> str:
    return text[:max_chars]

tool_result = clip(raw_tool_output)
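
If hard truncation loses too much signal, summarizing the tool output is a reasonable alternative. A minimal sketch, assuming the llm instance from the earlier examples; summarize_tool_output is a hypothetical helper, not a LangChain API:

def summarize_tool_output(raw: str, max_chars: int = 4000) -> str:
    # Short outputs pass through untouched; only long ones pay for an extra call.
    if len(raw) <= max_chars:
        return raw
    response = llm.invoke(
        "Summarize this tool output in under 200 words, "
        f"preserving IDs, numbers, and error codes:\n\n{raw}"
    )
    return response.content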

4. Model context window is smaller than you think

Some models have much smaller limits than GPT-4 class models. If you swap providers or downgrade models during development, your previously working prompt may start failing.

Typical error variants include:

  • BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is ..."}}
  • context_length_exceeded
  • Token indices sequence length is longer than the specified maximum sequence length

Check your exact model name and its documented context window before blaming LangChain.
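
One way to make this explicit is a small lookup you maintain yourself. The numbers below are illustrative, not authoritative; always confirm the documented window for your exact model and provider:

# Illustrative values only -- verify against your provider's docs.
CONTEXT_WINDOWS = {
    "gpt-4o-mini": 128_000,
    "gpt-3.5-turbo": 16_385,
    "claude-3-haiku-20240307": 200_000,
}

def max_input_tokens(model: str, reserved_for_output: int = 1_000) -> int:
    # Leave headroom for generation so input + output fits the window.
    return CONTEXT_WINDOWS[model] - reserved_for_output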

How to Debug It

  1. Print the final message list before calling the model

    • Inspect system messages, chat history, retrieved docs, and tool outputs.
    • In LangChain, log what you send to ChatOpenAI.invoke() or inside your chain callback.
  2. Measure token count at each stage

    • Count tokens after prompt assembly, after retrieval, and after memory injection.
    • If your model wrapper supports it, use get_num_tokens_from_messages(); a combined sketch for steps 1 and 2 follows this list.
  3. Disable components one by one

    • Turn off memory first.
    • Then reduce retriever k.
    • Then remove tools.
    • The component that makes the error disappear is your culprit.
  4. Compare against model limits

    • Check input tokens + expected output tokens.
    • Leave headroom for generation.
    • If your prompt uses 7k tokens on an 8k-context model, it will fail under load.
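
Steps 1 and 2 fit in a few lines. A minimal sketch, assuming the prompt template from the duplication example above, plus docs_text and user_input variables you already have in scope:

messages = prompt.format_messages(context=docs_text, question=user_input)

# Step 1: inspect exactly what will be sent
for m in messages:
    print(f"[{m.type}] {str(m.content)[:200]}")

# Step 2: measure token usage before the call
print("prompt tokens:", llm.get_num_tokens_from_messages(messages))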

Prevention

  • Use token-aware trimming for chat history instead of unbounded buffers.
  • Cap retriever results and summarize long documents before they hit the prompt.
  • Add a preflight token check in your chain wrapper so oversized requests fail early with a clear message (see the sketch below).
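
A minimal preflight guard, assuming the llm and message list shapes from the examples above; the 7,000-token budget is an arbitrary placeholder:

class PromptTooLargeError(ValueError):
    pass

def checked_invoke(llm, messages, max_input_tokens: int = 7_000):
    # Fail fast with a clear error instead of a 400 from the provider.
    used = llm.get_num_tokens_from_messages(messages)
    if used > max_input_tokens:
        raise PromptTooLargeError(
            f"Prompt uses {used} tokens; budget is {max_input_tokens}."
        )
    return llm.invoke(messages)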

If you’re building agents for production, treat context like memory management in backend systems: every byte has a cost. LangChain won’t save you from oversized prompts unless you explicitly design for it.


By Cyprian Aarons, AI Consultant at Topiax.