How to Fix 'context length exceeded' in LangChain (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: context-length-exceeded, langchain, python

When you see context length exceeded in LangChain, the provider is telling you that the total tokens in your prompt + chat history + retrieved documents + tool output exceed the model's context window. It usually shows up in chat apps, retrieval chains, or agent loops where memory keeps growing until the next LLM call fails.

In LangChain Python, this often surfaces as a provider error like:

  • openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 16385 tokens..."}}
  • litellm.BadRequestError: context_length_exceeded
  • anthropic.BadRequestError: prompt is too long

The Most Common Cause

The #1 cause is unbounded conversation history or chain input growth.

A classic example is a ConversationBufferMemory or agent loop that keeps appending every message forever. Eventually, the next call to ChatOpenAI, ChatAnthropic, or another chat model blows past the context window.

  Broken pattern                                   Fixed pattern
  Keep all messages forever                        Trim, summarize, or window the memory
  Pass full docs + full chat history every turn    Limit retrieved chunks and memory size

# BROKEN: memory grows without bound
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

llm = ChatOpenAI(model="gpt-4o-mini")
memory = ConversationBufferMemory(return_messages=True)

chain = ConversationChain(llm=llm, memory=memory)

for msg in user_messages:
    print(chain.invoke({"input": msg}))

# FIXED: use a bounded memory strategy
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationSummaryBufferMemory
from langchain.chains import ConversationChain

llm = ChatOpenAI(model="gpt-4o-mini")
memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=2000,
    return_messages=True,
)

chain = ConversationChain(llm=llm, memory=memory)

for msg in user_messages:
    print(chain.invoke({"input": msg}))

If you want stricter control, use a windowed memory instead:

from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=6, return_messages=True)

That keeps only the last k turns instead of the entire transcript.
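
Newer langchain-core releases also ship a trim_messages helper that bounds a raw message list the same way. A minimal sketch, assuming the ChatOpenAI instance llm from above and a list of messages called chat_history:

from langchain_core.messages import trim_messages

trimmed_history = trim_messages(
    chat_history,         # your list of BaseMessage objects
    max_tokens=2000,
    strategy="last",      # keep the most recent messages
    token_counter=llm,    # count tokens with the chat model's tokenizer
    include_system=True,  # never drop the system message
)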

Other Possible Causes

1) Retrieval returns too many chunks

If you're using RetrievalQA, ConversationalRetrievalChain, or a custom RAG pipeline, your retriever may be stuffing too many documents into the prompt.

# BROKEN
retriever = vectorstore.as_retriever(search_kwargs={"k": 12})

# FIXED
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

If chunks are large, reduce chunk size too:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
)
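
Beyond k and chunk size, you can also hard-cap the combined context after retrieval so one oversized set of chunks can't blow the budget. A minimal sketch; format_docs is a hypothetical helper you would wire into your own chain:

def format_docs(docs, max_chars: int = 6000) -> str:
    # Join the retrieved chunks, then hard-cap the combined length
    joined = "\n\n".join(doc.page_content for doc in docs)
    return joined[:max_chars]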

2) Tool output is being fed back raw

Agents can explode token usage when long tool outputs get appended to the scratchpad on every loop iteration.

# BROKEN: huge tool output goes straight into the scratchpad/history
result = agent_executor.invoke({"input": "Summarize this PDF"})

Fix by truncating tool output or summarizing before re-inserting it:

def truncate(text: str, limit: int = 3000) -> str:
    return text[:limit]

safe_output = truncate(tool_result)  # tool_result: the raw string returned by your tool
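
If the output comes from a LangChain tool, you can also cap it inside the tool itself so nothing oversized ever reaches the agent scratchpad. A rough sketch; load_pdf_text is a placeholder for whatever loader you actually use:

from langchain_core.tools import tool

@tool
def read_pdf(path: str) -> str:
    """Return the text of a PDF, capped so it fits the context budget."""
    text = load_pdf_text(path)  # placeholder: your real PDF loader goes here
    return truncate(text, limit=3000)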

3) Prompt template is too verbose

Sometimes the issue is not memory. It’s just an oversized system prompt plus examples plus user input.

# BROKEN: giant static prompt
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", VERY_LONG_POLICY_TEXT),
    ("human", "{question}")
])

Trim policy text and move stable instructions into shorter rules. If you need examples, keep one or two high-signal examples instead of dumping an entire spec.
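
As a rough sketch, here is the same prompt with the policy boiled down to a few rules plus a single high-signal example (the example text is made up for illustration):

# FIXED: short system rules + one example
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer from the expense policy. Cite the section you used. If the policy doesn't cover it, say so."),
    ("human", "Can I expense meals while travelling?"),
    ("ai", "Yes, up to the daily per-diem in section 4.2."),
    ("human", "{question}"),
])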

4) Wrong model context window assumption

A lot of people assume all GPT-style models have the same limit. They don’t.

  Model family                    Typical issue
  Smaller chat models             Easy to exceed with RAG + history
  Older OpenAI models             Lower context windows than expected
  Anthropic / Bedrock wrappers    Provider-specific prompt limits

Check your actual model name:

llm = ChatOpenAI(model="gpt-4o-mini")  # not "whatever default"

Then verify its max context window from the provider docs before designing your chain.
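
A small preflight guard makes that assumption explicit. A sketch; the limits below are examples, so confirm them against your provider's current docs:

# Sketch: fail fast when a prompt will not fit the chosen model
CONTEXT_LIMITS = {
    "gpt-4o-mini": 128_000,   # example value - confirm in the OpenAI docs
    "gpt-3.5-turbo": 16_385,  # matches the 16385 in the error above
}

def assert_fits(model: str, prompt_tokens: int, max_output_tokens: int = 1_000) -> None:
    limit = CONTEXT_LIMITS.get(model)
    if limit is not None and prompt_tokens + max_output_tokens > limit:
        raise ValueError(
            f"{prompt_tokens} prompt tokens + {max_output_tokens} output tokens "
            f"exceeds the {limit}-token window of {model}"
        )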

How to Debug It

  1. Measure tokens before calling the model

    Use a tokenizer or LangChain’s message formatting path to estimate prompt size.

    from tiktoken import encoding_for_model
    
    enc = encoding_for_model("gpt-4o-mini")
    token_count = len(enc.encode(full_prompt_text))
    print(token_count)
    
  2. Log each component separately

    Break down:

    • system prompt
    • chat history
    • retrieved documents
    • tool output
    • user input

    One of these will usually dominate; the sketch after this list shows a quick way to print each count.

  3. Reduce one variable at a time

    Temporarily set:

    • retriever k=2
    • memory window to 2 turns
    • tool output truncation to 1000 chars

    If the error disappears, you found the source.

  4. Inspect the exact provider error

    Don’t stop at LangChain’s wrapper exception. Look for messages like:

    • maximum context length
    • prompt is too long
    • context_length_exceeded
    • This model's maximum context length is ...

    That tells you whether it’s input size, output size, or both.
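
To see which component dominates (step 2 above), here is a rough per-component logging sketch. The variable names (system_prompt, history_text, retrieved_docs, tool_output, user_input) stand in for whatever your chain actually assembles:

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by the gpt-4o family

parts = {
    "system prompt": system_prompt,
    "chat history": history_text,
    "retrieved docs": "\n\n".join(d.page_content for d in retrieved_docs),
    "tool output": tool_output,
    "user input": user_input,
}
for name, text in parts.items():
    print(f"{name}: {len(enc.encode(text))} tokens")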

Prevention

  • Use bounded memory by default:

    • ConversationBufferWindowMemory
    • ConversationSummaryBufferMemory
    • custom summarization after N turns
  • Put hard limits on retrieval and tools:

    • smaller k
    • smaller chunk sizes
    • truncate long tool outputs before they hit the LLM
  • Add token checks in CI or preflight logging (see the sketch after this list) for any production chain that combines:

    • chat history
    • RAG context
    • agent tools
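
A sketch of what a minimal CI-style check could look like; the prompt object, the representative question, and the 8,000-token budget are all assumptions about your setup:

import tiktoken

def test_prompt_fits_budget():
    enc = tiktoken.get_encoding("o200k_base")
    rendered = prompt.format(question="representative worst-case question")
    # Leave headroom for chat history, RAG chunks, and the model's reply
    assert len(enc.encode(rendered)) < 8_000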

By Cyprian Aarons, AI Consultant at Topiax.
