How to Fix 'context length exceeded' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: context-length-exceeded, llamaindex, python

What the error means

context length exceeded in LlamaIndex usually means you sent more tokens to an LLM than the model can handle in one request. In practice, this shows up when you stuff too many retrieved chunks into a prompt, build a query engine with oversized context, or pass a huge chat history into an agent.

The exact failure often looks like one of these:

  • ValueError: Input length exceeds maximum context length
  • BadRequestError: This model's maximum context length is ...
  • context_length_exceeded

The Most Common Cause

The #1 cause is overloading the prompt with too many retrieved nodes.

A common mistake is using a retriever that returns too many chunks, then feeding all of them into a ResponseSynthesizer or query engine without controlling similarity_top_k, chunk size, or response mode.

Broken pattern: retrieve too much context and send it all to the LLM.
# BROKEN: too many nodes get stuffed into the prompt
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(
    similarity_top_k=20,   # too high for long documents
    response_mode="compact"
)

response = query_engine.query("Summarize the policy exclusions")
print(response)

Fixed pattern: limit retrieval and use compact response synthesis.

# FIXED: reduce retrieved context and control chunking
from llama_index.core import Settings
from llama_index.core import VectorStoreIndex

Settings.chunk_size = 512
Settings.chunk_overlap = 50

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(
    similarity_top_k=4,
    response_mode="compact"
)

response = query_engine.query("Summarize the policy exclusions")
print(response)

If you’re using RetrieverQueryEngine, the same rule applies. Don’t assume that more retrieved chunks mean better answers. In insurance and banking docs, 3-5 relevant chunks usually beat 20 noisy ones.
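If you assemble the engine yourself, cap the retriever the same way. A minimal sketch, assuming the index built in the fixed example above:

# Sketch: capping retrieval when building a RetrieverQueryEngine manually
from llama_index.core.query_engine import RetrieverQueryEngine

# Keep the retriever itself small instead of relying on downstream trimming
retriever = index.as_retriever(similarity_top_k=4)

query_engine = RetrieverQueryEngine.from_args(
    retriever,
    response_mode="compact",
)

response = query_engine.query("Summarize the policy exclusions")
print(response)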

Other Possible Causes

1) Your document chunks are too large

If your ingestion pipeline creates giant nodes, each retrieval already consumes most of the model window.

from llama_index.core import Settings

# Too large for many models when combined with prompt + question
Settings.chunk_size = 4096
Settings.chunk_overlap = 200

Use smaller chunks unless you have a specific reason not to.

Settings.chunk_size = 512
Settings.chunk_overlap = 50
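
If you would rather not touch the global Settings, you can scope chunking to a single index with a node parser instead. A minimal sketch, assuming a recent llama_index.core where from_documents accepts transformations:

# Sketch: per-index chunking instead of global Settings
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)

# The splitter applies only to this index; Settings stays untouched
index = VectorStoreIndex.from_documents(
    documents,
    transformations=[splitter],
)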

2) Your chat history is being passed in full

This happens with agents and memory-backed chat engines. The conversation grows until the next turn fails.

# BROKEN: unbounded memory growth
chat_engine = index.as_chat_engine(chat_mode="context")
response = chat_engine.chat("What does section 4 mean?")

Fix it by limiting memory or summarizing older turns.

from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=2000)

chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
)

3) You’re using a smaller-context model than you think

A model swap can break a working pipeline. For example, code that fits on an 8k model may fail on a 4k one.

# Example: check your LLM config explicitly
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")  # verify actual context window from provider docs

If you changed providers or models recently, check the new model’s maximum context window before you start reworking your RAG pipeline.
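
A quick way to check is to print the context window the LlamaIndex wrapper assumes for your model. Most wrappers expose it via llm.metadata (attribute names follow the standard LLMMetadata interface, so verify for your provider):

# Sketch: print the context window the LlamaIndex wrapper assumes for this model
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")
print(llm.metadata.context_window)  # compare against your prompt + retrieved context size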

4) Prompt templates are bloated

Sometimes the issue isn’t retrieval; it’s your custom prompt template. Long instructions, examples, and policy text add up fast.

from llama_index.core.prompts import PromptTemplate

template = PromptTemplate("""
You are a compliance assistant.
[... hundreds of lines of instructions ...]
Context:
{context_str}

Question:
{query_str}
""")

Trim instructions and move stable rules into system prompts or separate guardrails outside the retrieval prompt.
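
One way to do that is to keep the retrieval-time QA prompt short and swap it in explicitly. A minimal sketch; the prompt key below is the usual one for the response synthesizer, but check query_engine.get_prompts() to confirm which keys your engine uses:

# Sketch: replace a bloated QA prompt with a short one
from llama_index.core.prompts import PromptTemplate

short_template = PromptTemplate(
    "You are a compliance assistant. Answer only from the context.\n"
    "Context:\n{context_str}\n\n"
    "Question:\n{query_str}\n"
)

# Inspect query_engine.get_prompts() to confirm this key in your setup
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": short_template}
)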

How to Debug It

  1. Check the exact exception text

    • If you see ValueError: Input length exceeds maximum context length, it’s usually prompt assembly.
    • If you see provider errors like BadRequestError, the final request exceeded the model window.
  2. Print what is actually being sent (see the sketch after this list)

    • Log retrieved node lengths.
    • Log prompt size before calling the LLM.
    • In LlamaIndex, inspect retrieved content from your retriever or query engine before synthesis.
  3. Reduce variables one at a time

    • Set similarity_top_k=1.
    • Lower chunk_size.
    • Disable chat memory.
    • Switch to a shorter prompt template.
    • Re-run after each change until it stops failing.
  4. Test against a larger-context model

    • If the error disappears on a larger window, your pipeline is simply too big.
    • That confirms you need smaller chunks, fewer retrieved nodes, or summarized memory.
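
For step 2, here is a minimal sketch of what logging retrieved context can look like, assuming the index from earlier and using tiktoken as an approximate tokenizer:

# Sketch for step 2: measure what retrieval returns before synthesis
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximate; pick the encoding closest to your model

retriever = index.as_retriever(similarity_top_k=4)
nodes = retriever.retrieve("Summarize the policy exclusions")

total = 0
for node_with_score in nodes:
    n_tokens = len(enc.encode(node_with_score.get_content()))
    total += n_tokens
    print(f"{n_tokens:>6} tokens  score={node_with_score.score}")

print(f"total retrieved context: {total} tokens")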

Prevention

  • Keep chunk sizes conservative: start around 512 to 1024 tokens unless you have measured evidence otherwise.
  • Cap retrieval aggressively: similarity_top_k=3 to 5 is usually enough for most enterprise RAG queries.
  • Budget tokens explicitly for:
    • system prompt
    • user question
    • retrieved context
    • chat history

If you treat token budget as a hard constraint instead of an afterthought, this error stops showing up in production.
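
A rough sketch of what an explicit budget check can look like (the window and reserve numbers are illustrative, not recommendations):

# Sketch: a simple token budget check before each LLM call
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximate tokenizer

CONTEXT_WINDOW = 8192     # your model's window (illustrative)
RESPONSE_RESERVE = 1024   # tokens reserved for the model's answer

def within_budget(system_prompt: str, question: str, context: str, history: str) -> bool:
    used = sum(len(enc.encode(part)) for part in (system_prompt, question, context, history))
    return used + RESPONSE_RESERVE <= CONTEXT_WINDOW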


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
