How to Fix 'context length exceeded' in LlamaIndex (Python)
What the error means
'context length exceeded' in LlamaIndex usually means you sent more tokens to an LLM than the model can handle in one request. In practice, this shows up when you stuff too many retrieved chunks into a prompt, build a query engine with oversized context, or pass a huge chat history into an agent.
The exact failure often looks like one of these:
- `ValueError: Input length exceeds maximum context length`
- `BadRequestError: This model's maximum context length is ...`
- `context_length_exceeded`
The Most Common Cause
The #1 cause is overloading the prompt with too many retrieved nodes.
A common mistake is using a retriever that returns too many chunks, then feeding all of them into a ResponseSynthesizer or query engine without controlling similarity_top_k, chunk size, or response mode.
| Broken pattern | Fixed pattern |
|---|---|
| Retrieve too much context and send it all to the LLM | Limit retrieval and use compact response synthesis |
```python
# BROKEN: too many nodes get stuffed into the prompt
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(
    similarity_top_k=20,  # too high for long documents
    response_mode="compact",
)
response = query_engine.query("Summarize the policy exclusions")
print(response)
```
```python
# FIXED: reduce retrieved context and control chunking
from llama_index.core import Settings, VectorStoreIndex

Settings.chunk_size = 512
Settings.chunk_overlap = 50

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(
    similarity_top_k=4,
    response_mode="compact",
)
response = query_engine.query("Summarize the policy exclusions")
print(response)
```
If you’re using RetrieverQueryEngine, the same rule applies: don’t assume more retrieved chunks mean better answers. In insurance and banking docs, 3-5 relevant chunks usually beat 20 noisy ones.
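The idea of capping retrieval can be sketched in plain Python, independent of any LlamaIndex API. `cap_chunks` and the `(score, text)` pairs below are hypothetical stand-ins for retrieved nodes; in LlamaIndex itself this is what `similarity_top_k` controls:

```python
# Hypothetical sketch: keep only the k highest-scoring chunks before synthesis.
def cap_chunks(scored_chunks, max_chunks=4):
    """scored_chunks: list of (score, text) pairs from a retriever."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:max_chunks]]

chunks = [(0.91, "exclusion A"), (0.40, "boilerplate"), (0.88, "exclusion B"),
          (0.85, "definitions"), (0.30, "footer"), (0.79, "exclusion C")]
top = cap_chunks(chunks, max_chunks=4)
# top holds the 4 highest-scoring texts, in score order
```

The point is that the cap is applied before prompt assembly, so the low-relevance chunks never reach the model at all.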
Other Possible Causes
1) Your document chunks are too large
If your ingestion pipeline creates giant nodes, each retrieval already consumes most of the model window.
```python
from llama_index.core import Settings

# Too large for many models when combined with prompt + question
Settings.chunk_size = 4096
Settings.chunk_overlap = 200
```
Use smaller chunks unless you have a specific reason not to.
```python
Settings.chunk_size = 512
Settings.chunk_overlap = 50
```
2) Your chat history is being passed in full
This happens with agents and memory-backed chat engines. The conversation grows until the next turn fails.
```python
# BROKEN: unbounded memory growth
chat_engine = index.as_chat_engine(chat_mode="context")
response = chat_engine.chat("What does section 4 mean?")
```
Fix it by limiting memory or summarizing older turns.
```python
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=2000)
chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
)
```
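What a token-limited memory buffer does can be approximated in plain Python. The `trim_history` helper and the ~4 characters per token estimate are illustrative assumptions, not LlamaIndex APIs:

```python
# Hypothetical sketch of token-limited chat memory: keep the most recent
# turns whose rough token total fits the limit, dropping the oldest first.
def trim_history(messages, token_limit=2000):
    kept, used = [], 0
    for msg in reversed(messages):   # walk newest-first
        est = len(msg) // 4          # rough ~4 chars/token estimate
        if used + est > token_limit:
            break
        kept.append(msg)
        used += est
    return list(reversed(kept))      # restore chronological order

history = ["old turn " * 300, "recent question?", "recent answer."]
trimmed = trim_history(history, token_limit=100)
# only the two short recent turns survive the 100-token budget
```

The key design choice is trimming from the oldest end, so the model always sees the most recent turns intact.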
3) You’re using a smaller-context model than you think
A model swap can break a working pipeline. For example, code that fits on an 8k model may fail on a 4k one.
```python
# Example: check your LLM config explicitly
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")  # verify actual context window from provider docs
```
If you changed providers or models recently, inspect the max context window before changing your RAG pipeline.
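A cheap guard is to record the window you believe each model has and check it before sending. The window sizes and the `fits_in_window` helper below are assumptions for illustration; always verify current limits against your provider's documentation:

```python
# Illustrative context windows -- verify against provider docs, these change.
ASSUMED_CONTEXT_WINDOWS = {
    "gpt-4": 8_192,
    "gpt-3.5-turbo": 16_385,
    "gpt-4o-mini": 128_000,
}

def fits_in_window(model, prompt_tokens, completion_tokens=512):
    """Return True if prompt + reserved completion fits the assumed window."""
    window = ASSUMED_CONTEXT_WINDOWS[model]
    return prompt_tokens + completion_tokens <= window

# A pipeline tuned for a large-window model can silently break on a small one:
fits_in_window("gpt-4o-mini", 12_000)  # True
fits_in_window("gpt-4", 12_000)        # False
```

Running this check at startup turns a confusing mid-request provider error into an explicit configuration failure.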
4) Prompt templates are bloated
Sometimes the issue isn’t retrieval; it’s your custom prompt template. Long instructions, examples, and policy text add up fast.
```python
from llama_index.core.prompts import PromptTemplate

template = PromptTemplate("""
You are a compliance assistant.
[... hundreds of lines of instructions ...]

Context:
{context_str}

Question:
{query_str}
""")
```
Trim instructions and move stable rules into system prompts or separate guardrails outside the retrieval prompt.
How to Debug It
- Check the exact exception text
  - If you see `ValueError: Input length exceeds maximum context length`, it’s usually prompt assembly.
  - If you see provider errors like `BadRequestError`, the final request exceeded the model window.
- Print what is actually being sent
  - Log retrieved node lengths.
  - Log prompt size before calling the LLM.
  - In LlamaIndex, inspect retrieved content from your retriever or query engine before synthesis.
- Reduce variables one at a time
  - Set `similarity_top_k=1`.
  - Lower `chunk_size`.
  - Disable chat memory.
  - Switch to a shorter prompt template.
  - Re-run after each change until it stops failing.
- Test against a larger-context model
  - If the error disappears on a larger window, your pipeline is simply too big.
  - That confirms you need smaller chunks, fewer retrieved nodes, or summarized memory.
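For the "print what is actually being sent" step, a rough estimator is enough to spot the oversized part. The helpers below are hypothetical, using the common ~4 characters per token rule of thumb rather than an exact tokenizer (use your provider's tokenizer, e.g. tiktoken for OpenAI, when you need precise counts):

```python
# Rough token estimate for debugging: ~4 chars per token for English text.
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def log_prompt_parts(system: str, context: str, question: str) -> int:
    """Print a per-part size breakdown and return the rough total."""
    total = 0
    for label, part in [("system", system), ("context", context),
                        ("question", question)]:
        est = estimate_tokens(part)
        print(f"{label}: ~{est} tokens")
        total += est
    print(f"total: ~{total} tokens")
    return total
```

Logging the parts separately matters: it tells you immediately whether the system prompt, the retrieved context, or the history is eating the window.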
Prevention
- Keep chunk sizes conservative: start around `512` to `1024` tokens unless you have measured evidence otherwise.
- Cap retrieval aggressively: `similarity_top_k=3` to `5` is usually enough for most enterprise RAG queries.
- Budget tokens explicitly for:
  - system prompt
  - user question
  - retrieved context
  - chat history
If you treat token budget as a hard constraint instead of an afterthought, this error stops showing up in production.
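That budget can be made concrete with simple arithmetic. The numbers below are illustrative, assuming an 8,192-token window and 512-token chunks:

```python
# Illustrative token budget for an assumed 8,192-token context window.
window = 8_192
system_prompt = 400      # instructions + guardrails
user_question = 100
chat_history = 1_500     # capped by the memory buffer
answer_reserve = 1_000   # leave room for the completion itself

context_budget = window - system_prompt - user_question - chat_history - answer_reserve
max_chunks = context_budget // 512  # with 512-token chunks

print(context_budget, max_chunks)  # 5192 tokens of context -> at most 10 chunks
```

Everything that is not retrieved context, including the answer you expect back, has to come out of the same window, which is why the reserve for the completion is part of the subtraction.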
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.