How to Fix 'context length exceeded when scaling' in LlamaIndex (Python)
When you see ValueError: context length exceeded or a variant like Token limit exceeded while scaling a LlamaIndex pipeline, it means you’re sending more text to the model than its context window can hold. This usually shows up when your index grows, your chunks are too large, or you start stuffing too many retrieved nodes into a single prompt.
In practice, this happens during query time more often than ingestion. The usual pattern is: it works on a small dataset, then breaks once retrieval returns more nodes or your prompt template gets longer.
The Most Common Cause
The #1 cause is building a prompt that includes too much retrieved content. In LlamaIndex, this usually happens with ResponseMode.COMPACT, ResponseMode.TREE_SUMMARIZE, or custom prompts that concatenate large chunks without controlling token budget.
Here’s the broken pattern:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.response_synthesizers import ResponseMode
docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(
    response_mode=ResponseMode.COMPACT,  # packs many node texts into each prompt
    similarity_top_k=20,                 # 20 nodes can easily blow the window
)
response = query_engine.query("Summarize the policy changes")
print(response)
And here’s the fixed version:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.response_synthesizers import ResponseMode
docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(
    response_mode=ResponseMode.SIMPLE_SUMMARIZE,  # single call, truncates context to fit
    similarity_top_k=5,                           # fewer nodes per prompt
)
response = query_engine.query("Summarize the policy changes")
print(response)
What changed:
- similarity_top_k went from 20 to 5
- COMPACT was replaced with SIMPLE_SUMMARIZE
- fewer nodes are stuffed into the prompt
If you need higher recall, don’t just increase top_k. Use reranking or a two-step retrieval flow instead of dumping everything into one completion.
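A lightweight way to see the two-step idea without committing to a specific reranker library: retrieve broadly, then keep only the highest-scoring nodes that still fit a token budget. This is a plain-Python sketch, not a LlamaIndex API; the function names and the ~4-characters-per-token heuristic are illustrative assumptions.

```python
# Sketch: budget-aware selection after a broad retrieval pass.

def rough_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token (assumption; use a real tokenizer).
    return max(1, len(text) // 4)

def select_within_budget(scored_nodes, token_budget: int):
    """Keep the highest-scoring (score, text) pairs that fit the budget."""
    chosen, used = [], 0
    for score, text in sorted(scored_nodes, key=lambda p: p[0], reverse=True):
        cost = rough_tokens(text)
        if used + cost > token_budget:
            continue  # this node would overflow the budget; skip it
        chosen.append((score, text))
        used += cost
    return chosen

# Three ~100-token candidates plus one huge one; only the three fit 300 tokens.
candidates = [(0.9, "a" * 400), (0.8, "b" * 400), (0.7, "c" * 400), (0.2, "d" * 4000)]
kept = select_within_budget(candidates, token_budget=300)
print(len(kept))  # 3
```

The point is that recall comes from the broad first pass, while the prompt only ever sees what the budget allows.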
Other Possible Causes
1) Chunk size is too large
If your chunks are huge, each retrieved node eats most of the context window.
from llama_index.core.node_parser import SentenceSplitter
parser = SentenceSplitter(chunk_size=4096, chunk_overlap=200)
Use something closer to:
from llama_index.core.node_parser import SentenceSplitter
parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
Large chunks are especially bad when combined with long system prompts or multi-turn chat history.
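Quick arithmetic shows why. Assuming an 8k-token window and reserving ~1,500 tokens for instructions, the question, and the answer (both numbers are illustrative):

```python
# How many retrieved chunks fit, as a function of chunk size?
CONTEXT_WINDOW = 8192   # assumed model window (illustrative)
RESERVED = 1500         # instructions + question + output tokens (assumed)

def chunks_that_fit(chunk_size_tokens: int) -> int:
    return (CONTEXT_WINDOW - RESERVED) // chunk_size_tokens

print(chunks_that_fit(4096))  # 1 -> similarity_top_k=2 already overflows
print(chunks_that_fit(512))   # 13 -> plenty of headroom for top_k=5
```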
2) You are passing chat history into every query
A common mistake in agent loops is appending full conversation history on every call.
# Broken: keeps growing forever
chat_history.append(user_msg)
chat_history.append(assistant_msg)
response = query_engine.query(
    f"History: {chat_history}\n\nQuestion: {user_question}"
)
Fix it by truncating history or summarizing older turns:
recent_history = chat_history[-6:]  # keep only the last six messages
response = query_engine.query(
    f"Recent history: {recent_history}\n\nQuestion: {user_question}"
)
If you’re using a ChatEngine, make sure its memory has a bounded token limit (for example, ChatMemoryBuffer.from_defaults(token_limit=1500)).
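A fixed `[-6:]` slice can still overflow if individual turns are long. A token-budget version of the same idea, sketched in plain Python with a rough 4-chars-per-token heuristic (an assumption, not a real tokenizer):

```python
def rough_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token (swap in a real tokenizer).
    return max(1, len(text) // 4)

def truncate_history(turns, token_budget: int):
    """Keep the newest turns that fit the budget; drop the oldest first."""
    kept, used = [], 0
    for turn in reversed(turns):          # walk newest -> oldest
        cost = rough_tokens(turn)
        if used + cost > token_budget:
            break                         # everything older is dropped too
        kept.append(turn)
        used += cost
    return list(reversed(kept))           # restore chronological order

history = ["old " * 200, "older question", "recent answer", "latest question"]
print(truncate_history(history, token_budget=50))
# ['older question', 'recent answer', 'latest question']
```

Unlike a turn-count slice, this degrades gracefully: one very long turn can no longer drag the whole prompt over the limit.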
3) Your prompt template is too verbose
Long instructions plus retrieved text can push you over the edge.
from llama_index.core.prompts import PromptTemplate
qa_prompt = PromptTemplate("""
You are an expert compliance assistant.
Follow all bank policy rules.
Do not miss any detail.
Explain everything in full.
Use complete citations.
Here is the context:
{context_str}
Question: {query_str}
""")
Trim it down:
qa_prompt = PromptTemplate("""
Answer using only the provided context.
If the answer is missing, say so.
Context:
{context_str}
Question: {query_str}
""")
Shorter prompts matter more than people expect.
4) You are using a smaller model than your index/query setup assumes
A model with an 8k context window will fail if your retrieval and prompt assembly assume 16k or 32k.
Example mismatch:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-3.5-turbo")  # smaller context window
If your workload needs more room:
Settings.llm = OpenAI(model="gpt-4o-mini") # larger practical budget depending on provider config
Also check that your chunk size and similarity_top_k were tuned for that model’s actual limit, not for a larger one.
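One way to catch the mismatch early is a guard that checks the assembled prompt against the model’s window before the call is made. The model names and window sizes below are illustrative placeholders; always confirm real limits in your provider’s documentation.

```python
# Assumed window sizes for illustration only -- verify against provider docs.
CONTEXT_WINDOWS = {
    "small-model": 8_192,
    "large-model": 128_000,
}

def check_budget(model: str, prompt_tokens: int, max_output_tokens: int) -> None:
    """Raise before the provider does, with a message that names the numbers."""
    window = CONTEXT_WINDOWS.get(model)
    if window is None:
        raise ValueError(f"unknown model: {model}")
    if prompt_tokens + max_output_tokens > window:
        raise ValueError(
            f"{prompt_tokens} prompt + {max_output_tokens} output tokens "
            f"exceed {model}'s {window}-token window"
        )

check_budget("large-model", 10_000, 1_000)  # fits: no exception
```

Failing fast like this turns an opaque provider error into a message that tells you exactly which budget to shrink.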
How to Debug It
- Print token sizes before calling the LLM. Log how many tokens are in:
  - the system prompt
  - the user prompt
  - the retrieved nodes
  - the chat history
- Reduce one variable at a time:
  - set similarity_top_k=1
  - switch to ResponseMode.SIMPLE_SUMMARIZE
  - remove chat history
  - shrink the chunk size
- Inspect retrieved nodes:
  - dump node text lengths and metadata
  - look for giant PDFs, OCR noise, duplicated sections, or tables that explode the token count
- Check the exact exception path:
  - common messages include ValueError: Requested tokens exceed context window, context length exceeded, and provider-specific errors from the OpenAI / Anthropic wrappers
  - if it fails inside synthesis, it’s usually retrieval + prompt size
  - if it fails during ingestion, it’s usually chunking or parsing
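The first debugging step can be as simple as a per-component token log printed before the LLM call. This sketch uses a rough chars/4 estimate (an assumption); swap in tiktoken or your provider’s tokenizer for exact counts.

```python
def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # heuristic only; use a real tokenizer in prod

# Fill these with your actual prompt components before the call.
components = {
    "system_prompt": "You are an expert compliance assistant.",
    "user_prompt": "Summarize the policy changes",
    "retrieved_nodes": "policy text " * 500,
    "chat_history": "",
}

total = 0
for name, text in components.items():
    n = rough_tokens(text)
    total += n
    print(f"{name:>15}: ~{n} tokens")
print(f"{'total':>15}: ~{total} tokens")
```

Seeing the breakdown usually makes the culprit obvious: in most pipelines, retrieved_nodes dwarfs everything else.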
Prevention
- Keep chunks small enough for retrieval to stay cheap: start with chunk_size=512 and adjust from there.
- Cap retrieval aggressively: use a lower similarity_top_k, and add rerankers instead of brute-force stuffing more nodes into the prompt.
- Treat prompt size as a budget: reserve space for instructions, chat history, and output tokens before adding context.
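The budget idea in concrete numbers (all values here are illustrative assumptions): reserve instructions, history, and output first, and let whatever is left determine the maximum safe top_k.

```python
CONTEXT_WINDOW = 8192   # assumed model window
INSTRUCTIONS = 300      # prompt template tokens (assumed)
HISTORY = 800           # bounded chat memory (assumed)
OUTPUT = 1024           # max answer tokens (assumed)
CHUNK = 512             # tokens per retrieved node

context_budget = CONTEXT_WINDOW - (INSTRUCTIONS + HISTORY + OUTPUT)
max_top_k = context_budget // CHUNK
print(f"context budget: {context_budget} tokens -> top_k <= {max_top_k}")
```

Deriving top_k from the budget, rather than picking it by feel, is what keeps the pipeline stable as prompts and history grow.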
By Cyprian Aarons, AI Consultant at Topiax.