How to Fix 'token limit exceeded when scaling' in LlamaIndex (Python)
When you see `ValueError: token limit exceeded when scaling` in LlamaIndex, it usually means one of your index or retrieval settings is trying to pack too much text into a model call. This shows up most often during ingestion, query synthesis, or when a retriever returns too many large chunks at once.
In practice, this is not a “LlamaIndex is broken” problem. It’s usually a chunking, retrieval, or prompt-size issue that only appears once your data grows past the toy dataset stage.
The Most Common Cause
The #1 cause is feeding oversized chunks into an index or retriever, then asking LlamaIndex to scale them into a prompt that exceeds the token budget.
A common broken pattern is using large chunk_size values and then retrieving too many nodes:
| Broken | Fixed |
|---|---|
| Large chunks + high top-k | Smaller chunks + bounded retrieval |
```python
# Broken: too much text per node, then too many nodes retrieved
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("data").load_data()
splitter = SentenceSplitter(chunk_size=4096, chunk_overlap=200)
nodes = splitter.get_nodes_from_documents(documents)

index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine(similarity_top_k=10)

response = query_engine.query(
    "Summarize the policy exclusions and claim limits."
)
print(response)
```
```python
# Fixed: smaller chunks and tighter retrieval
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("data").load_data()
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)

index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine(similarity_top_k=3)

response = query_engine.query(
    "Summarize the policy exclusions and claim limits."
)
print(response)
```
Why this works:
- Smaller chunks reduce per-node token size.
- A lower `similarity_top_k` reduces how many chunks get stuffed into the prompt.
- The combined context stays under the model’s limit instead of exploding during scaling.
Other Possible Causes
1. Your prompt template is bloated
If you are using PromptTemplate, ChatPromptTemplate, or a custom system prompt with long instructions, you can burn most of the context window before retrieval even starts.
```python
from llama_index.core.prompts import PromptTemplate

# Bloated: thousands of tokens of fixed instructions before any context is retrieved
prompt = PromptTemplate(
    "You are an assistant.\n" + ("Rules:\n" * 1000) + "{context_str}\nQuestion: {query_str}"
)
```
Keep prompts short and move policy detail into documents or structured tools.
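A leaner version of the same template keeps the instructions to a sentence or two and lets retrieved context fill the rest of the window. This is a minimal sketch: it assumes an `index` built as in the examples above, and that the `text_qa_template` keyword is forwarded to the response synthesizer, as it is in recent LlamaIndex versions.

```python
from llama_index.core.prompts import PromptTemplate

# Lean template: short instructions, so context and the question get the token budget
qa_template = PromptTemplate(
    "You are an insurance policy assistant. Answer only from the context below.\n"
    "Context:\n{context_str}\n"
    "Question: {query_str}\n"
)
query_engine = index.as_query_engine(text_qa_template=qa_template, similarity_top_k=3)
```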
2. Recursive retrieval is multiplying context
Query engines like SubQuestionQueryEngine or multi-step agents can fan out into multiple subqueries. Each step adds more retrieved text, and the final synthesis step may hit:
- `ValueError: token limit exceeded when scaling`
- `RuntimeError: maximum context length exceeded`
Example:
```python
from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
```
If each tool returns large context, cap tool output or reduce per-tool similarity_top_k.
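One way to keep the fan-out bounded is to give each tool its own tightly capped query engine. A sketch, assuming two `VectorStoreIndex` objects (`policy_index` and `claims_index`) built the same way as the index above:

```python
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine

# Each sub-engine keeps its own retrieval bounded, so subquestions
# cannot multiply the synthesized context past the token budget.
tools = [
    QueryEngineTool(
        query_engine=policy_index.as_query_engine(similarity_top_k=2),
        metadata=ToolMetadata(name="policy", description="Policy wording and exclusions"),
    ),
    QueryEngineTool(
        query_engine=claims_index.as_query_engine(similarity_top_k=2),
        metadata=ToolMetadata(name="claims", description="Claim limits and history"),
    ),
]
query_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
```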
3. Metadata is being included in every node
If your documents carry huge metadata fields, LlamaIndex may include them in node serialization or prompt construction depending on your pipeline.
```python
# Bad: stuffing full JSON blobs into metadata
doc.metadata["raw_payload"] = huge_json_blob
```
Instead:
```python
# Better: keep metadata small and indexed
doc.metadata["source"] = "claims_policy_2024"
doc.metadata["page"] = 12
```
Large metadata belongs in object storage or a database, not in every indexed node.
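If a field has to stay on the node (for filtering, say), recent LlamaIndex versions also let you exclude specific metadata keys from what the LLM and embedding model see. A minimal sketch; the `payload_ref` key is a hypothetical pointer to external storage:

```python
from llama_index.core import Document

doc = Document(
    text="Policy wording goes here...",
    metadata={
        "source": "claims_policy_2024",
        "page": 12,
        "payload_ref": "s3://bucket/claims/123.json",
    },
)
# Keep the key on the node, but keep it out of prompts and embeddings
doc.excluded_llm_metadata_keys = ["payload_ref"]
doc.excluded_embed_metadata_keys = ["payload_ref"]
```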
4. You are using a model with a smaller context window than you think
This happens when the embedding model is fine but the LLM used for synthesis has a much smaller context limit than expected.
```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")  # smaller effective context than newer models
```
If your workload needs more room, switch to a larger-context model and still keep retrieval bounded.
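A sketch of the swap, assuming an `index` built as above; the model name here is only an example, so use whichever larger-context model you actually have access to:

```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Larger-context model for synthesis; retrieval stays bounded regardless
Settings.llm = OpenAI(model="gpt-4o")
query_engine = index.as_query_engine(similarity_top_k=3)
```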
How to Debug It
- Print your chunk sizes
  - Check how big each node is before indexing (see the first sketch after this list).
  - If you see multi-thousand-token chunks, that’s your first problem.
- Lower retrieval depth
  - Set `similarity_top_k=1` or `2`.
  - If the error disappears, you were overfilling the prompt with retrieved nodes.
- Inspect the final prompt size
  - Log the assembled prompt if you are using custom query engines or response synthesizers, or count prompt tokens with a callback (see the second sketch after this list).
  - Look for repeated instructions, duplicated context, or giant metadata blocks.
- Test with a larger-context model
  - Swap in a known larger-context LLM.
  - If it works there but fails on your current model, you’ve confirmed a context-window mismatch.
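For the first step, here is a quick way to see node sizes before you index anything. This is a sketch; it uses `tiktoken` for token counts, which is an extra dependency rather than something LlamaIndex requires.

```python
import tiktoken
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Count tokens per node before building the index
enc = tiktoken.get_encoding("cl100k_base")
documents = SimpleDirectoryReader("data").load_data()
nodes = SentenceSplitter(chunk_size=512, chunk_overlap=50).get_nodes_from_documents(documents)

sizes = [len(enc.encode(node.get_content())) for node in nodes]
print(f"{len(nodes)} nodes, max {max(sizes)} tokens, avg {sum(sizes) / len(sizes):.0f} tokens")
```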
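For the prompt-size step, LlamaIndex’s `TokenCountingHandler` callback can report how many prompt tokens each LLM call actually used. A minimal sketch, assuming an `index` built as in the earlier examples:

```python
import tiktoken
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

# Attach a token-counting callback so every LLM call is measured
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.get_encoding("cl100k_base").encode
)
Settings.callback_manager = CallbackManager([token_counter])

query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("Summarize the policy exclusions and claim limits.")
print("prompt tokens:", token_counter.prompt_llm_token_count)
```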
Prevention
- Use conservative defaults:
  - `chunk_size=256` to `512`
  - `similarity_top_k=2` to `4`
- Keep prompts short and deterministic.
- Treat metadata as identifiers, not document storage.
- Add a regression test that runs one representative long query against real production-sized docs (a sketch follows this list).
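A minimal pytest-style sketch of that regression test. `build_query_engine()` is a hypothetical factory that wires up the same chunking and retrieval settings you run in production:

```python
def test_long_query_stays_under_token_limit():
    # Hypothetical factory mirroring production chunk_size / similarity_top_k settings
    query_engine = build_query_engine()
    response = query_engine.query(
        "Summarize the policy exclusions and claim limits across all 2024 documents."
    )
    # If chunking or retrieval regresses, this raises a token-limit error instead
    assert response.response
```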
A good rule: if your retrieval pipeline depends on “just let it grab more text,” it will fail later under real data volume. Control chunking first, then control top-k, then control synthesis length.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.