How to Fix 'token limit exceeded when scaling' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21

When you see ValueError: token limit exceeded when scaling in LlamaIndex, it usually means one of your index or retrieval settings is trying to pack too much text into a model call. This shows up most often during ingestion, query synthesis, or when a retriever returns too many large chunks at once.

In practice, this is not a “LlamaIndex is broken” problem. It’s usually a chunking, retrieval, or prompt-size issue that only appears once your data grows past the toy dataset stage.

The Most Common Cause

The #1 cause is feeding oversized chunks into an index or retriever, then asking LlamaIndex to scale them into a prompt that exceeds the token budget.

A common broken pattern is using large chunk_size values and then retrieving too many nodes:

# Broken: too much text per node, then too many nodes retrieved
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("data").load_data()

splitter = SentenceSplitter(chunk_size=4096, chunk_overlap=200)
nodes = splitter.get_nodes_from_documents(documents)

index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine(similarity_top_k=10)

response = query_engine.query(
    "Summarize the policy exclusions and claim limits."
)
print(response)

Fixed (smaller chunks + bounded retrieval):

# Fixed: smaller chunks and tighter retrieval
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("data").load_data()

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)

index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine(similarity_top_k=3)

response = query_engine.query(
    "Summarize the policy exclusions and claim limits."
)
print(response)

Why this works:

  • Smaller chunks reduce per-node token size.
  • Lower similarity_top_k reduces how many chunks get stuffed into the prompt.
  • The combined context stays under the model’s limit instead of exploding during scaling.
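
Rough arithmetic makes the gap concrete (SentenceSplitter's chunk_size is counted in tokens by default): the broken settings can pull roughly 4096 × 10 ≈ 41,000 tokens of retrieved context into a single call, while the fixed settings pull about 512 × 3 ≈ 1,500, leaving ample room for instructions and the answer itself.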

Other Possible Causes

1. Your prompt template is bloated

If you are using PromptTemplate, ChatPromptTemplate, or a custom system prompt with long instructions, you can burn most of the context window before retrieval even starts.

from llama_index.core.prompts import PromptTemplate

prompt = PromptTemplate(
    "You are an assistant.\n" + ("Rules:\n" * 1000) + "{context_str}\nQuestion: {query_str}"
)

Keep prompts short and move policy detail into documents or structured tools.
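
A trimmed template might look like the sketch below. The wording is illustrative, index is the one built earlier, and text_qa_template is the usual keyword for handing a QA prompt to the default query engine.

from llama_index.core.prompts import PromptTemplate

# A few lines of instruction; retrieval supplies the detail
qa_prompt = PromptTemplate(
    "You are a claims policy assistant. Answer only from the context.\n"
    "Context:\n{context_str}\n"
    "Question: {query_str}\n"
    "Answer: "
)

# index is the VectorStoreIndex built in the earlier example
query_engine = index.as_query_engine(
    text_qa_template=qa_prompt,
    similarity_top_k=3,
)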

2. Recursive retrieval is multiplying context

Query engines like SubQuestionQueryEngine or multi-step agents can fan out into multiple subqueries. Each step adds more retrieved text, and the final synthesis step may hit:

  • ValueError: token limit exceeded when scaling
  • RuntimeError: maximum context length exceeded

Example:

from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)

If each tool returns large context, cap tool output or reduce per-tool similarity_top_k.
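
One way to bound the fan-out is to give each tool its own tightly capped query engine. A sketch assuming two corpora; the directory paths, tool names, and descriptions are illustrative.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# One index per corpus; the paths are assumptions
policy_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("data/policies").load_data()
)
claims_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("data/claims").load_data()
)

# Cap retrieval per tool so the final synthesis step stays within budget
tools = [
    QueryEngineTool(
        query_engine=policy_index.as_query_engine(similarity_top_k=2),
        metadata=ToolMetadata(
            name="policy_docs",
            description="Policy exclusions and claim limits",
        ),
    ),
    QueryEngineTool(
        query_engine=claims_index.as_query_engine(similarity_top_k=2),
        metadata=ToolMetadata(
            name="claims_docs",
            description="Historical claims decisions",
        ),
    ),
]

query_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)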

3. Metadata is being included in every node

If your documents carry huge metadata fields, LlamaIndex may include them in node serialization or prompt construction depending on your pipeline.

# Bad: stuffing full JSON blobs into metadata
doc.metadata["raw_payload"] = huge_json_blob

Instead:

# Better: keep metadata small and indexed
doc.metadata["source"] = "claims_policy_2024"
doc.metadata["page"] = 12

Large metadata belongs in object storage or a database, not in every indexed node.
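
A minimal sketch of the pointer pattern, using a local JSON file to stand in for object storage:

import json
from pathlib import Path

from llama_index.core import Document

huge_json_blob = {"claims": ["..."]}  # stand-in for a large payload
doc = Document(text="Policy text ...")

# Store the blob outside the index; any blob or key-value store works
payload_path = Path("payloads") / f"{doc.doc_id}.json"
payload_path.parent.mkdir(exist_ok=True)
payload_path.write_text(json.dumps(huge_json_blob))

# The node carries only a lightweight pointer plus search-relevant fields
doc.metadata["payload_path"] = str(payload_path)
doc.metadata["source"] = "claims_policy_2024"
doc.metadata["page"] = 12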

4. You are using a model with a smaller context window than you think

This happens when the embedding model is fine but the LLM used for synthesis has a much smaller context limit than expected.

from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")  # smaller effective context than newer models

If your workload needs more room, switch to a larger-context model and still keep retrieval bounded.
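
One way to wire that up, with gpt-4o standing in for whatever larger-context model you use; retrieval stays bounded either way:

from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI

# A larger-context model for synthesis
Settings.llm = OpenAI(model="gpt-4o")

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)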

How to Debug It

  1. Print your chunk sizes

    • Check how big each node is before indexing.
    • If you see multi-thousand-token chunks, that’s your first problem (see the token-count sketch after this list).
  2. Lower retrieval depth

    • Set similarity_top_k=1 or 2.
    • If the error disappears, you were overfilling the prompt with retrieved nodes.
  3. Inspect the final prompt size

    • Log the assembled prompt if you are using custom query engines or response synthesizers.
    • Look for repeated instructions, duplicated context, or giant metadata blocks.
  4. Test with a larger-context model

    • Swap in a known larger-context LLM.
    • If it works there but fails on your current model, you’ve confirmed a context-window mismatch.
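
For step 1, something like this works. A sketch assuming tiktoken is installed; nodes is the list produced by the SentenceSplitter example above.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Measure every node before it reaches the index
sizes = [len(enc.encode(node.get_content())) for node in nodes]
print(f"nodes: {len(sizes)}  max: {max(sizes)}  mean: {sum(sizes) / len(sizes):.0f}")

# Flag anything suspiciously large
for node in nodes:
    n_tokens = len(enc.encode(node.get_content()))
    if n_tokens > 1000:
        print(node.node_id, n_tokens)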

Prevention

  • Use conservative defaults:
    • chunk_size=256 to 512
    • similarity_top_k=2 to 4
  • Keep prompts short and deterministic.
  • Treat metadata as identifiers, not document storage.
  • Add a regression test that runs one representative long query against real production-sized docs (see the sketch below).

A good rule: if your retrieval pipeline depends on “just let it grab more text,” it will fail later under real data volume. Control chunking first, then control top-k, then control synthesis length.
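
The regression test from the Prevention list can stay small. A sketch for pytest; the fixture path is an assumption, and the corpus should mirror production-sized documents.

# test_token_budget.py (run with pytest)
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

def test_long_query_stays_under_budget():
    documents = SimpleDirectoryReader("tests/fixtures/docs").load_data()
    splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
    nodes = splitter.get_nodes_from_documents(documents)

    index = VectorStoreIndex(nodes)
    query_engine = index.as_query_engine(similarity_top_k=3)

    # Must complete without a token-limit error and return a non-empty answer
    response = query_engine.query(
        "Summarize the policy exclusions and claim limits across all documents."
    )
    assert response.response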

