# How to Fix 'token limit exceeded' in LlamaIndex (Python)
If you’re seeing `ValueError: Token limit exceeded` in LlamaIndex, it means the text you’re sending to the model (prompt plus reserved output) is larger than its context window. This usually happens during retrieval, query synthesis, or summarization, or when you stuff too many documents into a single prompt.
In practice, this error shows up when you send raw text chunks that are too big, retrieve too many nodes at once, or build an index/query engine without controlling chunk size and context size.
## The Most Common Cause
The #1 cause is simple: you’re passing too much text into a single LLM call. In LlamaIndex, this often happens when using ResponseMode.COMPACT, ResponseMode.SIMPLE_SUMMARIZE, or a prompt template that concatenates too many retrieved nodes.
Here’s the broken pattern:
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

query_engine = index.as_query_engine(response_mode="compact")
response = query_engine.query("Summarize all customer complaints")
print(response)
```

And here's the fixed version:

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.response_synthesizers import ResponseMode

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

query_engine = index.as_query_engine(
    response_mode=ResponseMode.REFINE,
    similarity_top_k=3,
)
response = query_engine.query("Summarize the main customer complaints")
print(response)
```
Why this works:
- `compact` tries to fit as much retrieved context as possible into one prompt.
- `refine` processes nodes incrementally, which is safer for larger corpora.
- Lowering `similarity_top_k` reduces how many chunks get stuffed into the prompt.
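To see why refine scales better, here is a plain-Python sketch of the two strategies. The `llm_call` stand-in and the prompt wording are illustrative assumptions, not LlamaIndex's actual synthesizer internals:

```python
def llm_call(prompt: str) -> str:
    # Stand-in for a real LLM call; just reports the prompt size here.
    return f"<answer based on {len(prompt)} chars of prompt>"

def compact_synthesize(chunks, question):
    # COMPACT-style: concatenate as much context as possible into ONE prompt.
    # With large chunks, this single prompt can blow the context window.
    prompt = "\n\n".join(chunks) + f"\n\nQuestion: {question}"
    return llm_call(prompt)

def refine_synthesize(chunks, question):
    # REFINE-style: one chunk per call, carrying the running answer forward.
    # Each prompt stays small, so large corpora are handled incrementally.
    answer = ""
    for chunk in chunks:
        prompt = f"Existing answer: {answer}\nNew context: {chunk}\nQuestion: {question}"
        answer = llm_call(prompt)
    return answer

chunks = ["chunk one " * 100, "chunk two " * 100, "chunk three " * 100]
print(compact_synthesize(chunks, "Summarize."))  # one huge prompt
print(refine_synthesize(chunks, "Summarize."))   # several small prompts
```

The key point: compact's prompt grows with the corpus, while refine's prompt size stays roughly constant per call.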
If you’re using a chat model directly through LlamaIndex, the underlying failure often looks like this:
```text
ValueError: Token limit exceeded: total tokens (input + output) exceed model context window
```
## Other Possible Causes
### 1. Chunk size is too large during ingestion

If your document chunks are huge, every retrieval returns oversized text blocks.

```python
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter

Settings.node_parser = SentenceSplitter(chunk_size=2048, chunk_overlap=200)
```

For most production RAG systems, start smaller:

```python
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
```
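For intuition, the sliding-window idea behind chunking can be sketched in a few lines. This is character-based for simplicity, an assumption; the real `SentenceSplitter` counts tokens and prefers sentence boundaries:

```python
def split_with_overlap(text: str, chunk_size: int = 512, chunk_overlap: int = 50):
    # Sliding window: each chunk starts (chunk_size - chunk_overlap)
    # characters after the previous one, so neighbors share some text.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "All work and no play makes a dull retrieval pipeline. " * 40
for chunk in split_with_overlap(doc, chunk_size=512, chunk_overlap=50):
    print(len(chunk))
```

The overlap keeps sentences that straddle a boundary recoverable from at least one chunk, at the cost of some duplicated text in the index.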
### 2. You’re retrieving too many nodes

Even with reasonable chunk sizes, asking for too many results can overflow context.

```python
query_engine = index.as_query_engine(similarity_top_k=10)
```

Try reducing it:

```python
query_engine = index.as_query_engine(similarity_top_k=3)
```
If precision matters more than recall, keep it low and rerank later.
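That retrieve-wide-then-rerank pattern can be sketched as follows. The lexical scorer here is a stand-in assumption; in practice you would use a real reranker (LlamaIndex exposes these as node postprocessors):

```python
def rerank_and_trim(candidates, query_terms, keep=3):
    """candidates: list of (text, vector_score) from a wide retrieval pass."""
    def lexical_score(text):
        # Cheap stand-in signal: count query-term occurrences.
        # A real reranker would use a cross-encoder or LLM-based scorer.
        return sum(text.lower().count(term) for term in query_terms)

    # Rescore the wide candidate set, then keep only the best few,
    # so the LLM prompt stays small without sacrificing recall upstream.
    rescored = sorted(candidates, key=lambda c: lexical_score(c[0]), reverse=True)
    return [text for text, _ in rescored[:keep]]

candidates = [
    ("refund policy for damaged goods", 0.81),
    ("shipping times in december", 0.79),
    ("refund denied after thirty days", 0.77),
    ("gift wrapping options", 0.76),
]
print(rerank_and_trim(candidates, ["refund"], keep=2))
```

Only the `keep` survivors enter the prompt, so token cost is bounded no matter how wide the first retrieval pass is.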
### 3. Your prompt template is too verbose

Custom prompts can silently eat your token budget before any retrieved context is added.

```python
from llama_index.core.prompts import PromptTemplate

qa_prompt = PromptTemplate("""
You are a highly detailed assistant.
Please analyze every possible implication.
Use the following context carefully and comprehensively:
{context_str}
Question: {query_str}
Answer:
""")
```

Trim it down:

```python
qa_prompt = PromptTemplate("""
Context:
{context_str}

Question:
{query_str}

Answer concisely:
""")
```
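To compare templates, you can estimate how many tokens each leaves for retrieved context. The chars/4 conversion below is a rough assumption; use the model's tokenizer for real counts:

```python
VERBOSE = ("You are a highly detailed assistant. Please analyze every possible "
           "implication. Use the following context carefully and comprehensively:\n"
           "{context_str}\nQuestion: {query_str}\nAnswer:")
TERSE = "Context:\n{context_str}\n\nQuestion:\n{query_str}\n\nAnswer concisely:"

def context_budget(template, context_window=8192, max_output=512, query_tokens=50):
    # Tokens left for {context_str} once fixed instructions, the query,
    # and the reserved output are accounted for (chars/4 heuristic).
    fixed = template.replace("{context_str}", "").replace("{query_str}", "")
    overhead = len(fixed) // 4
    return context_window - overhead - query_tokens - max_output

print(context_budget(VERBOSE), context_budget(TERSE))
```

Every token of fixed instruction is a token of retrieved context you can no longer send, so the terse template always leaves a larger budget.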
### 4. Your output token budget is too high

Some models fail because input tokens plus expected output tokens exceed the limit. If you’re configuring an OpenAI-compatible LLM through LlamaIndex:

```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini", max_tokens=4096)
```

Lower it if your prompts are large:

```python
llm = OpenAI(model="gpt-4o-mini", max_tokens=512)
```

Also check whether your model has a smaller context window than you assumed.
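A small guard like the sketch below (the names are hypothetical, not a LlamaIndex API) can pick a `max_tokens` value that fits the remaining headroom instead of relying on a fixed default:

```python
def safe_max_tokens(context_window: int, input_tokens: int,
                    cap: int = 1024, floor: int = 128) -> int:
    # Reserve whatever output budget fits after the input,
    # clipped to at most `cap` tokens.
    headroom = context_window - input_tokens
    if headroom < floor:
        # Fail loudly instead of silently truncating the answer.
        raise ValueError(f"input uses {input_tokens} of {context_window} tokens; "
                         f"not enough room left for a useful output")
    return min(cap, headroom)

print(safe_max_tokens(context_window=8192, input_tokens=7500))  # 692
```

Compute `input_tokens` with the model's tokenizer before the call, and pass the result as `max_tokens` in your LLM config.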
## How to Debug It
- Print token counts before calling the query engine. Use a tokenizer or inspect chunk sizes; if one chunk is massive, fix ingestion first.
- Reduce retrieval scope. Set `similarity_top_k=1`; if the error disappears, retrieval size was the problem.
- Switch response mode. Try `ResponseMode.REFINE` instead of `COMPACT`; if refine works, your synthesis step was overstuffing context.
- Log the exact prompt being sent. Inspect custom templates and system messages; large instructions often cause the overflow before documents even enter the prompt.
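The first debugging step, printing token counts, can be approximated without extra dependencies using a chars-per-token heuristic. The ~4 chars/token ratio is an assumption for English prose; exact counts need the model's tokenizer (e.g. tiktoken for OpenAI models):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def check_budget(chunks, template_overhead=200, max_output=512, context_window=8192):
    # Total = fixed prompt overhead + retrieved context + reserved output.
    total = template_overhead + max_output + sum(estimate_tokens(c) for c in chunks)
    return total, total <= context_window

total, ok = check_budget(["retrieved passage " * 400] * 5)
print(total, ok)  # 9712 False -- five 1800-token chunks overflow an 8k window
```

Run this on the exact chunks your retriever returns; if `ok` is already `False` before the call, you know the overflow is not the model's fault.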
A practical debugging sequence looks like this:
```python
query_engine = index.as_query_engine(
    similarity_top_k=1,
    response_mode="refine",
)
response = query_engine.query("What are the key issues?")
print(response)
```
If that works, add complexity back one step at a time until it breaks again.
## Prevention
- Keep ingestion chunks small: start with `chunk_size=512` and tune from there.
- Use lower retrieval counts by default: `similarity_top_k=3` is usually safer than 10.
- Prefer `REFINE` or compact summaries over stuffing everything into one prompt.
- Set explicit token budgets in your LLM config instead of relying on defaults.
The real fix is not “make the model bigger” first. It’s controlling how much text enters each stage of your LlamaIndex pipeline so you never hit the context window in the first place.
## Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.