How to Fix 'token limit exceeded' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: token-limit-exceeded, llamaindex, python

If you’re seeing `ValueError: Token limit exceeded` in LlamaIndex, it means one of the objects you’re passing into the LLM pipeline is too large for the model’s context window. This usually happens during retrieval, query synthesis, or summarization, or when you stuff too many documents into a single prompt.

In practice, this error shows up when you send raw text chunks that are too big, retrieve too many nodes at once, or build an index/query engine without controlling chunk size and context size.

The Most Common Cause

The #1 cause is simple: you’re passing too much text into a single LLM call. In LlamaIndex, this often happens when using `ResponseMode.COMPACT`, `ResponseMode.SIMPLE_SUMMARIZE`, or a prompt template that concatenates too many retrieved nodes.

Here’s the broken pattern:

Broken code:

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

query_engine = index.as_query_engine(response_mode="compact")
response = query_engine.query("Summarize all customer complaints")
print(response)
```

Fixed code:

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.response_synthesizers import ResponseMode

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

query_engine = index.as_query_engine(
    response_mode=ResponseMode.REFINE,
    similarity_top_k=3,
)
response = query_engine.query("Summarize the main customer complaints")
print(response)
```


Why this works:

- `compact` tries to fit as much retrieved context as possible into one prompt.
- `refine` processes nodes incrementally, which is safer for larger corpora.
- Lowering `similarity_top_k` reduces how many chunks get stuffed into the prompt.
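
If the question genuinely calls for a corpus-wide summary, `ResponseMode.TREE_SUMMARIZE` is another option worth trying: it merges per-chunk answers recursively instead of packing everything into one prompt. A minimal sketch, reusing the `index` built above:

```python
from llama_index.core.response_synthesizers import ResponseMode

# Tree summarization builds the answer bottom-up over the retrieved
# nodes, so no single LLM call has to hold all of the context at once.
query_engine = index.as_query_engine(
    response_mode=ResponseMode.TREE_SUMMARIZE,
    similarity_top_k=3,
)
response = query_engine.query("Summarize the main customer complaints")
```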

If you’re using a chat model directly through LlamaIndex, the underlying failure often looks like this:

```text
ValueError: Token limit exceeded: total tokens (input + output) exceed model context window
```

Other Possible Causes

1. Chunk size is too large during ingestion

If your document chunks are huge, every retrieval returns oversized text blocks.

```python
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter

# 2048-token chunks add up fast once several are retrieved per query.
Settings.node_parser = SentenceSplitter(chunk_size=2048, chunk_overlap=200)
```

For most production RAG systems, start smaller:

```python
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
```
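
To confirm the splitter is producing chunks of the size you expect, you can run it directly over your documents and inspect the node sizes. A quick sketch, reusing the `docs` loaded earlier (character counts are a rough proxy; use a tokenizer for exact numbers):

```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(docs)

# Spot oversized chunks before they ever reach a prompt.
sizes = [len(node.get_content()) for node in nodes]
print(f"{len(nodes)} nodes, largest chunk: {max(sizes)} chars")
```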

2. You’re retrieving too many nodes

Even with reasonable chunk sizes, asking for too many results can overflow context.

```python
query_engine = index.as_query_engine(similarity_top_k=10)
```

Try reducing it:

```python
query_engine = index.as_query_engine(similarity_top_k=3)
```

If precision matters more than recall, keep it low and rerank later.
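
One way to do that in LlamaIndex is a node postprocessor that drops weak matches before synthesis. A sketch using the built-in `SimilarityPostprocessor` (the 0.75 cutoff is an assumption; tune it for your embedding model):

```python
from llama_index.core.postprocessor import SimilarityPostprocessor

# Retrieve a few extra candidates, then discard anything below the
# similarity cutoff so only strong matches consume prompt tokens.
query_engine = index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.75)],
)
```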

3. Your prompt template is too verbose

Custom prompts can silently eat your token budget before any retrieved context is added.

```python
from llama_index.core.prompts import PromptTemplate

qa_prompt = PromptTemplate("""
You are a highly detailed assistant.
Please analyze every possible implication.
Use the following context carefully and comprehensively:
{context_str}
Question: {query_str}
Answer:
""")
```

Trim it down:

```python
qa_prompt = PromptTemplate("""
Context:
{context_str}

Question:
{query_str}

Answer concisely:
""")
```

4. Your output token budget is too high

Some models fail because input tokens plus expected output tokens exceed the limit.

If you’re configuring an OpenAI-compatible LLM through LlamaIndex:

```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini", max_tokens=4096)
```

Lower it if your prompts are large:

llm = OpenAI(model="gpt-4o-mini", max_tokens=512)

Also check whether your model has a smaller context window than you assumed.
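
A quick sanity check is to compare the model's advertised context window against the output budget you reserved; LlamaIndex exposes both through `llm.metadata`:

```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini", max_tokens=512)

# Whatever is left after reserving output tokens is the budget for the
# system prompt, template, question, and retrieved context combined.
input_budget = llm.metadata.context_window - llm.metadata.num_output
print(f"Input budget: {input_budget} tokens")
```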

How to Debug It

1. Print token counts before calling the query engine
   - Use a tokenizer or inspect chunk sizes; see the token-counting sketch after this list.
   - If one chunk is massive, fix ingestion first.
2. Reduce retrieval scope
   - Set `similarity_top_k=1`.
   - If the error disappears, retrieval size was the problem.
3. Switch response mode
   - Try `ResponseMode.REFINE` instead of `COMPACT`.
   - If refine works, your synthesis step was overstuffing context.
4. Log the exact prompt being sent
   - Inspect custom templates and system messages.
   - Large instructions often cause the overflow before documents even enter the prompt.
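
For step 1, LlamaIndex's built-in `TokenCountingHandler` reports exactly how many tokens each LLM call consumed. A minimal sketch, assuming a recent `tiktoken` that recognizes the model name:

```python
import tiktoken
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

# Attach the counter globally *before* building the index and query
# engine so every LLM call gets recorded.
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4o-mini").encode
)
Settings.callback_manager = CallbackManager([token_counter])

# ...build the index and query engine, run a query, then inspect:
print("Prompt tokens:", token_counter.prompt_llm_token_count)
print("Completion tokens:", token_counter.completion_llm_token_count)
```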

A practical debugging sequence looks like this:

```python
query_engine = index.as_query_engine(
    similarity_top_k=1,
    response_mode="refine",
)

response = query_engine.query("What are the key issues?")
print(response)
```

If that works, add complexity back one step at a time until it breaks again.

Prevention

- Keep ingestion chunks small: start with `chunk_size=512` and tune from there.
- Use lower retrieval counts by default: `similarity_top_k=3` is usually safer than 10.
- Prefer `REFINE` or tree-style summarization over stuffing everything into one prompt.
- Set explicit token budgets in your LLM config instead of relying on defaults.
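
Putting those defaults together, a conservative starting configuration might look like this (the numbers are starting points to tune, not hard rules):

```python
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

# Small chunks, few retrieved nodes, capped output: each stage stays
# well inside the context window.
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
Settings.llm = OpenAI(model="gpt-4o-mini", max_tokens=512)

query_engine = index.as_query_engine(
    similarity_top_k=3,
    response_mode="refine",
)
```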

The real fix is not to reach for a bigger model first. It’s to control how much text enters each stage of your LlamaIndex pipeline so you never hit the context window in the first place.

