How to Fix 'token limit exceeded during development' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: token-limit-exceeded-during-development, llamaindex, python

When you see a “token limit exceeded” error during development in LlamaIndex, it usually means your prompt, retrieved context, or chat history is too large for the model’s context window. In practice, this shows up during indexing, retrieval, or agent/tool calls when LlamaIndex tries to pack too much text into a single LLM request.

The fix is usually not “use a bigger model” first. It’s almost always about reducing what you send to the model, controlling chunk sizes, or changing how LlamaIndex assembles context.

The Most Common Cause

The #1 cause is oversized retrieval context: too many chunks, chunks that are too large, or a query engine that stuffs every relevant node into one prompt.

A common failure looks like this:

# BROKEN
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

query_engine = index.as_query_engine(similarity_top_k=10)
response = query_engine.query("Summarize the policy exclusions and claim limits.")
print(response)

This breaks because similarity_top_k=10 can pull in a lot of text, and the default response mode may try to stuff it into one prompt. You’ll often see errors like:

  • ValueError: Token limit exceeded
  • Input too long for model
  • This model's maximum context length is ... tokens

The fix is to reduce retrieval scope and make chunking smaller and more predictable:

# FIXED
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

docs = SimpleDirectoryReader("data").load_data()

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(docs)

index = VectorStoreIndex(nodes)

query_engine = index.as_query_engine(
    similarity_top_k=3,
    response_mode="compact",
)

response = query_engine.query("Summarize the policy exclusions and claim limits.")
print(response)

What changed:

  • chunk_size is smaller
  • similarity_top_k is lower
  • response_mode="compact" reduces prompt bloat

If you’re using ResponseSynthesizer, CompactAndRefine, or a chat engine, the same principle applies: stop stuffing everything into one call.
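
If you build the pipeline from lower-level parts, the same knobs are available explicitly. A minimal sketch, assuming the index from the fixed example above and the current llama_index.core imports:

from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine

# Retrieve only a few nodes, and let the synthesizer pack them compactly.
retriever = index.as_retriever(similarity_top_k=3)
synthesizer = get_response_synthesizer(response_mode="compact")

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer,
)
response = query_engine.query("Summarize the policy exclusions and claim limits.")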

Other Possible Causes

1. Chat memory is growing without bounds

If you use an agent or chat engine and keep appending messages, history can blow past the context window.

# BROKEN
chat_engine = index.as_chat_engine(chat_mode="condense_plus_context")
for msg in user_messages:
    chat_engine.chat(msg)

Fix it by limiting memory or trimming older turns:

# FIXED
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=2000)
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    memory=memory,
)

2. Your document chunks are too large

If you ingest long PDFs or legal docs with a huge chunk size, each retrieved node may already be near the model limit.

# BROKEN
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=4096, chunk_overlap=200)

Use smaller chunks for retrieval-heavy apps:

# FIXED
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)

For insurance and banking docs, I usually start around 300–800 tokens per chunk and adjust based on retrieval quality.

3. You are passing raw documents directly into prompts

This happens when developers manually concatenate full documents instead of letting LlamaIndex retrieve selectively.

# BROKEN
full_text = "\n\n".join([doc.text for doc in docs])
prompt = f"Answer using this content:\n{full_text}\n\nQuestion: {question}"

Instead, retrieve only relevant nodes:

# FIXED
query_engine = index.as_query_engine(similarity_top_k=3)
answer = query_engine.query(question)

If you must build prompts manually, truncate aggressively and summarize first.
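
If a hand-built prompt is unavoidable, retrieve first and cap each snippet. A rough sketch, assuming the index and question from above; the 1,000-character cap is an arbitrary budget, not a recommendation:

# Retrieve a few relevant nodes instead of concatenating whole documents.
retriever = index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve(question)

# Truncate each snippet to a rough character budget before building the prompt.
snippets = [n.node.get_content()[:1000] for n in nodes]

prompt = (
    "Answer using this content:\n\n"
    + "\n\n".join(snippets)
    + f"\n\nQuestion: {question}"
)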

4. The wrong response mode is stuffing too much context

Some response modes are more token-hungry than others. tree_summarize or refine can still exceed limits if your retrieval set is large.

# CONFIG THAT CAN BLOW UP TOKENS
query_engine = index.as_query_engine(
    similarity_top_k=8,
    response_mode="tree_summarize",
)

Try a tighter mode first:

query_engine = index.as_query_engine(
    similarity_top_k=3,
    response_mode="compact",
)

If you need multi-step synthesis, use smaller retrieval sets per step.
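
One way to do that is to answer narrower sub-questions with a small top_k each, then merge the short answers in a final call. A sketch, assuming the index from above and that Settings.llm points at your configured model (the sub-questions are illustrative):

from llama_index.core import Settings

sub_questions = [
    "What exclusions does the policy list?",
    "What are the claim limits?",
]

# Each step retrieves only a couple of nodes, so no single call is huge.
step_engine = index.as_query_engine(similarity_top_k=2, response_mode="compact")
partial_answers = [str(step_engine.query(q)) for q in sub_questions]

# The final call only sees the short partial answers, not the raw chunks.
final = Settings.llm.complete(
    "Combine these partial answers into one summary:\n\n"
    + "\n\n".join(partial_answers)
)
print(final.text)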

How to Debug It

  1. Check where the exception is thrown

    • If it happens during ingestion, your chunks are probably too large.
    • If it happens during querying, look at retrieval count and response mode.
    • If it happens in chat/agent flows, inspect memory growth.
  2. Print token estimates for retrieved context (see the token-counting sketch after this list)

    • Log how many nodes are being returned.
    • Inspect node text lengths before they hit the prompt.
    • If one query pulls back thousands of words, that’s your problem.
  3. Reduce variables one at a time

    • Set similarity_top_k=1
    • Switch to response_mode="compact"
    • Lower chunk_size
    • Disable memory temporarily

    If the error disappears after one change, you found the source.

  4. Inspect the actual prompt assembly

    • Use callbacks or logging to see what LlamaIndex sends to the LLM.
    • Look for repeated system prompts, duplicated context, or full document dumps.
    • In agent setups, tool outputs can silently add a lot of tokens.
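
For steps 2 and 4, the token-counting callback in llama_index.core makes the numbers concrete. A minimal sketch, assuming tiktoken’s cl100k_base encoding roughly matches your model’s tokenizer; register the callback manager before building the index and query engine so they pick it up:

import tiktoken
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

# Register the counter globally before constructing indexes / query engines.
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.get_encoding("cl100k_base").encode
)
Settings.callback_manager = CallbackManager([token_counter])

# ... build index and query_engine as above, then run a query ...
response = query_engine.query("Summarize the policy exclusions and claim limits.")

# Prompt vs. completion tokens across the LLM calls made so far.
print("prompt tokens:", token_counter.prompt_llm_token_count)
print("completion tokens:", token_counter.completion_llm_token_count)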

Prevention

  • Keep ingestion chunks small enough for retrieval tasks:
    • Start with chunk_size=512 and tune from there.
  • Cap retrieval aggressively:
    • Use low similarity_top_k unless you have a strong reason not to.
  • Add token-aware memory:
    • For chat agents, always set a bounded memory buffer.
  • Prefer summarize-then-answer workflows:
    • Don’t pass raw long-form documents straight into prompts (a sketch follows below).
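
For the last point, one pattern is to summarize each long document once at ingestion, index the short summaries, and answer over those. A rough sketch, assuming Settings.llm is your configured model, docs are the loaded documents, and each document still fits the model window (split first if it doesn’t):

from llama_index.core import Document, Settings, VectorStoreIndex

# Summarize each long document once, up front.
summaries = [
    Document(
        text=Settings.llm.complete(
            "Summarize the key exclusions, limits, and obligations:\n\n" + doc.text
        ).text
    )
    for doc in docs
]

# Queries now run over short summaries instead of raw long-form text.
summary_index = VectorStoreIndex.from_documents(summaries)
answer = summary_index.as_query_engine(similarity_top_k=3).query(
    "Summarize the policy exclusions and claim limits."
)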

If you’re building on LlamaIndex in production, treat token budget as an input constraint, not an afterthought. Most “token limit exceeded” issues disappear once you control chunking, retrieval depth, and memory growth.


By Cyprian Aarons, AI Consultant at Topiax.