How to Fix 'token limit exceeded in production' in LlamaIndex (Python)
What the error means
A token limit exceeded error in production usually means your LlamaIndex pipeline built a prompt or context window larger than the model can accept. In practice, this shows up when you stuff too many retrieved chunks into a single response synthesizer call, pass oversized chat history into an agent, or use a splitter that creates chunks too large for your downstream model.
The failure often appears as a ValueError from LlamaIndex, or as an upstream OpenAI/Anthropic error after LlamaIndex sends an oversized request. The exact class name depends on where the overflow happens: retrieval, synthesis, or chat memory.
The Most Common Cause
The #1 cause is over-retrieval: you ask LlamaIndex to fetch too many nodes, then concatenate them into a single prompt without enforcing a token budget.
A common broken pattern looks like this:
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)

# 20 chunks with default chunking can easily overflow the context window.
query_engine = index.as_query_engine(
    similarity_top_k=20,
    response_mode="compact",
)

response = query_engine.query("Summarize the policy exceptions.")
print(response)
```
If those 20 chunks are large, the CompactAndRefine synthesizer behind `response_mode="compact"` will try to pack them into the context window, and you’ll hit errors like:
- `ValueError: Token limit exceeded`
- `ValueError: Requested tokens exceed context window`
- `OpenAIError: This model's maximum context length is ...`
- a `llama_index.core.indices.query.schema.QueryBundle` flowing into an oversized synthesis step
The fix is to cap retrieval and chunk size together, not just one side.
| Broken | Fixed |
|---|---|
| `similarity_top_k=20` with default chunking | Lower `top_k` and smaller chunks |
| No token budgeting before synthesis | Use a token-aware response mode |
| Large documents split into huge nodes | Tune splitter settings |
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

docs = SimpleDirectoryReader("./docs").load_data()

# Smaller chunks and a lower top_k keep the combined context within budget.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(docs)
index = VectorStoreIndex(nodes)

query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact",
)

response = query_engine.query("Summarize the policy exceptions.")
print(response)
```
If you need more control, use a retriever + synthesizer flow and explicitly trim what gets passed to the LLM.
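A minimal sketch of that flow, assuming the `index` built in the fixed example above; the cap of four nodes is an illustrative budget, not a LlamaIndex default:

```python
from llama_index.core import get_response_synthesizer
from llama_index.core.response_synthesizers import ResponseMode

# Retrieve more candidates than you intend to send, then trim explicitly.
retriever = index.as_retriever(similarity_top_k=10)
synthesizer = get_response_synthesizer(response_mode=ResponseMode.COMPACT)

query = "Summarize the policy exceptions."
nodes = retriever.retrieve(query)

# Hard cap on what reaches the LLM; tune this to your model's context window.
trimmed = nodes[:4]
response = synthesizer.synthesize(query, nodes=trimmed)
print(response)
```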
Other Possible Causes
1) Your chunk size is too large
If your ingestion pipeline creates giant nodes, every downstream step becomes expensive.
```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=2048, chunk_overlap=200)  # risky
```
Use smaller chunks for production RAG:
```python
splitter = SentenceSplitter(chunk_size=384, chunk_overlap=40)
```
2) Chat memory is growing without bounds
Agents can exceed token limits even when retrieval is fine. This happens when conversation history keeps accumulating in ChatMemoryBuffer or similar memory classes.
```python
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=4000)
```
If you omit limits or keep old turns forever, trim aggressively:
```python
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)
```
For long-running agents, summarize old turns instead of replaying full transcripts.
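One way to do that, assuming a recent llama-index-core release that ships `ChatSummaryMemoryBuffer` (check your installed version before relying on it):

```python
from llama_index.core.memory import ChatSummaryMemoryBuffer
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

# Turns that fall outside the token limit get summarized by the LLM
# instead of being replayed in full or silently dropped.
memory = ChatSummaryMemoryBuffer.from_defaults(
    llm=llm,
    token_limit=1500,
)
```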
3) You are stuffing full documents into prompts
This is common in custom workflows where developers bypass retrieval and manually inject raw text into an LLM call.
```python
# Anti-pattern: the entire document lands in the prompt.
prompt = f"""
Answer using this document:
{full_contract_text}
"""
```
That works in dev with one short file. It breaks in production with real contracts, claims notes, or policy packs.
Fix it by retrieving only relevant sections first:
```python
nodes = retriever.retrieve("policy exceptions")
context = "\n\n".join(node.get_content() for node in nodes[:4])
```
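Then build the prompt from that bounded context instead of the raw document (the question string here is illustrative):

```python
prompt = f"""Answer using only this context:

{context}

Question: What are the policy exceptions?"""
```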
4) Your model context window is smaller than your prompt assumptions
A lot of production bugs come from switching models without adjusting token budgets. A prompt that fits GPT-4-class models may fail on smaller deployments.
```python
llm_kwargs = {
    "model": "gpt-4o-mini",  # smaller budget than you expected
}
```
Check the real context length of the deployed model (see the sketch below) and tune:

- `similarity_top_k`
- chunk size
- memory token limit
- system prompt length
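In LlamaIndex you can read the assumed context window off the LLM's metadata; a quick check (the model name is just an example):

```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

# LLMMetadata exposes the context window LlamaIndex assumes for this model.
print(llm.metadata.context_window)
```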
How to Debug It
1. Print the actual prompt size
   - Log retrieved node counts and approximate token totals before synthesis.
   - If you use custom code, inspect what gets concatenated into the final prompt.
2. Reduce retrieval to isolate the problem
   - Set `similarity_top_k=1`.
   - If the error disappears, your issue is over-retrieval or oversized chunks.
3. Shrink chunking parameters
   - Drop `chunk_size` to 256–512.
   - Re-index and test again.
   - If that fixes it, your ingestion pipeline was creating nodes that were too large.
4. Check memory growth
   - If this happens only after several turns in an agent session, inspect chat history length.
   - Reset memory between requests or add summarization/truncation.
A practical debugging loop looks like this:
```python
retrieved_nodes = retriever.retrieve(query)
print("nodes:", len(retrieved_nodes))

for i, node in enumerate(retrieved_nodes[:3]):
    print(i, len(node.get_content()))
```
If those lengths are large enough to blow your context window once combined with system prompts and tool instructions, you found the bug.
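Character counts are only a rough signal; for a closer estimate, count tokens with `tiktoken` (the `cl100k_base` encoding is a stand-in here; match it to your model):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # proxy encoding; match your model

# Approximate token total across the retrieved nodes from the loop above.
total = sum(len(enc.encode(node.get_content())) for node in retrieved_nodes)
print("approx retrieved tokens:", total)
```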
Prevention
- Keep ingestion and retrieval budgets aligned:
  - Smaller chunks at index time
  - Lower `similarity_top_k` at query time
  - Token-aware response synthesis
- Put hard caps on memory:
  - Use bounded chat buffers
  - Summarize old conversation turns
  - Never replay entire transcripts by default
- Add prompt-size logging in production:
  - Log retrieved node count
  - Log estimated tokens before every LLM call
  - Fail fast before sending an oversized request (see the sketch below)
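A minimal fail-fast guard, assuming a per-model budget you set yourself; `MAX_PROMPT_TOKENS` and the encoding choice are assumptions, not LlamaIndex settings:

```python
import tiktoken

MAX_PROMPT_TOKENS = 8000  # assumption: tune per deployed model
_enc = tiktoken.get_encoding("cl100k_base")  # proxy encoding; match your model

def check_budget(prompt: str) -> None:
    """Raise before sending an oversized request to the LLM."""
    n = len(_enc.encode(prompt))
    if n > MAX_PROMPT_TOKENS:
        raise ValueError(f"prompt is ~{n} tokens; budget is {MAX_PROMPT_TOKENS}")
```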
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.