How to Fix 'token limit exceeded in production' in LlamaIndex (Python)
What the error means
A token limit exceeded error in production usually means your LlamaIndex pipeline built a prompt or context window larger than the model can accept. In practice, this shows up when you stuff too many retrieved chunks into a single response synthesizer call, pass oversized chat history into an agent, or use a splitter that creates chunks too large for your downstream model.
The failure often appears as a ValueError from LlamaIndex, or as an upstream OpenAI/Anthropic error after LlamaIndex sends an oversized request. The exact class name depends on where the overflow happens: retrieval, synthesis, or chat memory.
The Most Common Cause
The #1 cause is over-retrieval: you ask LlamaIndex to fetch too many nodes, then concatenate them into a single prompt without enforcing a token budget.
A common broken pattern looks like this:
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)

# 20 chunks with default chunking can easily overflow the context window.
query_engine = index.as_query_engine(
    similarity_top_k=20,
    response_mode="compact",
)

response = query_engine.query("Summarize the policy exceptions.")
print(response)
```
If those 20 chunks are large, the CompactAndRefine synthesizer behind `response_mode="compact"` will try to pack them into the context window, and you’ll hit errors like:
- `ValueError: Token limit exceeded`
- `ValueError: Requested tokens exceed context window`
- `OpenAIError: This model's maximum context length is ...`
- a `llama_index.core.indices.query.schema.QueryBundle` flowing into an oversized synthesis step
The fix is to cap retrieval and chunk size together, not just one side.
| Broken | Fixed |
|---|---|
| `similarity_top_k=20` with default chunking | Lower `top_k` and smaller chunks |
| No token budgeting before synthesis | Use a token-aware response mode |
| Large documents split into huge nodes | Tune splitter settings |
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

docs = SimpleDirectoryReader("./docs").load_data()

# Smaller chunks and a lower top_k keep the combined context within budget.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(docs)
index = VectorStoreIndex(nodes)

query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact",
)

response = query_engine.query("Summarize the policy exceptions.")
print(response)
```
If you need more control, use a retriever + synthesizer flow and explicitly trim what gets passed to the LLM.
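A minimal sketch of that flow, assuming the `index` built in the fixed example above; the cap of four nodes is an illustrative budget, not a LlamaIndex default:

```python
from llama_index.core import get_response_synthesizer
from llama_index.core.response_synthesizers import ResponseMode

# Retrieve more candidates than you intend to send, then trim explicitly.
retriever = index.as_retriever(similarity_top_k=10)
synthesizer = get_response_synthesizer(response_mode=ResponseMode.COMPACT)

query = "Summarize the policy exceptions."
nodes = retriever.retrieve(query)

# Hard cap on what reaches the LLM; tune this to your model's context window.
trimmed = nodes[:4]
response = synthesizer.synthesize(query, nodes=trimmed)
print(response)
```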
Other Possible Causes
1) Your chunk size is too large
If your ingestion pipeline creates giant nodes, every downstream step becomes expensive.
```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=2048, chunk_overlap=200)  # risky
```
Use smaller chunks for production RAG:
```python
splitter = SentenceSplitter(chunk_size=384, chunk_overlap=40)
```
2) Chat memory is growing without bounds
Agents can exceed token limits even when retrieval is fine. This happens when conversation history keeps accumulating in ChatMemoryBuffer or similar memory classes.
```python
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=4000)
```
If you omit limits or keep old turns forever, trim aggressively:
```python
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)
```
For long-running agents, summarize old turns instead of replaying full transcripts.
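One way to do that, assuming a recent llama-index-core release that ships `ChatSummaryMemoryBuffer` (check your installed version before relying on it):

```python
from llama_index.core.memory import ChatSummaryMemoryBuffer
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

# Turns that fall outside the token limit get summarized by the LLM
# instead of being replayed in full or silently dropped.
memory = ChatSummaryMemoryBuffer.from_defaults(
    llm=llm,
    token_limit=1500,
)
```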
3) You are stuffing full documents into prompts
This is common in custom workflows where developers bypass retrieval and manually inject raw text into an LLM call.
```python
# Anti-pattern: the entire document lands in the prompt.
prompt = f"""
Answer using this document:
{full_contract_text}
"""
```
That works in dev with one short file. It breaks in production with real contracts, claims notes, or policy packs.
Fix it by retrieving only relevant sections first:
```python
nodes = retriever.retrieve("policy exceptions")
context = "\n\n".join(node.get_content() for node in nodes[:4])
```
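Then build the prompt from that bounded context instead of the raw document (the question string here is illustrative):

```python
prompt = f"""Answer using only this context:

{context}

Question: What are the policy exceptions?"""
```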
4) Your model context window is smaller than your prompt assumptions
A lot of production bugs come from switching models without adjusting token budgets. A prompt that fits GPT-4-class models may fail on smaller deployments.
```python
llm_kwargs = {
    "model": "gpt-4o-mini",  # smaller budget than you expected
}
```
Check the real context length of the deployed model (see the sketch below) and tune:

- `similarity_top_k`
- chunk size
- memory token limit
- system prompt length
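In LlamaIndex you can read the assumed context window off the LLM's metadata; a quick check (the model name is just an example):

```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

# LLMMetadata exposes the context window LlamaIndex assumes for this model.
print(llm.metadata.context_window)
```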
How to Debug It
1. Print the actual prompt size
   - Log retrieved node counts and approximate token totals before synthesis.
   - If you use custom code, inspect what gets concatenated into the final prompt.
2. Reduce retrieval to isolate the problem
   - Set `similarity_top_k=1`.
   - If the error disappears, your issue is over-retrieval or oversized chunks.
3. Shrink chunking parameters
   - Drop `chunk_size` to 256–512.
   - Re-index and test again.
   - If that fixes it, your ingestion pipeline was creating nodes that were too large.
4. Check memory growth
   - If this happens only after several turns in an agent session, inspect chat history length.
   - Reset memory between requests or add summarization/truncation.
A practical debugging loop looks like this:
```python
retrieved_nodes = retriever.retrieve(query)
print("nodes:", len(retrieved_nodes))

for i, node in enumerate(retrieved_nodes[:3]):
    print(i, len(node.get_content()))
```
If those lengths are large enough to blow your context window once combined with system prompts and tool instructions, you found the bug.
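Character counts are only a rough signal; for a closer estimate, count tokens with `tiktoken` (the `cl100k_base` encoding is a stand-in here; match it to your model):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # proxy encoding; match your model

# Approximate token total across the retrieved nodes from the loop above.
total = sum(len(enc.encode(node.get_content())) for node in retrieved_nodes)
print("approx retrieved tokens:", total)
```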
Prevention
- Keep ingestion and retrieval budgets aligned:
  - Smaller chunks at index time
  - Lower `similarity_top_k` at query time
  - Token-aware response synthesis
- Put hard caps on memory:
  - Use bounded chat buffers
  - Summarize old conversation turns
  - Never replay entire transcripts by default
- Add prompt-size logging in production:
  - Log retrieved node count
  - Log estimated tokens before every LLM call
  - Fail fast before sending an oversized request (see the sketch below)
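A minimal fail-fast guard, assuming a per-model budget you set yourself; `MAX_PROMPT_TOKENS` and the encoding choice are assumptions, not LlamaIndex settings:

```python
import tiktoken

MAX_PROMPT_TOKENS = 8000  # assumption: tune per deployed model
_enc = tiktoken.get_encoding("cl100k_base")  # proxy encoding; match your model

def check_budget(prompt: str) -> None:
    """Raise before sending an oversized request to the LLM."""
    n = len(_enc.encode(prompt))
    if n > MAX_PROMPT_TOKENS:
        raise ValueError(f"prompt is ~{n} tokens; budget is {MAX_PROMPT_TOKENS}")
```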
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.