How to Fix 'context length exceeded' During Development in LlamaIndex (Python)
What the error means
ValueError: context length exceeded in LlamaIndex means you tried to send more tokens to the model than the model’s context window allows. It usually shows up during development when you stuff too much text into a prompt, pass an oversized Document, or build a query engine that retrieves too many chunks.
In practice, this is almost always a token budgeting problem, not a “LlamaIndex bug”.
The Most Common Cause
The #1 cause is building prompts or retrieval chains that inject too much raw text into the LLM call.
A common pattern is loading a large document and asking LlamaIndex to summarize or answer with default settings, then hitting an error like:
- ValueError: Requested 16384 tokens, but the model only supports 8192 tokens
- BadRequestError: context length exceeded
- RuntimeError: prompt too long
Wrong pattern vs right pattern
| Broken code | Fixed code |
|---|---|
| Sends too much text in one shot | Splits text into chunks |
| Uses large chunk_size without thinking about output tokens | Uses smaller chunks and controlled retrieval |
| Lets the retriever return too many nodes | Limits retrieved context |
# BROKEN
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
docs = SimpleDirectoryReader("data").load_data()
# This can easily create oversized prompts if docs are large
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(
    llm=OpenAI(model="gpt-4o-mini"),
    similarity_top_k=10,  # too many chunks for some prompts/models
)
response = query_engine.query("Summarize the policy in detail.")
print(response)
# FIXED
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
docs = SimpleDirectoryReader("data").load_data()
# Reduce chunk size so each node fits comfortably in context
Settings.chunk_size = 512
Settings.chunk_overlap = 50
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(
    similarity_top_k=3,  # fewer chunks = less prompt bloat
)
response = query_engine.query("Summarize the policy in detail.")
print(response)
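It also helps to reserve room for the answer itself. Recent llama-index-core versions expose a num_output setting that the prompt packing logic uses to leave output headroom; a small sketch (verify the attribute exists on your installed version):
from llama_index.core import Settings

# Reserve tokens for the model's answer so packed prompts leave headroom.
Settings.num_output = 256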
If you’re using a chat-style workflow, the same issue happens when you keep appending conversation history forever.
# BROKEN: unbounded chat history
chat_history = []
# ...inside the chat loop, every turn is appended and nothing is dropped:
chat_history.append({"role": "user", "content": user_message})
chat_history.append({"role": "assistant", "content": assistant_message})
# eventually this overflows the model context window
Use truncation or summarization before sending history back to the model.
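If you are on LlamaIndex's chat engines, a memory class such as ChatMemoryBuffer with a token_limit handles this for you. If you manage history by hand, a minimal sketch looks like the following (truncate_history and MAX_HISTORY_CHARS are illustrative names, not library APIs, and a real budget should count tokens rather than characters):
MAX_HISTORY_CHARS = 8000  # rough budget; tune for your model's window

def truncate_history(chat_history, max_chars=MAX_HISTORY_CHARS):
    kept, total = [], 0
    for msg in reversed(chat_history):  # walk newest-first
        total += len(msg["content"])
        if total > max_chars:
            break
        kept.append(msg)
    return list(reversed(kept))  # restore chronological order

trimmed_history = truncate_history(chat_history)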
Other Possible Causes
1) Your chunk_size is too large
If your chunk size is huge, each node becomes expensive to include in a prompt.
from llama_index.core import Settings
Settings.chunk_size = 4096 # risky for smaller models
Settings.chunk_overlap = 200
For most RAG setups, start smaller and measure. Bigger chunks are not automatically better.
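A quick back-of-the-envelope check makes the trade-off concrete. The numbers below are illustrative; plug in your own model's window and settings:
# Rough token budget arithmetic (illustrative numbers, not an API):
context_window = 8192   # your model's limit; verify for the exact model
chunk_size = 4096       # approx. tokens per retrieved node
top_k = 3               # nodes retrieved per query
prompt_overhead = 500   # system prompt + question, estimated
max_output = 1024       # room reserved for the answer

needed = top_k * chunk_size + prompt_overhead + max_output
print(f"{needed} tokens needed vs {context_window} available")
# 13812 vs 8192 -> this configuration overflows before the model answers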
2) You’re retrieving too many nodes
Even with sane chunking, similarity_top_k=10 or 20 can blow up your prompt fast.
query_engine = index.as_query_engine(similarity_top_k=12)
Try:
query_engine = index.as_query_engine(similarity_top_k=3)
If you need broad coverage, use reranking or multi-step retrieval instead of dumping everything into one prompt.
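One way to keep coverage without prompt bloat is to retrieve broadly but filter weak matches before they reach the LLM. A sketch using LlamaIndex's SimilarityPostprocessor (the 0.75 cutoff is an assumption to tune per corpus, not a recommendation):
from llama_index.core.postprocessor import SimilarityPostprocessor

# Retrieve broadly, then drop weak matches before they hit the prompt.
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.75)  # tune per corpus
    ],
)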
3) Your prompt template is doing extra damage
A long system prompt plus long instructions plus retrieved context is enough to exceed limits.
from llama_index.core.prompts import PromptTemplate
qa_template = PromptTemplate("""
You are a compliance assistant.
Follow these rules:
1. ...
2. ...
3. ...
[very long policy text here]
Question: {query_str}
Context: {context_str}
Answer:
""")
Shorten instructions and keep policy text out of static prompts if it can be retrieved dynamically.
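If you need to change the retrieval-time instructions, swap in a leaner template rather than growing the static one. A sketch using the standard text-QA prompt slot (the key below is the usual one in recent LlamaIndex versions; confirm it with query_engine.get_prompts() on your install):
from llama_index.core.prompts import PromptTemplate

short_template = PromptTemplate(
    "You are a compliance assistant. Use only the context below.\n"
    "Context: {context_str}\n"
    "Question: {query_str}\n"
    "Answer:"
)
# Replace only the text-QA template; everything else stays default.
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": short_template}
)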
4) You picked a smaller-context model than you think
Local models and cheaper hosted models often have tighter context windows than you expect from GPT-4-class models.
llm = OpenAI(model="gpt-4o-mini") # not the same as using a larger-context model
Check the actual context window for the exact model name you configured. Don’t assume all “GPT” models behave the same.
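You can ask LlamaIndex directly what window it believes the configured model has. A small sketch (cross-check the value against the provider's docs for the exact model name):
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")
# LLMMetadata exposes the context window LlamaIndex will budget against.
print(llm.metadata.context_window)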
How to Debug It
- Print token-heavy inputs before calling the engine. Log your prompt, retrieved nodes, and chat history size (see the token-counting sketch after this list). If you're passing megabytes of text around, that's your problem.
- Reduce similarity_top_k to 1 or 2. If the error disappears, retrieval volume was causing it. Increase slowly until it breaks again.
- Shrink chunk_size and test again. Drop from something like 2048 to 512. Oversized chunks are one of the fastest ways to trigger this error.
- Swap in a larger-context model temporarily. If it works on a larger window but fails on your target model, you've confirmed it's token budget related.
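For hard numbers on what each call consumes, LlamaIndex ships a TokenCountingHandler callback. A minimal sketch (assumes an OpenAI-family tokenizer via the tiktoken package):
import tiktoken
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

# Set the callback manager BEFORE building the index and engine so
# every LLM call gets counted.
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.get_encoding("cl100k_base").encode
)
Settings.callback_manager = CallbackManager([token_counter])

# ...build the index and query_engine as above, then:
response = query_engine.query("Summarize the policy in detail.")
print("prompt tokens:", token_counter.prompt_llm_token_count)
print("completion tokens:", token_counter.completion_llm_token_count)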
A useful debugging trick is to inspect how much text LlamaIndex is actually sending:
retriever = index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve("What does the policy say about claims?")
for node in nodes:
    print(len(node.node.get_content()), node.node.node_id)
That won’t give exact token counts, but it will quickly show whether one node is absurdly large.
Prevention
- Keep chunk sizes conservative unless you have measured evidence that larger chunks help.
- Cap retrieval with similarity_top_k, then add reranking if you need better precision.
- Budget for prompt growth early: system prompt + user question + retrieved context + output tokens all count toward the same limit.
- Add tests that run representative queries against your real documents before shipping changes to chunking or retrieval settings (see the sketch below).
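A minimal regression test along those lines might look like this (tests/fixtures/docs is a hypothetical fixture directory; assumes an API key is configured for the default LLM):
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex

def test_representative_query_stays_in_context():
    Settings.chunk_size = 512
    Settings.chunk_overlap = 50
    docs = SimpleDirectoryReader("tests/fixtures/docs").load_data()
    index = VectorStoreIndex.from_documents(docs)
    engine = index.as_query_engine(similarity_top_k=3)
    # If chunking or retrieval settings regress, this query raises a
    # context-length error and the test fails loudly.
    response = engine.query("Summarize the policy in detail.")
    assert response.response  # a non-empty answer came back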
If you build RAG systems in LlamaIndex like this from day one, “context length exceeded” becomes a configuration issue you catch early instead of a production surprise.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.