How to Fix 'context length exceeded when scaling' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: context-length-exceeded-when-scaling, llamaindex, python

When you see ValueError: context length exceeded or a variant like Token limit exceeded while scaling a LlamaIndex pipeline, it means you’re sending more text to the model than its context window can hold. This usually shows up when your index grows, your chunks are too large, or you start stuffing too many retrieved nodes into a single prompt.

In practice, this happens at query time more often than during ingestion. The usual pattern: everything works on a small dataset, then breaks once retrieval returns more nodes or your prompt template gets longer.

The Most Common Cause

The #1 cause is building a prompt that includes too much retrieved content. In LlamaIndex, this usually happens with ResponseMode.COMPACT, ResponseMode.TREE_SUMMARIZE, or custom prompts that concatenate large chunks without controlling token budget.

Here’s the broken pattern:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.response_synthesizers import ResponseMode

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

query_engine = index.as_query_engine(
    response_mode=ResponseMode.COMPACT,
    similarity_top_k=20,  # 20 retrieved nodes can easily blow past the context window
)

response = query_engine.query("Summarize the policy changes")
print(response)

And here’s the fixed version:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.response_synthesizers import ResponseMode

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

query_engine = index.as_query_engine(
    response_mode=ResponseMode.SIMPLE_SUMMARIZE,
    similarity_top_k=5,
)

response = query_engine.query("Summarize the policy changes")
print(response)

What changed:

  • similarity_top_k went from 20 to 5
  • COMPACT was replaced with SIMPLE_SUMMARIZE
  • fewer nodes are stuffed into the prompt

If you need higher recall, don’t just increase top_k. Use reranking or a two-step retrieval flow instead of dumping everything into one completion.
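
Here’s a minimal sketch of that reranking approach. It assumes the sentence-transformers extra is installed and that your LlamaIndex version ships the SentenceTransformerRerank postprocessor; the cross-encoder model name is just one common choice:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.postprocessor import SentenceTransformerRerank

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

# Retrieve broadly, then rerank down to a handful of nodes before synthesis
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=5,
)

query_engine = index.as_query_engine(
    similarity_top_k=20,             # wide first-pass retrieval
    node_postprocessors=[reranker],  # only 5 nodes actually reach the prompt
)

response = query_engine.query("Summarize the policy changes")
print(response)

You keep the recall of a wide retrieval pass, but the prompt only ever sees the reranked top 5.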

Other Possible Causes

1) Chunk size is too large

If your chunks are huge, each retrieved node eats most of the context window.

from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(chunk_size=4096, chunk_overlap=200)

Use something closer to:

from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
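
To make sure the smaller chunks are actually used at ingestion, pass the splitter into the index build. This is a minimal sketch; passing it via transformations is one option, setting it globally on Settings is another:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)

docs = SimpleDirectoryReader("data").load_data()

# Chunking happens at index build time, so the splitter has to be part of ingestion
index = VectorStoreIndex.from_documents(docs, transformations=[parser])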

Large chunks are especially bad when combined with long system prompts or multi-turn chat history.

2) You are passing chat history into every query

A common mistake in agent loops is appending full conversation history on every call.

# Broken: keeps growing forever
chat_history.append(user_msg)
chat_history.append(assistant_msg)

response = query_engine.query(
    f"History: {chat_history}\n\nQuestion: {user_question}"
)

Fix it by truncating history or summarizing older turns:

recent_history = chat_history[-6:]

response = query_engine.query(
    f"Recent history: {recent_history}\n\nQuestion: {user_question}"
)

If you’re using ChatEngine, make sure memory has a bounded token limit.
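
A minimal sketch of that, assuming the context chat mode and a token-capped ChatMemoryBuffer:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.memory import ChatMemoryBuffer

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

# Memory is capped by tokens, so old turns get dropped instead of overflowing the prompt
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)

chat_engine = index.as_chat_engine(chat_mode="context", memory=memory)
response = chat_engine.chat("Summarize the policy changes")
print(response)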

3) Your prompt template is too verbose

Long instructions plus retrieved text can push you over the edge.

from llama_index.core.prompts import PromptTemplate

qa_prompt = PromptTemplate("""
You are an expert compliance assistant.
Follow all bank policy rules.
Do not miss any detail.
Explain everything in full.
Use complete citations.
Here is the context:
{context_str}
Question: {query_str}
""")

Trim it down:

qa_prompt = PromptTemplate("""
Answer using only the provided context.
If the answer is missing, say so.
Context:
{context_str}
Question: {query_str}
""")

Shorter prompts matter more than people expect.

4) You are using a smaller model than your index/query setup assumes

A model with an 8k context window will fail if your retrieval and prompt assembly assume 16k or 32k.

Example mismatch:

Settings.llm = OpenAI(model="gpt-3.5-turbo")  # smaller context window

If your workload needs more room:

Settings.llm = OpenAI(model="gpt-4o-mini")  # larger practical budget depending on provider config

Also check whether your embedding/retrieval flow is optimized for that model’s actual limit.
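
One quick sanity check is to print what the configured LLM reports for its own limits and size your chunking and top_k against that. A short sketch, assuming the OpenAI integration is installed:

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o-mini")

# LLMMetadata exposes the window the rest of the pipeline should budget against
meta = Settings.llm.metadata
print("context window:", meta.context_window)
print("reserved output tokens:", meta.num_output)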

How to Debug It

  1. Print token sizes before calling the LLM (a sketch covering steps 1 and 3 follows this list)

    • Log how many tokens are in:
      • system prompt
      • user prompt
      • retrieved nodes
      • chat history
  2. Reduce one variable at a time

    • Set similarity_top_k=1
    • Switch to ResponseMode.SIMPLE_SUMMARIZE
    • Remove chat history
    • Shrink chunk size
  3. Inspect retrieved nodes

    • Dump node text lengths and metadata
    • Look for giant PDFs, OCR noise, duplicated sections, or tables exploding token count
  4. Check the exact exception path

    • Common messages include:
      • ValueError: Requested tokens exceed context window
      • context length exceeded
      • provider-specific errors from OpenAI / Anthropic wrappers
    • If it fails inside synthesis, it’s usually retrieval + prompt size.
    • If it fails during ingestion, it’s usually chunking or parsing.
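
Here is a rough sketch of steps 1 and 3, assuming tiktoken is installed and that an OpenAI-style encoding is close enough for budgeting:

import tiktoken
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

enc = tiktoken.get_encoding("cl100k_base")  # assumption: OpenAI-style tokenizer

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

# Retrieve without synthesizing so you can see exactly what would hit the prompt
retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve("Summarize the policy changes")

for n in nodes:
    text = n.node.get_content()
    print(f"id={n.node.node_id} score={n.score} tokens={count_tokens(text)}")

print("total retrieved tokens:", sum(count_tokens(n.node.get_content()) for n in nodes))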

Prevention

  • Keep chunks small enough for retrieval to stay cheap:
    • start with chunk_size=512 and adjust from there
  • Cap retrieval aggressively:
    • use lower similarity_top_k
    • add rerankers instead of brute-force stuffing more nodes
  • Treat prompt size as a budget:
    • reserve space for instructions, chat history, and output tokens before adding context (see the budget sketch below)
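
As a back-of-the-envelope budget (the numbers below are illustrative, not LlamaIndex defaults):

context_window = 8192   # model limit (illustrative)
instructions = 300      # system prompt + template
history = 800           # truncated chat history
output_reserve = 1024   # room for the completion

context_budget = context_window - instructions - history - output_reserve
tokens_per_node = 512   # roughly one chunk at chunk_size=512

max_top_k = context_budget // tokens_per_node
print(context_budget, max_top_k)  # 6068 tokens of room -> top_k of at most 11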

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

