How to Fix 'context length exceeded when scaling' in LlamaIndex (TypeScript)
When you see context length exceeded when scaling in LlamaIndex TypeScript, it usually means your retrieval pipeline is stuffing too much text into the prompt for the selected model. In practice, this shows up when you query a large index, use a high similarityTopK, or combine long documents with a chat model that has a smaller context window.
The key point: this is not a LlamaIndex bug most of the time. It’s usually a prompt assembly problem somewhere between your retriever, node parser, and LLM context limits.
The Most Common Cause
The #1 cause is over-retrieval: you are asking LlamaIndex to pass too many chunks into the final synthesis prompt.
This happens a lot with VectorStoreIndex + asQueryEngine() + a large similarityTopK. The retriever returns too many nodes, and the response synthesizer tries to pack them all into one completion request. Once the combined prompt exceeds the model limit, you get errors like:
- `Error: context length exceeded`
- `BadRequestError: This model's maximum context length is ...`
- `context length exceeded when scaling`
Wrong pattern vs right pattern
Broken code:

```ts
import { VectorStoreIndex } from "llamaindex";

const index = await VectorStoreIndex.fromDocuments(docs);
const queryEngine = index.asQueryEngine({ similarityTopK: 20 });
const response = await queryEngine.query({
  query: "Summarize the policy exclusions",
});
console.log(response.toString());
```

Fixed code:

```ts
import { VectorStoreIndex } from "llamaindex";

const index = await VectorStoreIndex.fromDocuments(docs);
const queryEngine = index.asQueryEngine({ similarityTopK: 4 });
const response = await queryEngine.query({
  query: "Summarize the policy exclusions",
});
console.log(response.toString());
```
If you still need broader recall, don’t just keep increasing `similarityTopK`. Use a two-step approach:
- retrieve more nodes
- rerank or compact them before synthesis
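The "compact" step can be sketched in plain TypeScript. The `ScoredNode` shape and the character budget below are assumptions for illustration; in a real pipeline the nodes would come from `retriever.retrieve()` and the budget would be derived from your model's context window:

```ts
// Illustrative node shape: mirrors the score + text pair a retriever returns.
interface ScoredNode {
  score: number;
  text: string;
}

// Keep the highest-scoring nodes until a character budget is exhausted,
// so the synthesis prompt stays bounded no matter how many nodes come back.
function compactNodes(nodes: ScoredNode[], charBudget: number): ScoredNode[] {
  const sorted = [...nodes].sort((a, b) => b.score - a.score);
  const kept: ScoredNode[] = [];
  let used = 0;
  for (const node of sorted) {
    if (used + node.text.length > charBudget) continue;
    kept.push(node);
    used += node.text.length;
  }
  return kept;
}

const nodes: ScoredNode[] = [
  { score: 0.91, text: "a".repeat(400) },
  { score: 0.85, text: "b".repeat(900) },
  { score: 0.62, text: "c".repeat(400) },
];
console.log(compactNodes(nodes, 1000).map((n) => n.score)); // [0.91, 0.62]
```

Note that a lower-scoring node can still make the cut if a higher-scoring one was too big to fit; that trade-off is deliberate here, since the goal is a full but bounded prompt.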
For example, prefer smaller chunks and tighter retrieval:
```ts
import { VectorStoreIndex, SentenceSplitter } from "llamaindex";
const splitter = new SentenceSplitter({
chunkSize: 512,
chunkOverlap: 50,
});
const index = await VectorStoreIndex.fromDocuments(docs, {
transformations: [splitter],
});
const queryEngine = index.asQueryEngine({
similarityTopK: 4,
});
```
Other Possible Causes
1) Your chunks are too large
If each node is huge, even `similarityTopK: 3` can blow up the prompt.
```ts
// Too large
new SentenceSplitter({
  chunkSize: 2048,
  chunkOverlap: 200,
});
```
Use smaller chunks for retrieval-heavy workloads:
```ts
// Better for QA and extraction
new SentenceSplitter({
  chunkSize: 400,
  chunkOverlap: 40,
});
```
2) You are using a small-context model
Some models have tight limits. If your app was working on one model and failing on another, check the actual context window.
```ts
import { OpenAI } from "llamaindex";

const llm = new OpenAI({
  model: "gpt-4o-mini",
});
```
If your prompts are large, switch to a larger-context model or reduce retrieved text. The same query engine can fail simply because the target model changed.
3) Your chat history is being included in every request
If you wrap retrieval inside an agent or chat loop, the accumulated conversation can consume most of the context before retrieval even starts.
```ts
// Risky if chat history grows without trimming
const response = await chatEngine.chat({
  message: userMessage,
});
```
Use memory trimming or summary memory patterns instead of letting raw history grow forever.
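The trimming idea can be sketched without any framework. The message shape below is a plain illustrative type, and the character budget stands in for a token budget; LlamaIndex's own memory classes are more capable, but this shows the mechanic:

```ts
// Illustrative chat message shape.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Always keep system messages; of the conversation turns, keep only the
// most recent ones that fit within the budget.
function trimHistory(messages: ChatMessage[], charBudget: number): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const turns = messages.filter((m) => m.role !== "system");
  const kept: ChatMessage[] = [];
  let used = 0;
  for (let i = turns.length - 1; i >= 0; i--) {
    if (used + turns[i].content.length > charBudget) break;
    kept.unshift(turns[i]);
    used += turns[i].content.length;
  }
  return [...system, ...kept];
}

const history: ChatMessage[] = [
  { role: "system", content: "You answer policy questions." },
  { role: "user", content: "What does the policy cover?" },
  { role: "assistant", content: "It covers fire and theft." },
  { role: "user", content: "What about flooding?" },
];
console.log(trimHistory(history, 60).map((m) => m.role));
```

Summary memory is the other common pattern: instead of dropping old turns, you replace them with a model-generated summary so long-range context survives in compressed form.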
4) You are using verbose prompts or metadata-heavy nodes
Long metadata fields like full JSON payloads, source documents, or audit trails get injected into prompts fast.
```ts
// Bad idea for retrieval nodes
node.metadata = {
  rawPayload: JSON.stringify(hugeRecord),
};
```
Keep metadata minimal:
```ts
node.metadata = {
  sourceId: record.id,
  pageNumber: record.page,
};
```
How to Debug It
- Check your retriever settings first
  - Inspect `similarityTopK`, reranker settings, and any recursive retrieval behavior.
  - If you're above 5, lower it immediately and retest.
- Log chunk sizes before indexing
  - Print the average token/character length per document chunk.
  - If chunks are massive, fix splitting before touching anything else.
- Inspect the exact prompt size
  - Dump retrieved node text lengths and metadata.
  - You want to know whether the overflow comes from content, history, or system instructions.
- Swap in a larger-context model temporarily
  - If the error disappears, your pipeline is probably valid but too large for the current model.
  - If it still fails, your prompt assembly is likely pathological.
A practical test looks like this:
```ts
const nodes = await retriever.retrieve("Summarize the policy exclusions");

console.log(
  nodes.map((n) => ({
    score: n.score,
    chars: n.node.getContent().length,
    metadataKeys: Object.keys(n.node.metadata ?? {}),
  }))
);
```
If those numbers are large, you found your bottleneck.
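You can turn those logged numbers into a pass/fail check. The 4-characters-per-token ratio is a rough English-text heuristic, and the budget figures below are illustrative assumptions to tune for your actual model:

```ts
// Pass/fail check: does the retrieved content plus reserved overhead
// (question, system prompt, response headroom) fit the context window?
function fitsContext(
  nodeCharLengths: number[],
  contextWindowTokens: number,
  reservedTokens: number,
): boolean {
  const contentTokens =
    nodeCharLengths.reduce((sum, chars) => sum + chars, 0) / 4;
  return contentTokens + reservedTokens <= contextWindowTokens;
}

// Four 400-token chunks fit an 8k window with room to spare:
console.log(fitsContext([1600, 1600, 1600, 1600], 8192, 1000)); // true
// Four 2000-token chunks do not:
console.log(fitsContext([8000, 8000, 8000, 8000], 8192, 1000)); // false
```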
Prevention
- Keep retrieval chunks small and consistent.
- Start with `similarityTopK` between 3 and 5, then increase only if you've measured headroom.
- Treat chat history as bounded state; trim or summarize it before it reaches the LLM.
If you’re building production workflows in TypeScript with LlamaIndex, assume every extra token has a cost. The fix is usually not “use a bigger model forever”; it’s to control what gets packed into the final prompt.
By Cyprian Aarons, AI Consultant at Topiax.