How to Fix 'token limit exceeded in production' in LlamaIndex (TypeScript)

By Cyprian Aarons · Updated 2026-04-21

When you see token limit exceeded in production in a LlamaIndex TypeScript app, it usually means one of your prompts, retrieved chunks, or chat history is too large for the model's context window. In practice, this shows up when your index returns too much text, or when you keep appending chat messages without ever trimming them.

The failure is usually not in the model itself. It’s almost always in how you build the query payload before calling queryEngine.query() or chatEngine.chat().

The Most Common Cause

The #1 cause is sending too much retrieved context into the LLM.

In LlamaIndex TS, this often happens when you use a retriever with large chunks and no post-processing, then pass everything into a synthesizer that tries to stuff all of it into one prompt. The error often looks like this:

  • Error: token limit exceeded
  • context length exceeded
  • This model's maximum context length is ... tokens
  • BadRequestError: Request too large for gpt-4o-mini in tokens

Broken vs fixed pattern

| Broken | Fixed |
| --- | --- |
| Retrieve too many large nodes and stuff them directly into synthesis | Limit retrieval depth and trim chunk size |
| No token-aware filtering | Use a postprocessor or smaller chunks |
| Default prompt assembly with no budget control | Enforce a context budget before query |

```ts
import { VectorStoreIndex } from "llamaindex";

// Broken: large chunks + broad retrieval + no filtering
const index = await VectorStoreIndex.fromDocuments(docs);

const queryEngine = index.asQueryEngine({
  similarityTopK: 10,
});

const response = await queryEngine.query({
  query: "Summarize all customer policy exceptions from the uploaded documents.",
});

console.log(response.toString());
```

```ts
import {
  VectorStoreIndex,
  SentenceSplitter,
  SimilarityPostprocessor,
} from "llamaindex";

// Fixed: smaller chunks + tighter retrieval + token-aware filtering
const splitter = new SentenceSplitter({
  chunkSize: 512,
  chunkOverlap: 50,
});

const index = await VectorStoreIndex.fromDocuments(docs, {
  transformations: [splitter],
});

const queryEngine = index.asQueryEngine({
  similarityTopK: 3,
  nodePostprocessors: [
    new SimilarityPostprocessor({ similarityCutoff: 0.8 }),
  ],
});

const response = await queryEngine.query({
  query: "Summarize the policy exceptions relevant to claims handling.",
});

console.log(response.toString());
```

If your application needs broad recall, don’t solve it by increasing topK. That just moves the failure downstream into the prompt builder.
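
If you do need broader recall, retrieve more candidates but cap what you actually pass to synthesis. A minimal sketch in plain TypeScript, not a LlamaIndex API; the budget value and the ~4 characters per token heuristic are illustrative assumptions:

```ts
// Illustrative helper, not part of LlamaIndex: keep the highest-ranked
// chunks until a rough character budget is exhausted, drop the rest.
const MAX_CONTEXT_CHARS = 6_000; // ~1,500 tokens at ~4 chars per token

function trimToBudget(chunks: string[], budget = MAX_CONTEXT_CHARS): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    if (used + chunk.length > budget) break; // stop before overflowing
    kept.push(chunk);
    used += chunk.length;
  }
  return kept;
}
```

Pass the chunks in retrieval-score order so the budget keeps the most relevant text first.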

Other Possible Causes

1. Chat history is growing without bounds

This happens in CondenseQuestionChatEngine or any custom chat loop that keeps every message forever.

```ts
// Broken
messages.push({ role: "user", content });
messages.push({ role: "assistant", content: answer });

await chatEngine.chat({ message: content, chatHistory: messages });
```

```ts
// Fixed
const trimmedHistory = messages.slice(-6);

await chatEngine.chat({
  message: content,
  chatHistory: trimmedHistory,
});
```

If you’re building a support bot, keep only the last few turns plus a running summary.
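
One way to implement that: fold older turns into a rolling summary and keep only a recent window verbatim. A sketch; `summarize` is a hypothetical function standing in for any cheap LLM call or extractive summarizer, and `KEEP_RECENT` is an arbitrary choice:

```ts
type ChatMessage = { role: "user" | "assistant" | "system"; content: string };

const KEEP_RECENT = 6; // turns kept verbatim

async function compactHistory(
  messages: ChatMessage[],
  summarize: (text: string) => Promise<string>, // hypothetical summarizer
): Promise<ChatMessage[]> {
  if (messages.length <= KEEP_RECENT) return messages;

  const older = messages.slice(0, -KEEP_RECENT);
  const recent = messages.slice(-KEEP_RECENT);

  // Fold everything older than the window into one short system message.
  const summary = await summarize(
    older.map((m) => `${m.role}: ${m.content}`).join("\n"),
  );

  return [
    { role: "system", content: `Conversation so far: ${summary}` },
    ...recent,
  ];
}
```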

2. Your document chunks are too large

Large source chunks create oversized retrieval nodes. Even with similarityTopK: 2, two huge nodes can blow past the model limit: at chunkSize: 2048, just two retrieved nodes contribute roughly 4,000 tokens before you add the system prompt, question, and history.

```ts
// Broken
new SentenceSplitter({ chunkSize: 2048, chunkOverlap: 200 });
```

```ts
// Fixed
new SentenceSplitter({ chunkSize: 400, chunkOverlap: 40 });
```

For legal, insurance, and banking docs, smaller chunks usually produce better retrieval anyway.

3. You’re using a model with a smaller context window than your prompt assumes

A prompt that fits comfortably on one deployment target may fail on another with a tighter context window, or hit a provider's per-request token cap even when the window is nominally large.

```ts
import { OpenAI } from "llamaindex"; // "@llamaindex/openai" in newer releases

// Broken assumption: same prompt everywhere
const llm = new OpenAI({ model: "gpt-4o-mini" });
```

```ts
// Fixed: match prompt size to model capacity
const llm = new OpenAI({
  model: "gpt-4o",
});
```

If you must stay on a smaller model, reduce retrieved context and compress prompts aggressively.
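
One way to make that explicit is to key your budgets off the target model. A sketch; the limits below are illustrative placeholders, so confirm the real values in your provider's documentation:

```ts
// Illustrative per-model input limits; verify against your provider's docs.
const MAX_INPUT_TOKENS: Record<string, number> = {
  "gpt-4o": 128_000,
  "gpt-4o-mini": 128_000,
  "gpt-3.5-turbo": 16_000,
};

// Leave headroom for the model's answer plus a safety margin.
function contextBudget(model: string, reservedForOutput = 1_000): number {
  const max = MAX_INPUT_TOKENS[model] ?? 8_000; // conservative default
  return Math.floor(max * 0.8) - reservedForOutput;
}
```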

4. Prompt templates are injecting too much static text

This is common when teams paste policy manuals, schema dumps, or long system instructions into every request.

```ts
// Broken
const systemPrompt = `
You are an assistant.
Here is the full claims handbook:
${hugeHandbookText}
`;
```

```ts
// Fixed
const systemPrompt = `
You are an assistant.
Use only the retrieved context below.
If the answer is missing, say so.
`;
```

Keep static prompts short. Put long reference material in retrieval, not in every call.
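
One way to relocate that reference material, reusing the indexing pattern from the earlier examples; this sketch assumes `hugeHandbookText` from the broken snippet above:

```ts
import { Document, VectorStoreIndex } from "llamaindex";

// Index the handbook once so retrieval supplies only relevant passages,
// instead of inlining the full text into every system prompt.
const handbookIndex = await VectorStoreIndex.fromDocuments([
  new Document({ text: hugeHandbookText }),
]);

const handbookEngine = handbookIndex.asQueryEngine({ similarityTopK: 3 });
```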

How to Debug It

  1. Log token-heavy inputs before the query
     • Print the final prompt length, retrieved node count, and chat history size.
     • If you can’t inspect the assembled prompt, add logging around your query engine wrapper.
  2. Reduce one variable at a time
     • Set similarityTopK to 1.
     • Cut chunk size in half.
     • Trim chat history to the last two turns.
     • Retry after each change until the error disappears.
  3. Check which LlamaIndex component is assembling context
     • VectorStoreIndex.asQueryEngine()
     • CondenseQuestionChatEngine
     • custom retriever + synthesizer pipeline
     The failing layer tells you where to enforce limits.
  4. Compare against model limits
     • Confirm your target model’s max input tokens.
     • Estimate total tokens from the system prompt, the user question, the retrieved nodes, and the chat history.
     If that sum is close to the limit, you’ve found the problem.
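
A minimal sketch of that estimate, using the rough ~4 characters per token heuristic; it is approximate, and a real tokenizer gives exact counts:

```ts
// Rough token estimate: ~4 characters per token for English text.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

function logPromptBudget(parts: Record<string, string>): number {
  let total = 0;
  for (const [name, text] of Object.entries(parts)) {
    const tokens = estimateTokens(text);
    total += tokens;
    console.log(`${name}: ~${tokens} tokens`);
  }
  console.log(`total: ~${total} tokens`);
  return total;
}

// Usage, with your app's own strings:
// logPromptBudget({ systemPrompt, question, retrievedContext, chatHistory });
```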

Prevention

  • Use smaller chunks by default:
    • start around chunkSize: 300–600
    • increase only if retrieval quality drops
  • Cap retrieval aggressively:
    • keep similarityTopK low unless you have a reranker or summarizer stage
  • Add token budgets at the application layer:
    • trim chat history
    • truncate long documents before indexing
    • reject oversized requests early with a clear error message
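
A sketch of that reject-early guardrail, reusing the rough estimator from the debugging section; `MAX_INPUT_TOKEN_BUDGET` is an illustrative name and value:

```ts
const MAX_INPUT_TOKEN_BUDGET = 6_000; // illustrative; derive from your model

function assertWithinBudget(prompt: string): void {
  const estimated = Math.ceil(prompt.length / 4); // ~4 chars per token
  if (estimated > MAX_INPUT_TOKEN_BUDGET) {
    // Fail fast with an actionable message instead of a provider-side 400.
    throw new Error(
      `Prompt too large: ~${estimated} tokens (budget ${MAX_INPUT_TOKEN_BUDGET}). ` +
        "Trim chat history or reduce retrieved context.",
    );
  }
}
```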

If this error appears in production once, assume it will happen again under heavier traffic or longer conversations. Put guardrails around retrieval and prompt assembly now, not after your next incident.

