How to Fix 'token limit exceeded' in LlamaIndex (TypeScript)

By Cyprian Aarons · Updated 2026-04-21
Tags: token-limit-exceeded, llamaindex, typescript

If you’re seeing Error: token limit exceeded in LlamaIndex TypeScript, it means the text you’re sending into the model is larger than the model’s context window or larger than the token budget LlamaIndex reserved for that step. This usually shows up during indexing, retrieval, or chat when a document chunk, prompt, or accumulated conversation history gets too large.

In practice, this is almost always a chunking or prompt-construction problem, not a model problem. The fix is usually to reduce input size, cap retrieval results, or stop stuffing too much text into one call.
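Before debugging anything else, it helps to get a rough sense of how big your inputs actually are. The sketch below uses the common 4-characters-per-token heuristic for English text; it is an approximation, not the model's real tokenizer, and rawDocumentText is a placeholder for whatever you are sending:

// Rough token estimate: ~4 characters per token for English text.
// This is a heuristic smoke test, not the model's actual tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Compare against your model's context window (e.g. 8k or 128k tokens)
console.log(estimateTokens(rawDocumentText));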

The Most Common Cause

The #1 cause is loading full documents into an index without proper chunking, then asking LlamaIndex to synthesize over too much text at once.

This often happens when you pass Document objects containing large raw files to VectorStoreIndex.fromDocuments(), while using defaults that don’t match your model’s context window.

Broken vs fixed pattern

Broken | Fixed
------ | ------
Load huge docs and query directly | Split into smaller chunks before indexing
Let retrieval return too many nodes | Cap similarityTopK / retriever count
Use default prompt size assumptions | Keep response synthesis under budget
import { Document, VectorStoreIndex } from "llamaindex";

// ❌ Broken: huge raw document, no chunking strategy
const docs = [
  new Document({
    text: bigPolicyPdfText,
    metadata: { source: "policy.pdf" },
  }),
];

const index = await VectorStoreIndex.fromDocuments(docs);
const queryEngine = index.asQueryEngine();

const result = await queryEngine.query({
  query: "Summarize the claims exclusions.",
});
console.log(result.toString());
import {
  Document,
  VectorStoreIndex,
  SentenceSplitter,
} from "llamaindex";

// ✅ Fixed: chunk first, then index
const docs = [
  new Document({
    text: bigPolicyPdfText,
    metadata: { source: "policy.pdf" },
  }),
];

const splitter = new SentenceSplitter({
  chunkSize: 800,
  chunkOverlap: 100,
});

const nodes = await splitter.getNodesFromDocuments(docs);

// Build the index from the pre-split nodes; fromDocuments expects
// Document objects, so use init with nodes instead
const index = await VectorStoreIndex.init({ nodes });

const queryEngine = index.asQueryEngine({
  similarityTopK: 3,
});

const result = await queryEngine.query({
  query: "Summarize the claims exclusions.",
});
console.log(result.toString());

If you see errors like Token limit exceeded, Context length exceeded, or a failure from a ResponseSynthesizer, this is the first place to look.

Other Possible Causes

1) Retrieval returns too many nodes

If your retriever pulls back 10–20 chunks, LlamaIndex may try to pack them all into one synthesis prompt.

const queryEngine = index.asQueryEngine({
  similarityTopK: 10, // often too high for long chunks
});

Fix it by lowering the retrieval count:

const queryEngine = index.asQueryEngine({
  similarityTopK: 3,
});

2) Chat history is growing without bounds

If you’re using a chat engine and keep appending messages forever, the history becomes part of every request.

// ❌ Broken: unbounded conversation history
await chatEngine.chat({ message: userInput });

Use a memory window or trim old messages before each call:

// ✅ Better: context chat engine, paired with a bounded history
const chatEngine = index.asChatEngine({
  chatMode: "context",
});

await chatEngine.chat({ message: userInput });

If your implementation stores messages manually, keep only the last N turns.
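A minimal sketch of that trimming; the message shape and the MAX_TURNS value are illustrative, not LlamaIndex types:

// Keep only the last N turns of a manually managed history.
// `Message` is a plain illustrative shape, not a library type.
type Message = { role: "user" | "assistant"; content: string };

const MAX_TURNS = 8; // keep the last 8 user/assistant exchanges

function trimHistory(history: Message[]): Message[] {
  // One turn = a user message plus the assistant reply
  return history.slice(-MAX_TURNS * 2);
}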

3) Prompt templates are too verbose

A long system prompt plus retrieved context can push you over the limit fast.

// ❌ Too much instruction text
const prompt = `
You are an expert insurance assistant.
Follow these rules...
[2000 more characters]
`;

Shorten it and move static instructions out of the main context when possible:

// ✅ Shorter prompt
const prompt = `
Answer using only the provided context.
If missing, say you don't know.
`;

4) Your document parser produces oversized chunks

PDF extraction and OCR often create giant paragraphs with no natural breaks. That means your splitter has nothing useful to work with.

// Problematic if parser returns one massive string
const doc = new Document({ text: extractedPdfText });

Add explicit splitting and inspect output size:

console.log("Chars:", extractedPdfText.length);

If needed, lower chunkSize aggressively:

new SentenceSplitter({
  chunkSize: 500,
  chunkOverlap: 50,
});

How to Debug It

  1. Check where the error happens

    • Indexing?
    • Retrieval?
    • Chat response synthesis?
    • If it fails during VectorStoreIndex.fromDocuments(), your chunks are probably too large.
    • If it fails during .query(), your retrieval set or prompt is too big.
  2. Log chunk sizes before indexing

    • Print character counts for each node.
    • Any chunk over a few thousand characters is worth inspecting.
nodes.forEach((node, i) => {
  console.log(i, node.text.length);
});
  3. Reduce retrieval scope

    • Set similarityTopK to 1 or 3.
    • If the error disappears, your issue is context packing.
  4. Strip down prompts and chat history

    • Remove custom system prompts.
    • Clear conversation state.
    • Retry with a single short question against one small document (see the minimal repro below).
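
A minimal reproduction along those lines, using the same APIs as the fixed example above:

import { Document, VectorStoreIndex } from "llamaindex";

// One tiny document, tight retrieval, one short question.
// If this succeeds, the original failure is an input-size problem.
const index = await VectorStoreIndex.fromDocuments([
  new Document({ text: "Claims for pre-existing conditions are excluded." }),
]);

const queryEngine = index.asQueryEngine({ similarityTopK: 1 });
const result = await queryEngine.query({ query: "What is excluded?" });
console.log(result.toString());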

Prevention

  • Use explicit chunking defaults in every project:
    • chunkSize: 500–1000
    • chunkOverlap: 50–150
  • Keep retrieval tight:
    • Start with similarityTopK: 2 or 3
  • Add token budgeting checks in tests (see the sketch after this list):
    • Validate document size before indexing
    • Validate prompt length before calling the model
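
A sketch of what those checks can look like. The budgets, the chars/4 estimate, and the rawDocumentText / finalPrompt names are all illustrative, not LlamaIndex APIs:

const MAX_DOC_TOKENS = 6_000;    // illustrative budget for a single document
const MAX_PROMPT_TOKENS = 3_000; // illustrative budget for the final prompt

function assertUnderBudget(text: string, limit: number, label: string): void {
  const tokens = Math.ceil(text.length / 4); // rough estimate, not a real tokenizer
  if (tokens > limit) {
    throw new Error(`${label} is ~${tokens} tokens, over the ${limit}-token budget`);
  }
}

// Run these in tests or at ingestion time, before anything hits the model
assertUnderBudget(rawDocumentText, MAX_DOC_TOKENS, "document");
assertUnderBudget(finalPrompt, MAX_PROMPT_TOKENS, "prompt");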

If you want a stable production setup, treat token budget as an input contract. Don’t let raw PDFs, unlimited chat history, and wide retrieval all hit the model in one shot.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

Get the Starter Kit
