How to Fix 'rate limit exceeded' in LlamaIndex (TypeScript)
When you see rate limit exceeded in LlamaIndex TypeScript, it usually means your app is sending too many requests to the underlying model provider in a short window. In practice, this shows up during document ingestion, query loops, chat history replay, or any code path that accidentally triggers repeated LLM calls.
The important bit: LlamaIndex is usually not the thing enforcing the limit. OpenAI, Anthropic, Azure OpenAI, or another provider is returning the error, and LlamaIndex is just surfacing it through classes like OpenAI, Anthropic, OpenAIEmbedding, or query engines built on top of them.
The Most Common Cause
The #1 cause is calling the LLM inside a loop without batching, caching, or throttling.
This happens a lot when developers iterate over documents one by one and call index.insert(), queryEngine.query(), or an embedding function for every item. Each call looks harmless until the provider starts returning something like:
- `Error: 429 Rate limit exceeded`
- `openai.RateLimitError: 429 You exceeded your current quota`
- `Anthropic error: rate_limit_exceeded`
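If you want to confirm the failure really is a provider-side 429, catch the error and inspect it. A minimal sketch, assuming an `engine` built as in the examples below and an SDK error that exposes an HTTP status (error shapes vary by provider and version):

```ts
try {
  const response = await engine.query("Summarize these documents");
  console.log(response.toString());
} catch (err: any) {
  // Assumption: the provider SDK attaches an HTTP status to the error.
  const isRateLimit =
    err?.status === 429 || /rate.?limit/i.test(String(err?.message));
  if (!isRateLimit) throw err;
  console.error("Provider rate limit hit:", err.message);
}
```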
Wrong pattern vs right pattern
Broken:

```ts
import { Document, VectorStoreIndex } from "llamaindex";

const docs = rawDocs.map((text) => new Document({ text }));

// Every iteration re-embeds one document, builds a throwaway index,
// and fires another LLM call.
for (const doc of docs) {
  const index = await VectorStoreIndex.fromDocuments([doc]);
  const engine = index.asQueryEngine();
  const response = await engine.query("Summarize this");
  console.log(response.toString());
}
```

Fixed:

```ts
import { Document, VectorStoreIndex } from "llamaindex";

const docs = rawDocs.map((text) => new Document({ text }));

const index = await VectorStoreIndex.fromDocuments(docs);
const engine = index.asQueryEngine();
const response = await engine.query("Summarize these documents");
console.log(response.toString());
```
The broken version creates a fresh index per document and forces repeated embedding + retrieval + generation calls. The fixed version batches documents into one index build and issues one query.
If you need per-document processing, keep the index stable and throttle the work:
```ts
import pLimit from "p-limit";
import { VectorStoreIndex } from "llamaindex";

// Allow at most two queries in flight at a time.
const limit = pLimit(2);

const index = await VectorStoreIndex.fromDocuments(docs);
const engine = index.asQueryEngine();

const results = await Promise.all(
  docs.map((doc) =>
    limit(() => engine.query(`Summarize: ${doc.text}`))
  )
);
```
Other Possible Causes
1. Embedding too many chunks too quickly
If you ingest large files with aggressive chunking, you can hit embedding limits before you even query anything.
```ts
import { SentenceSplitter } from "llamaindex";

// Aggressive chunking: small chunks multiply the number of
// embedding calls needed for the same corpus.
const splitter = new SentenceSplitter({
  chunkSize: 200,
  chunkOverlap: 50,
});
```
Too-small chunks create more embedding requests. Increase chunk size if your use case allows it.
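As a sketch, raising the splitter settings (the 1024/128 values below are illustrative assumptions, not a recommendation) produces fewer chunks, and therefore fewer embedding requests, for the same corpus:

```ts
import { SentenceSplitter } from "llamaindex";

// Fewer, larger chunks mean fewer embedding requests overall.
const splitter = new SentenceSplitter({
  chunkSize: 1024,
  chunkOverlap: 128,
});
```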
2. Multiple concurrent requests from Promise.all
Promise.all() is a common way to trigger burst traffic.
```ts
// Bad: fires all queries at once
await Promise.all(
  questions.map((q) => engine.query(q))
);
```

```ts
// Better: limit concurrency
import pLimit from "p-limit";

const limit = pLimit(3);
await Promise.all(
  questions.map((q) => limit(() => engine.query(q)))
);
```
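If you’d rather not add a dependency, a hand-rolled alternative is to run the queries in sequential batches. The `queryInBatches` helper below is hypothetical, not part of LlamaIndex:

```ts
// Hypothetical helper: runs at most `batchSize` queries at a time,
// waiting for each batch to settle before starting the next.
async function queryInBatches<T>(
  questions: string[],
  run: (q: string) => Promise<T>,
  batchSize = 3
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < questions.length; i += batchSize) {
    const batch = questions.slice(i, i + batchSize);
    results.push(...(await Promise.all(batch.map(run))));
  }
  return results;
}

const answers = await queryInBatches(questions, (q) => engine.query(q));
```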
3. Rebuilding the index on every request
This is expensive and easy to miss in API handlers.
```ts
// Bad: rebuilds embeddings every request
export async function handler(req: Request) {
  const index = await VectorStoreIndex.fromDocuments(docs);
  return index.asQueryEngine().query(req.body.question);
}
```
Cache the index outside the handler if your data is stable.
```ts
let cachedIndex: VectorStoreIndex | null = null;

export async function getIndex() {
  if (!cachedIndex) {
    cachedIndex = await VectorStoreIndex.fromDocuments(docs);
  }
  return cachedIndex;
}
```
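Wired into the handler from above (keeping the same hypothetical `req.body` shape), every request after the first reuses the cached index:

```ts
export async function handler(req: Request) {
  const index = await getIndex(); // built once, reused afterwards
  const engine = index.asQueryEngine();
  const response = await engine.query(req.body.question);
  return response.toString();
}
```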
4. Provider-side quotas are actually exhausted
Sometimes this has nothing to do with burst traffic. Your org may be out of quota or on a low RPM/TPM tier.
Check your provider config carefully:
```ts
import { OpenAI } from "llamaindex";

const llm = new OpenAI({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_API_KEY,
});
```
If the key belongs to a shared org, another service may be consuming the budget.
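Separately from the key itself, you can often tune client-side retry behavior. Depending on your llamaindex version, the OpenAI class may forward `maxRetries` and `timeout` to the underlying SDK; the option names below are an assumption, so check your installed version’s types:

```ts
import { OpenAI } from "llamaindex";

// Assumption: `maxRetries` and `timeout` pass through to the OpenAI SDK
// client in recent llamaindex versions; verify against your types.
const llm = new OpenAI({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_API_KEY,
  maxRetries: 5,   // SDK-level retries for transient 429s
  timeout: 60_000, // request timeout in milliseconds
});
```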
How to Debug It
1. Find the exact class that fails. Look at the stack trace. If it points to `OpenAI.complete`, `OpenAI.chat`, `OpenAIEmbedding.getTextEmbedding`, or a retriever/query-engine wrapper, you know which layer is generating traffic.
2. Count calls per request. Add logging around every query and embedding path. If one API request triggers dozens of LLM calls, you’ve found your issue.

   ```ts
   console.log("query start", { question });
   const response = await engine.query(question);
   console.log("query end");
   ```

3. Check concurrency. Search for `Promise.all`, parallel ingestion jobs, cron workers, or queue consumers. Burst concurrency is the fastest way to hit provider limits.
4. Inspect provider headers and dashboard. OpenAI and Anthropic dashboards will show usage spikes. If requests are low but failures still happen, you’re likely hitting org quota or model-specific RPM/TPM caps. A quick probe for this is sketched after this list.
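For the header check, a one-off probe like this (using OpenAI’s documented rate-limit headers; other providers use different names) shows how much headroom your key actually has:

```ts
// One-off probe: call OpenAI directly and log its rate-limit headers.
const res = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "ping" }],
    max_tokens: 1,
  }),
});

console.log({
  status: res.status,
  requestLimit: res.headers.get("x-ratelimit-limit-requests"),
  requestsLeft: res.headers.get("x-ratelimit-remaining-requests"),
  tokensLeft: res.headers.get("x-ratelimit-remaining-tokens"),
});
```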
Prevention
- Batch document ingestion instead of building indexes inside loops.
- Put concurrency limits on all LLM and embedding calls using `p-limit` or a queue worker.
- Cache indexes and embeddings when source data does not change frequently.
- Set retries with backoff for transient 429 responses (see the sketch below), but don’t use retries to mask bad call patterns.
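For the retry bullet, here is a minimal backoff sketch, assuming a generic async call and an error that exposes an HTTP status (adapt both to your SDK):

```ts
// Hypothetical helper: retries `fn` on 429s with exponential backoff.
async function withBackoff<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const isRateLimit =
        err?.status === 429 || /rate.?limit/i.test(String(err?.message));
      if (!isRateLimit || attempt >= maxAttempts) throw err;
      // 1s, 2s, 4s, ... plus jitter so concurrent callers don't retry in sync.
      const delay = 1000 * 2 ** (attempt - 1) + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

const response = await withBackoff(() => engine.query(question));
```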
A good rule: if one user action can trigger more than a handful of model calls, instrument it first. In LlamaIndex TypeScript, most rate limit exceeded errors are self-inflicted by call patterns, not by the framework itself.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.