# How to Fix "rate limit exceeded when scaling" in LlamaIndex (TypeScript)
When you see `rate limit exceeded when scaling` in a LlamaIndex TypeScript app, it usually means your code is creating too many concurrent LLM or embedding requests. In practice, this shows up during ingestion, query fan-out, or when you scale from one request to many without adding backpressure.
The key thing: this is rarely a “LlamaIndex bug.” It’s usually a concurrency problem, a provider quota problem, or both.
## The Most Common Cause
The #1 cause is unbounded parallelism. In TypeScript, the usual pattern is `Promise.all(...)` over a large set of chunks, documents, or queries. That looks fine in local testing, then blows up when you scale because every task hits the model at once.
Here’s the broken pattern, then the fixed one:

Broken:

```ts
import { OpenAIEmbedding } from "@llamaindex/openai";
import { Document, Settings } from "llamaindex";

Settings.embedModel = new OpenAIEmbedding({
  model: "text-embedding-3-small",
});

const docs = await loadDocuments();

const nodes = await Promise.all(
  docs.map(async (doc) => {
    // Each doc triggers an embedding request immediately
    return await splitAndEmbed(doc);
  })
);
```

Fixed:

```ts
import pLimit from "p-limit";
import { OpenAIEmbedding } from "@llamaindex/openai";
import { Settings } from "llamaindex";

Settings.embedModel = new OpenAIEmbedding({
  model: "text-embedding-3-small",
});

const limit = pLimit(3); // keep concurrency bounded

const docs = await loadDocuments();

const nodes = await Promise.all(
  docs.map((doc) => limit(() => splitAndEmbed(doc)))
);
```
If you’re using `VectorStoreIndex.fromDocuments(...)`, the same issue can happen under the hood when document ingestion fans out into many embedding calls. The fix is to reduce concurrency at the application layer before you call into LlamaIndex.
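Even without `p-limit`, coarse batching at the application layer helps. Here is a minimal pacing sketch, assuming the same hypothetical `loadDocuments()` and `splitAndEmbed()` helpers as in the examples above; the batch size and pause are tuning knobs, not magic numbers:

```ts
// Pace ingestion in small sequential batches so the fan-out stays bounded.
const BATCH_SIZE = 10;  // embedding calls allowed per batch
const PAUSE_MS = 1_000; // breathing room between batches

const docs = await loadDocuments();
const nodes: unknown[] = [];

for (let i = 0; i < docs.length; i += BATCH_SIZE) {
  const batch = docs.slice(i, i + BATCH_SIZE);
  // Parallelism inside a batch is capped at BATCH_SIZE
  nodes.push(...(await Promise.all(batch.map((d) => splitAndEmbed(d)))));
  if (i + BATCH_SIZE < docs.length) {
    await new Promise((r) => setTimeout(r, PAUSE_MS));
  }
}
```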
A real failure often looks like this:
```txt
Error: Rate limit exceeded when scaling
    at OpenAIEmbedding.getTextEmbedding ...
    at VectorStoreIndex.fromDocuments ...
```

Or:

```txt
Error: 429 Too Many Requests
    at OpenAI.chat.completions.create ...
```
Same root cause, different surface area.
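Because the surface text varies by provider, it helps to classify throttling errors in one place before deciding to back off. A minimal heuristic sketch; error shapes differ across SDK versions, so treat the checks as assumptions to adapt:

```ts
// Heuristic check for throttling errors. Provider SDKs differ in how they
// surface status codes, so string matching is the lowest common denominator.
function isRateLimitError(e: unknown): boolean {
  const msg = e instanceof Error ? e.message : String(e);
  return msg.includes("429") || /rate.?limit/i.test(msg);
}
```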
## Other Possible Causes
### 1) Your chunking strategy creates too many requests
If your splitter produces hundreds or thousands of tiny chunks, you multiply embedding calls fast.
```ts
import { SentenceSplitter } from "llamaindex";

const splitter = new SentenceSplitter({
  chunkSize: 200,
  chunkOverlap: 0,
});
```
A more stable setting is usually fewer, larger chunks:
```ts
const splitter = new SentenceSplitter({
  chunkSize: 800,
  chunkOverlap: 100,
});
```
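To see why this matters, a rough back-of-envelope helps. Treating `chunkSize` as a simple length budget (the splitter actually works on sentences and tokens, so this is only an approximation):

```ts
// Approximate chunk count for a corpus: embedding requests scale inversely
// with the effective step size (chunkSize minus chunkOverlap).
function estimateChunkCount(
  totalLength: number,
  chunkSize: number,
  chunkOverlap: number
): number {
  const step = chunkSize - chunkOverlap;
  return Math.ceil(totalLength / step);
}

estimateChunkCount(1_000_000, 200, 0);   // ~5,000 embedding calls
estimateChunkCount(1_000_000, 800, 100); // ~1,429 embedding calls
```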
### 2) You are scaling queries without limiting retrieval fan-out
If one user request triggers multiple sub-queries, tool calls, or rerank steps, you can hit rate limits even if ingestion is fine.
```ts
// Bad: firing multiple independent queries at once
await Promise.all([
  queryEngine.query("Question A"),
  queryEngine.query("Question B"),
  queryEngine.query("Question C"),
]);
```
Throttle it:
```ts
for (const q of ["Question A", "Question B", "Question C"]) {
  const result = await queryEngine.query(q);
}
```
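If strictly sequential is too slow, the middle ground is bounded parallelism, the same idea as the ingestion fix. A sketch with `p-limit`, reusing the `queryEngine` from above:

```ts
import pLimit from "p-limit";

// Allow at most two queries in flight at once
const limit = pLimit(2);

const questions = ["Question A", "Question B", "Question C"];
const results = await Promise.all(
  questions.map((q) => limit(() => queryEngine.query(q)))
);
```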
### 3) Your provider quota is lower than your traffic burst
Sometimes the code is fine and your API plan simply can’t handle the spike.
Check your config and environment:
```txt
OPENAI_API_KEY=...
OPENAI_ORG=...
```
If you’re on Azure OpenAI or another provider through LlamaIndex adapters, inspect deployment-level limits too. The error may appear as `429`, `rate_limit_exceeded`, or provider-specific throttling text.
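Many providers also include a `Retry-After` header on 429 responses. If your SDK exposes response headers on the thrown error (this varies by version, so the property path below is an assumption), you can honor it instead of guessing:

```ts
// Attempt to read a Retry-After header (in seconds) from a thrown error.
// The e.response.headers path is an assumption; adjust for your SDK.
function retryAfterMs(e: any): number | null {
  const header = e?.response?.headers?.["retry-after"];
  const seconds = Number(header);
  return Number.isFinite(seconds) ? seconds * 1000 : null;
}
```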
### 4) Retries are amplifying traffic instead of smoothing it
A bad retry policy can make rate limiting worse. If every failed call retries immediately across many workers, you create a retry storm.
```ts
// Bad: immediate retries with no backoff
async function run() {
  for (let i = 0; i < 5; i++) {
    try {
      return await engine.query("summarize this");
    } catch (e) {
      if (i === 4) throw e;
    }
  }
}
```
Use exponential backoff with jitter:
```ts
async function sleep(ms: number) {
  return new Promise((r) => setTimeout(r, ms));
}

async function retry<T>(fn: () => Promise<T>, attempts = 5): Promise<T> {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (e) {
      // Out of attempts: surface the original error without sleeping again
      if (i === attempts - 1) throw e;
      // Exponential backoff capped at 15s, plus up to 250ms of jitter
      const delay = Math.min(1000 * 2 ** i, 15000);
      await sleep(delay + Math.floor(Math.random() * 250));
    }
  }
  throw new Error("unreachable");
}
```
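Usage is a thin wrapper around any provider-bound call, using the same `engine` from the bad example above:

```ts
// Wrap any call that can hit the provider's rate limit
const answer = await retry(() => engine.query("summarize this"));
```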
## How to Debug It

1. **Find where the burst happens.**
   - Log around `VectorStoreIndex.fromDocuments`, `embedModel.getTextEmbedding`, and `queryEngine.query`.
   - If the error appears during ingestion, it’s probably embeddings.
   - If it appears during chat/query time, it’s probably completion fan-out.
2. **Measure concurrency.**
   - Count how many async tasks run at once (see the in-flight counter sketch after this list).
   - If you see `Promise.all` over dozens or hundreds of items, that’s your first fix.
3. **Check whether it’s provider-limited.**
   - Run the same workload with half the input size.
   - If smaller batches succeed and bigger ones fail with `429 Too Many Requests`, you’re hitting quota or RPM/TPM limits.
4. **Inspect retries and worker count.**
   - If you have queue workers, serverless instances, or cron jobs all running the same pipeline, they may be multiplying load.
   - One worker plus bad retries can be enough to trigger `Error: rate limit exceeded when scaling`.
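For step 2, you don’t need tooling to measure concurrency; a tiny in-flight counter is enough. A sketch to wrap around whatever function actually calls the model:

```ts
// Track how many wrapped calls are in flight, and the peak ever reached.
let inFlight = 0;
let peak = 0;

async function tracked<T>(fn: () => Promise<T>): Promise<T> {
  inFlight++;
  peak = Math.max(peak, inFlight);
  try {
    return await fn();
  } finally {
    inFlight--;
  }
}

// Example: await tracked(() => splitAndEmbed(doc));
// After the run, `peak` is your real maximum concurrency.
```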
## Prevention

- Cap concurrency everywhere you call LLMs or embeddings.
- Batch ingestion jobs and keep chunk sizes reasonable.
- Add exponential backoff with jitter for all provider calls.
- Treat `Promise.all` as a red flag when it wraps model calls.
- Load test against real provider limits before shipping to production.
If you want one rule to remember: don’t let application parallelism exceed model throughput. In LlamaIndex TypeScript apps, that gap is where `rate limit exceeded when scaling` shows up.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit