# How to Fix "rate limit exceeded when scaling" in LlamaIndex (TypeScript)
When you see `rate limit exceeded when scaling` in a LlamaIndex TypeScript app, it usually means your code is creating too many concurrent LLM or embedding requests. In practice, this shows up during ingestion, query fan-out, or when you scale from one request to many without adding backpressure.
The key thing: this is rarely a “LlamaIndex bug.” It’s usually a concurrency problem, a provider quota problem, or both.
## The Most Common Cause
The #1 cause is unbounded parallelism. In TypeScript, the usual pattern is `Promise.all(...)` over a large set of chunks, documents, or queries. That looks fine in local testing, then blows up when you scale because every task hits the model at once.
Here’s the broken pattern, then the fixed one:

Broken:

```ts
import { OpenAIEmbedding } from "@llamaindex/openai";
import { Document, Settings } from "llamaindex";

Settings.embedModel = new OpenAIEmbedding({
  model: "text-embedding-3-small",
});

const docs = await loadDocuments();

const nodes = await Promise.all(
  docs.map(async (doc) => {
    // Each doc triggers an embedding request immediately
    return await splitAndEmbed(doc);
  })
);
```

Fixed:

```ts
import pLimit from "p-limit";
import { OpenAIEmbedding } from "@llamaindex/openai";
import { Settings } from "llamaindex";

Settings.embedModel = new OpenAIEmbedding({
  model: "text-embedding-3-small",
});

const limit = pLimit(3); // keep concurrency bounded

const docs = await loadDocuments();

const nodes = await Promise.all(
  docs.map((doc) => limit(() => splitAndEmbed(doc)))
);
```
If you’re using `VectorStoreIndex.fromDocuments(...)`, the same issue can happen under the hood when document ingestion fans out into many embedding calls. The fix is to reduce concurrency at the application layer before you call into LlamaIndex.
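Even without `p-limit`, coarse batching at the application layer helps. Here is a minimal pacing sketch, assuming the same hypothetical `loadDocuments()` and `splitAndEmbed()` helpers as in the examples above; the batch size and pause are tuning knobs, not magic numbers:

```ts
// Pace ingestion in small sequential batches so the fan-out stays bounded.
const BATCH_SIZE = 10;  // embedding calls allowed per batch
const PAUSE_MS = 1_000; // breathing room between batches

const docs = await loadDocuments();
const nodes: unknown[] = [];

for (let i = 0; i < docs.length; i += BATCH_SIZE) {
  const batch = docs.slice(i, i + BATCH_SIZE);
  // Parallelism inside a batch is capped at BATCH_SIZE
  nodes.push(...(await Promise.all(batch.map((d) => splitAndEmbed(d)))));
  if (i + BATCH_SIZE < docs.length) {
    await new Promise((r) => setTimeout(r, PAUSE_MS));
  }
}
```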
A real failure often looks like this:
```txt
Error: Rate limit exceeded when scaling
    at OpenAIEmbedding.getTextEmbedding ...
    at VectorStoreIndex.fromDocuments ...
```

Or:

```txt
Error: 429 Too Many Requests
    at OpenAI.chat.completions.create ...
```
Same root cause, different surface area.
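Because the surface text varies by provider, it helps to classify throttling errors in one place before deciding to back off. A minimal heuristic sketch; error shapes differ across SDK versions, so treat the checks as assumptions to adapt:

```ts
// Heuristic check for throttling errors. Provider SDKs differ in how they
// surface status codes, so string matching is the lowest common denominator.
function isRateLimitError(e: unknown): boolean {
  const msg = e instanceof Error ? e.message : String(e);
  return msg.includes("429") || /rate.?limit/i.test(msg);
}
```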
## Other Possible Causes
### 1) Your chunking strategy creates too many requests
If your splitter produces hundreds or thousands of tiny chunks, you multiply embedding calls fast.
```ts
import { SentenceSplitter } from "llamaindex";

const splitter = new SentenceSplitter({
  chunkSize: 200,
  chunkOverlap: 0,
});
```
A more stable setting is usually fewer, larger chunks:
```ts
const splitter = new SentenceSplitter({
  chunkSize: 800,
  chunkOverlap: 100,
});
```
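To see why this matters, a rough back-of-envelope helps. Treating `chunkSize` as a simple length budget (the splitter actually works on sentences and tokens, so this is only an approximation):

```ts
// Approximate chunk count for a corpus: embedding requests scale inversely
// with the effective step size (chunkSize minus chunkOverlap).
function estimateChunkCount(
  totalLength: number,
  chunkSize: number,
  chunkOverlap: number
): number {
  const step = chunkSize - chunkOverlap;
  return Math.ceil(totalLength / step);
}

estimateChunkCount(1_000_000, 200, 0);   // ~5,000 embedding calls
estimateChunkCount(1_000_000, 800, 100); // ~1,429 embedding calls
```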
### 2) You are scaling queries without limiting retrieval fan-out
If one user request triggers multiple sub-queries, tool calls, or rerank steps, you can hit rate limits even if ingestion is fine.
```ts
// Bad: firing multiple independent queries at once
await Promise.all([
  queryEngine.query("Question A"),
  queryEngine.query("Question B"),
  queryEngine.query("Question C"),
]);
```
Throttle it:
```ts
for (const q of ["Question A", "Question B", "Question C"]) {
  const result = await queryEngine.query(q);
}
```
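If strictly sequential is too slow, the middle ground is bounded parallelism, the same idea as the ingestion fix. A sketch with `p-limit`, reusing the `queryEngine` from above:

```ts
import pLimit from "p-limit";

// Allow at most two queries in flight at once
const limit = pLimit(2);

const questions = ["Question A", "Question B", "Question C"];
const results = await Promise.all(
  questions.map((q) => limit(() => queryEngine.query(q)))
);
```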
### 3) Your provider quota is lower than your traffic burst
Sometimes the code is fine and your API plan simply can’t handle the spike.
Check your config and environment:
```txt
OPENAI_API_KEY=...
OPENAI_ORG=...
```
If you’re on Azure OpenAI or another provider through LlamaIndex adapters, inspect deployment-level limits too. The error may appear as `429`, `rate_limit_exceeded`, or provider-specific throttling text.
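Many providers also include a `Retry-After` header on 429 responses. If your SDK exposes response headers on the thrown error (this varies by version, so the property path below is an assumption), you can honor it instead of guessing:

```ts
// Attempt to read a Retry-After header (in seconds) from a thrown error.
// The e.response.headers path is an assumption; adjust for your SDK.
function retryAfterMs(e: any): number | null {
  const header = e?.response?.headers?.["retry-after"];
  const seconds = Number(header);
  return Number.isFinite(seconds) ? seconds * 1000 : null;
}
```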
### 4) Retries are amplifying traffic instead of smoothing it
A bad retry policy can make rate limiting worse. If every failed call retries immediately across many workers, you create a retry storm.
```ts
// Bad: immediate retries with no backoff
async function run() {
  for (let i = 0; i < 5; i++) {
    try {
      return await engine.query("summarize this");
    } catch (e) {
      if (i === 4) throw e;
    }
  }
}
```
Use exponential backoff with jitter:
```ts
async function sleep(ms: number) {
  return new Promise((r) => setTimeout(r, ms));
}

async function retry<T>(fn: () => Promise<T>, attempts = 5): Promise<T> {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (e) {
      // Out of attempts: surface the original error without sleeping again
      if (i === attempts - 1) throw e;
      // Exponential backoff capped at 15s, plus up to 250ms of jitter
      const delay = Math.min(1000 * 2 ** i, 15000);
      await sleep(delay + Math.floor(Math.random() * 250));
    }
  }
  throw new Error("unreachable");
}
```
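Usage is a thin wrapper around any provider-bound call, using the same `engine` from the bad example above:

```ts
// Wrap any call that can hit the provider's rate limit
const answer = await retry(() => engine.query("summarize this"));
```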
## How to Debug It

1. **Find where the burst happens.**
   - Log around `VectorStoreIndex.fromDocuments`, `embedModel.getTextEmbedding`, and `queryEngine.query`.
   - If the error appears during ingestion, it’s probably embeddings.
   - If it appears during chat/query time, it’s probably completion fan-out.
2. **Measure concurrency.**
   - Count how many async tasks run at once (see the in-flight counter sketch after this list).
   - If you see `Promise.all` over dozens or hundreds of items, that’s your first fix.
3. **Check whether it’s provider-limited.**
   - Run the same workload with half the input size.
   - If smaller batches succeed and bigger ones fail with `429 Too Many Requests`, you’re hitting quota or RPM/TPM limits.
4. **Inspect retries and worker count.**
   - If you have queue workers, serverless instances, or cron jobs all running the same pipeline, they may be multiplying load.
   - One worker plus bad retries can be enough to trigger `Error: rate limit exceeded when scaling`.
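For step 2, you don’t need tooling to measure concurrency; a tiny in-flight counter is enough. A sketch to wrap around whatever function actually calls the model:

```ts
// Track how many wrapped calls are in flight, and the peak ever reached.
let inFlight = 0;
let peak = 0;

async function tracked<T>(fn: () => Promise<T>): Promise<T> {
  inFlight++;
  peak = Math.max(peak, inFlight);
  try {
    return await fn();
  } finally {
    inFlight--;
  }
}

// Example: await tracked(() => splitAndEmbed(doc));
// After the run, `peak` is your real maximum concurrency.
```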
## Prevention

- Cap concurrency everywhere you call LLMs or embeddings.
- Batch ingestion jobs and keep chunk sizes reasonable.
- Add exponential backoff with jitter for all provider calls.
- Treat `Promise.all` as a red flag when it wraps model calls.
- Load test against real provider limits before shipping to production.
If you want one rule to remember: don’t let application parallelism exceed model throughput. In LlamaIndex TypeScript apps, that gap is where `rate limit exceeded when scaling` shows up.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit