How to Fix 'rate limit exceeded in production' in LlamaIndex (TypeScript)

By Cyprian Aarons. Updated 2026-04-21

What this error actually means

rate limit exceeded in production usually means your app is sending too many requests to the upstream LLM or embedding provider in a short window. In LlamaIndex TypeScript, this often shows up during indexing, query fan-out, retries, or when multiple user requests hit the same model client at once.

The key point: LlamaIndex is rarely the root cause. It is usually amplifying an application-level concurrency problem, a bad retry loop, or an indexing job that is too aggressive for the provider limits.

The Most Common Cause

The #1 cause is uncontrolled concurrency. Developers often call Promise.all() over a list of documents or chunks, which looks fine locally and then explodes in production when traffic increases.

Here’s the broken pattern:

import { OpenAI } from "@llamaindex/openai";
import { Document, VectorStoreIndex } from "llamaindex";

const llm = new OpenAI({ model: "gpt-4o-mini" });

async function indexDocs(docs: string[]) {
  // BAD: every request fires at once, with no cap on in-flight calls
  const summaries = await Promise.all(
    docs.map(async (doc) => {
      const res = await llm.complete({ prompt: `Summarize this document:\n${doc}` });
      return new Document({ text: res.text });
    })
  );

  return VectorStoreIndex.fromDocuments(summaries);
}

And here’s the fixed pattern with bounded concurrency:

import pLimit from "p-limit";
import { OpenAI } from "@llamaindex/openai";
import { Document, VectorStoreIndex } from "llamaindex";

const llm = new OpenAI({ model: "gpt-4o-mini" });
const limit = pLimit(3); // keep this below your provider's burst limit

async function indexDocs(docs: string[]) {
  const summaries = await Promise.all(
    docs.map((doc) =>
      limit(async () => {
        const res = await llm.complete({ prompt: `Summarize this document:\n${doc}` });
        return new Document({ text: res.text });
      })
    )
  );

  return VectorStoreIndex.fromDocuments(summaries);
}

Why this matters:

  • Promise.all() sends every request immediately
  • LlamaIndex does not automatically throttle your app-level fan-out
  • Provider rate limits are usually enforced over short windows (per minute, and often per second for bursts), not just as a total quota on the API key
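
A rough way to size that concurrency cap, assuming your limit is expressed in requests per minute: safe concurrency ≈ (RPM ÷ 60) × average call latency in seconds. For example, with a 300 RPM limit and calls that take about 2 seconds, that is roughly 300 ÷ 60 × 2 = 10 requests in flight; picking a limit somewhat below that (and below any per-second burst cap) leaves headroom for retries and other traffic.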

If you see errors like:

  • 429 Too Many Requests
  • RateLimitError: Rate limit exceeded
  • OpenAIError: The server had an error processing your request. Please try again later.

then concurrency is the first thing to inspect.

Other Possible Causes

1) Retry logic that retries too aggressively

A bad retry policy turns one rate-limited call into several more requests against a provider that is already rejecting you.

// BAD: immediate retry with no backoff (inside an async request handler)
for (let i = 0; i < 5; i++) {
  try {
    return await queryEngine.query({ query: "What is in these docs?" });
  } catch (err) {
    if (i === 4) throw err;
    // retries fire back-to-back while the provider is still rejecting requests
  }
}

Use exponential backoff and jitter:

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

for (let i = 0; i < 5; i++) {
  try {
    return await queryEngine.query({ query: "What is in these docs?" });
  } catch (err: any) {
    if (i === 4) throw err;
    // exponential backoff (250ms, 500ms, 1s, 2s) plus random jitter to avoid retry stampedes
    await sleep(250 * Math.pow(2, i) + Math.random() * 250);
  }
}
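
In practice you usually only want to back off on rate-limit or transient server errors and fail fast on everything else. The exact error shape depends on the provider package and version, so the status and message checks below are assumptions to adapt, not a LlamaIndex API:

const isRateLimitError = (err: any) =>
  err?.status === 429 ||
  err?.cause?.status === 429 ||
  /rate limit/i.test(err?.message ?? "");

for (let i = 0; i < 5; i++) {
  try {
    return await queryEngine.query({ query: "What is in these docs?" });
  } catch (err: any) {
    // Only back off for rate-limit style errors; anything else fails fast.
    if (!isRateLimitError(err) || i === 4) throw err;
    await sleep(250 * Math.pow(2, i) + Math.random() * 250); // reuses sleep() from above
  }
}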

2) Multiple workers sharing the same API key

If you run several Node processes, serverless instances, or background jobs at once, each one may be “safe” individually but exceed the account quota together.

// Example: every worker uses the same key and hits the same provider burst window
const llm = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

Fix by coordinating throughput across workers:

// Put a queue in front of LLM calls.
// Example only: use Redis/BullMQ/SQS depending on your stack.
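
As a rough illustration, here is a minimal sketch of that idea with BullMQ: producers enqueue prompts instead of calling the model directly, and a single rate-limited worker drains the queue. The queue name, Redis connection, and limiter values are assumptions to adapt, not recommendations:

import { Queue, Worker } from "bullmq";
import { OpenAI } from "@llamaindex/openai";

const connection = { host: "localhost", port: 6379 }; // assumed Redis connection
const llm = new OpenAI({ model: "gpt-4o-mini" });

// Producers (web requests, batch jobs) enqueue work instead of calling the LLM directly.
export const llmQueue = new Queue("llm-calls", { connection });

// One worker drains the queue, capped at roughly 10 jobs per second across all producers.
new Worker(
  "llm-calls",
  async (job) => {
    const res = await llm.complete({ prompt: job.data.prompt });
    return res.text;
  },
  { connection, limiter: { max: 10, duration: 1000 } }
);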

At minimum:

  • reduce parallel workers
  • isolate indexing jobs from online traffic
  • separate keys for batch jobs vs user-facing requests (see the sketch below)
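
Splitting clients for the last point is straightforward. The environment variable names here are illustrative, not standard ones; wire them up however your stack manages secrets:

import { OpenAI } from "@llamaindex/openai";

// Dedicated client (and key) for batch indexing jobs...
const batchLlm = new OpenAI({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_BATCH_API_KEY,
});

// ...and a separate client for user-facing queries, so the two workloads
// can be throttled and monitored independently.
const servingLlm = new OpenAI({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_SERVING_API_KEY,
});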

3) Chunking settings create too many downstream calls

Bad chunking can multiply embeddings and retrieval calls.

import { SentenceSplitter } from "llamaindex";

const splitter = new SentenceSplitter({
  chunkSize: 200,
  chunkOverlap: 100,
});

This creates lots of tiny chunks. That means more embedding requests and more retrieval noise.

A more reasonable setup:

import { SentenceSplitter } from "llamaindex";

const splitter = new SentenceSplitter({
  chunkSize: 1024,
  chunkOverlap: 128,
});
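
If you want to sanity-check the difference before re-indexing, count the chunks each setting produces for a representative document. This assumes the splitter exposes a splitText method, which recent LlamaIndex.TS releases do, but verify against your version:

import { SentenceSplitter } from "llamaindex";

const smallChunks = new SentenceSplitter({ chunkSize: 200, chunkOverlap: 100 });
const largerChunks = new SentenceSplitter({ chunkSize: 1024, chunkOverlap: 128 });

const sampleText = "..."; // paste in a representative document from your corpus

// Each chunk becomes at least one embedding request during indexing.
console.log("200/100 chunks:", smallChunks.splitText(sampleText).length);
console.log("1024/128 chunks:", largerChunks.splitText(sampleText).length);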

4) Query-time fan-out across multiple tools or retrievers

If you use multiple retrievers, sub-queries, or agent tools, one user request can trigger several model calls.

// BAD: multiple retrievers queried at once without throttling
const results = await Promise.all([
  retrieverA.retrieve(query),
  retrieverB.retrieve(query),
  retrieverC.retrieve(query),
]);

Throttle those calls or sequence them if latency allows:

const resultsA = await retrieverA.retrieve(query);
const resultsB = await retrieverB.retrieve(query);
const resultsC = await retrieverC.retrieve(query);
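
If full sequencing adds too much latency, a middle ground is the same bounded-concurrency pattern used earlier, so only a couple of retrievers run at once. The limit of 2 below is an arbitrary starting point, not a recommendation:

import pLimit from "p-limit";

const retrieverLimit = pLimit(2); // at most two retrievers in flight per request

const [resultsA, resultsB, resultsC] = await Promise.all([
  retrieverLimit(() => retrieverA.retrieve(query)),
  retrieverLimit(() => retrieverB.retrieve(query)),
  retrieverLimit(() => retrieverC.retrieve(query)),
]);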

How to Debug It

  1. Count how many provider calls happen per user request

    • Log every complete(), chat(), embed(), and retrieval call.
    • If one request triggers dozens of calls, you found the multiplier (a counting sketch follows this list).
  2. Check whether the error happens during indexing or querying

    • Indexing failures usually point to embeddings or batch ingestion.
    • Query failures usually point to chat completions, reranking, or agent tool loops.
  3. Inspect concurrency at runtime

    • Look for Promise.all, parallel job runners, autoscaled workers, or serverless bursts.
    • If traffic spikes correlate with failures, cap concurrency first.
  4. Read the exact upstream error

    • LlamaIndex often wraps provider errors.
    • Search for messages like:
      • RateLimitError
      • 429 Too Many Requests
      • insufficient_quota
      • You exceeded your current quota
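
For step 1, you do not need a tracing framework to get a first signal. A minimal per-request counter wrapped around whichever client calls you make is enough; the helper below is illustrative, not a LlamaIndex API:

// Illustrative helper: wraps an async function and counts its calls per label.
function counted<A extends any[], R>(
  label: string,
  fn: (...args: A) => Promise<R>,
  stats: Record<string, number>
): (...args: A) => Promise<R> {
  return (...args: A) => {
    stats[label] = (stats[label] ?? 0) + 1;
    return fn(...args);
  };
}

// Inside a single request handler:
const stats: Record<string, number> = {};
const complete = counted("llm.complete", (prompt: string) => llm.complete({ prompt }), stats);

const res = await complete("Summarize this document:\n...");
console.log("provider calls for this request:", stats);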

Prevention

  • Use bounded concurrency everywhere you call LLMs or embeddings.
  • Add exponential backoff with jitter for transient provider errors.
  • Separate batch indexing workloads from interactive query traffic.
  • Monitor request rate per model, not just total app traffic.
  • Set alerting on 429 spikes before they become production incidents.

If you want one practical rule: never let a single web request fan out into unbounded LLM calls. That pattern is what turns a normal LlamaIndex integration into a rate-limit incident.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

