How to Fix 'rate limit exceeded during development' in LlamaIndex (TypeScript)

By Cyprian Aarons · Updated 2026-04-21

When you see rate limit exceeded during development in a LlamaIndex TypeScript app, it usually means your code is firing too many OpenAI requests in a short window. In practice, this shows up during local testing, hot reloads, agent loops, or when you accidentally rebuild the index on every request.

The error usually surfaces as an upstream API failure from the model provider, not a LlamaIndex-specific bug. In logs, you’ll typically see 429 Too Many Requests, RateLimitError, or a provider message such as Rate limit reached for gpt-4o-mini. A message like You exceeded your current quota is related but different: it usually points at a billing or usage cap rather than raw request rate.
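
Because the failure comes from the provider, the cleanest check is at runtime, in your catch block. Here is a minimal sketch, assuming the thrown error exposes an HTTP status or a message string; isRateLimitError is an illustrative helper, not a LlamaIndex export:

// Illustrative helper: error shapes differ across provider SDK versions,
// so check both the HTTP status and the message text defensively.
function isRateLimitError(e: unknown): boolean {
  const err = e as { status?: number; message?: string };
  return (
    err?.status === 429 ||
    /rate limit|exceeded your current quota/i.test(err?.message ?? "")
  );
}

Logging whenever this returns true tells you which code path is generating the traffic.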

The Most Common Cause

The #1 cause is rebuilding embeddings or querying the model inside a request handler, render cycle, or loop that runs more often than you think.

With LlamaIndex TS, these patterns are expensive:

  • creating a new OpenAI client repeatedly
  • calling VectorStoreIndex.fromDocuments(...) on every request
  • re-embedding the same documents during development
  • triggering multiple calls because of hot reload

Wrong pattern vs right pattern

Broken code                              Fixed code
Rebuilds the index on every request      Builds once and reuses the index
Re-embeds documents repeatedly           Persists or memoizes the index
Creates fresh LLM instances each time    Reuses one configured client
// WRONG: expensive work happens on every request
import { Document, OpenAI, Settings, VectorStoreIndex } from "llamaindex";

export async function POST(req: Request) {
  const body = await req.json();

  // A fresh client is created and registered on every request
  Settings.llm = new OpenAI({
    model: "gpt-4o-mini",
    apiKey: process.env.OPENAI_API_KEY,
  });

  // The same documents are re-embedded on every request
  const docs = [
    new Document({ text: "Policy A..." }),
    new Document({ text: "Policy B..." }),
  ];

  const index = await VectorStoreIndex.fromDocuments(docs);

  const queryEngine = index.asQueryEngine();
  const response = await queryEngine.query({
    query: body.question,
  });

  return Response.json({ answer: response.toString() });
}

// RIGHT: initialize once and reuse
import { Document, OpenAI, Settings, VectorStoreIndex } from "llamaindex";

// One configured client for the whole process
Settings.llm = new OpenAI({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_API_KEY,
});

let indexPromise: Promise<VectorStoreIndex> | null = null;

// Memoize the index so documents are embedded exactly once
async function getIndex() {
  if (!indexPromise) {
    const docs = [
      new Document({ text: "Policy A..." }),
      new Document({ text: "Policy B..." }),
    ];

    indexPromise = VectorStoreIndex.fromDocuments(docs);
  }
  return indexPromise;
}

export async function POST(req: Request) {
  const body = await req.json();

  const index = await getIndex();
  const queryEngine = index.asQueryEngine();
  const response = await queryEngine.query({
    query: body.question,
  });

  return Response.json({ answer: response.toString() });
}

If you are using VectorStoreIndex.fromDocuments() inside an API route, server action, or React component effect, fix that first. That pattern is responsible for most “rate limit exceeded” reports during development.

Other Possible Causes

1. Your retry logic is multiplying requests

If you wrapped calls in retries without backoff, a single failure immediately triggers four more back-to-back requests.

// Bad: aggressive retry loop with no delay
for (let i = 0; i < 5; i++) {
  try {
    return await queryEngine.query({ query });
  } catch (e) {
    // Failure is swallowed and the next attempt fires immediately
  }
}

Use exponential backoff and cap retries.

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// Retry with exponential backoff: wait 500ms, then 1s, then give up
async function queryWithBackoff(query: string) {
  for (let i = 0; i < 3; i++) {
    try {
      return await queryEngine.query({ query });
    } catch (e) {
      if (i === 2) throw e;
      await sleep(500 * Math.pow(2, i));
    }
  }
}

2. Hot reload is re-running initialization code

In Next.js or Vite dev mode, module reloads can recreate indexes and clients.

import { OpenAI } from "llamaindex";

// Better than putting this inside a component or handler:
// stash the client on globalThis so dev-server reloads reuse it
const globalForLlamaIndex = globalThis as typeof globalThis & {
  llamaIndexClient?: OpenAI;
};

export const llm =
  globalForLlamaIndex.llamaIndexClient ??
  new OpenAI({ model: "gpt-4o-mini", apiKey: process.env.OPENAI_API_KEY });

if (!globalForLlamaIndex.llamaIndexClient) {
  globalForLlamaIndex.llamaIndexClient = llm;
}
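
The same trick protects the index itself. A sketch, assuming VectorStoreIndex is imported as before and buildIndex() stands in for your one-time VectorStoreIndex.fromDocuments(...) setup:

// Cache the in-flight index promise on globalThis so a hot reload
// doesn't kick off a second round of embedding calls.
const globalForIndex = globalThis as typeof globalThis & {
  llamaIndexPromise?: Promise<VectorStoreIndex>;
};

export function getIndex(): Promise<VectorStoreIndex> {
  if (!globalForIndex.llamaIndexPromise) {
    globalForIndex.llamaIndexPromise = buildIndex(); // your one-time setup
  }
  return globalForIndex.llamaIndexPromise;
}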

3. You are chunking too aggressively

Chunking too aggressively means more embedding calls and more tokens processed per document.

// Aggressive: tiny chunks multiply the number of embedding calls
const splitterConfig = {
  chunkSize: 200,
  chunkOverlap: 20,
};

If your docs are internal policy PDFs or long claims notes, start with larger chunks like 800-1200 and tune from there.
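
Where you set this depends on your llamaindex version. Recent releases expose global chunking knobs on the Settings object; the sketch below assumes those property names, so confirm them against your installed version (older releases configured this through a ServiceContext):

// Sketch: set chunking globally before building the index.
// Settings.chunkSize / Settings.chunkOverlap are assumed from recent
// llamaindex releases; older versions configured this elsewhere.
import { Settings } from "llamaindex";

Settings.chunkSize = 1024;   // larger chunks => fewer embedding calls
Settings.chunkOverlap = 100; // keep some overlap for retrieval quality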

4. You are running parallel queries without throttling

A simple .map(async ...) over many documents can spike requests fast.

// Bad: launches everything at once
await Promise.all(queries.map((q) => queryEngine.query({ query: q })));

Throttle concurrency.

// Simplest fix: run the queries one at a time
for (const q of queries) {
  await queryEngine.query({ query: q });
}

If you need parallelism, use a concurrency limiter instead of raw Promise.all.
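
A hand-rolled limiter takes only a few lines; this sketch caps the number of in-flight queries without external dependencies (libraries like p-limit do the same job):

// Minimal concurrency limiter: at most `limit` tasks in flight at once.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed item
  async function worker() {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}

// Usage: at most 2 concurrent queries instead of all at once
const answers = await mapWithLimit(queries, 2, (q) =>
  queryEngine.query({ query: q }),
);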

How to Debug It

  1. Check whether the error happens on indexing or querying.

    • If it happens during startup, look at VectorStoreIndex.fromDocuments().
    • If it happens when a user query comes in, inspect your queryEngine.query() path.
  2. Log how many times initialization runs.

    • Add a counter around your LlamaIndex setup (see the sketch after this list).
    • If it increments on every request or refresh, you found the issue.
  3. Inspect the exact upstream error.

    • Look for 429 Too Many Requests
    • Look for RateLimitError
    • Look for provider text like You exceeded your current quota
    • Look for repeated calls in network logs
  4. Temporarily disable retries and parallelism.

    • Remove any retry wrappers.
    • Replace Promise.all(...) with sequential execution.
    • If the error disappears, your code was amplifying traffic.
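
A throwaway counter is all step 2 needs; a sketch with illustrative names:

// Illustrative: count how often setup actually runs. In a healthy dev
// setup this logs once per process, not once per request or refresh.
let initCount = 0;

function logInit(label: string) {
  initCount++;
  console.log(`[llamaindex] ${label} init #${initCount}`);
}

// Call logInit("client") right before new OpenAI(...) and
// logInit("index") right before VectorStoreIndex.fromDocuments(...).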

Prevention

  • Initialize LLM clients and indexes once per process, not per request.
  • Persist indexes to disk or a vector store instead of rebuilding them in dev loops (see the sketch below).
  • Add throttling and bounded retries with exponential backoff for all model calls.
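
For local development, the simplest persistence target is a directory on disk. A sketch using storageContextFromDefaults, assuming a recent llamaindex release and the docs array from earlier (verify the exact API against your installed version):

// Sketch: persist the index under ./storage so dev restarts reuse
// existing embeddings. storageContextFromDefaults and the
// storageContext option are assumed from recent llamaindex releases.
import { storageContextFromDefaults, VectorStoreIndex } from "llamaindex";

const storageContext = await storageContextFromDefaults({
  persistDir: "./storage",
});

const index = await VectorStoreIndex.fromDocuments(docs, {
  storageContext,
});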

If you want one rule to remember: don’t treat embeddings and LLM calls like cheap local function calls. In LlamaIndex TypeScript, they are networked operations with real rate limits behind them.

