# LlamaIndex Tutorial (TypeScript): rate limiting API calls for beginners
This tutorial shows you how to add rate limiting around LlamaIndex API calls in TypeScript so your app stops hammering OpenAI, Anthropic, or any other LLM provider. You need this when you have bursty traffic, background jobs, or multiple users sharing the same API quota.
## What You'll Need

- Node.js 18+ installed
- A TypeScript project set up with `ts-node` or a build step
- `@llamaindex/openai`
- `@llamaindex/core`
- `p-limit`
- An LLM API key, such as `OPENAI_API_KEY`
- Basic familiarity with creating an index and calling a query engine in LlamaIndex
## Step-by-Step
- Install the dependencies and set up your environment. I’m using OpenAI here because the TypeScript packages are straightforward, but the same pattern works for other providers.

```shell
npm install @llamaindex/core @llamaindex/openai p-limit
npm install -D typescript ts-node @types/node
```
- Create a small rate-limited wrapper around your LlamaIndex calls. The important parts are limiting concurrency and adding spacing between requests, which protects you against burst limits.
```typescript
// src/rateLimit.ts
import pLimit from "p-limit";

// Allow at most 2 LlamaIndex calls in flight at any time.
const limit = pLimit(2);

export async function rateLimited<T>(fn: () => Promise<T>): Promise<T> {
  return limit(async () => {
    // A small fixed delay spaces out requests and softens bursts.
    await new Promise((resolve) => setTimeout(resolve, 250));
    return fn();
  });
}
```
- Build a simple index and wrap the query call with the limiter. This example uses `VectorStoreIndex` and `OpenAI`, then sends all queries through `rateLimited()` so only a small number run at once.
```typescript
// src/index.ts
import { Document, VectorStoreIndex } from "@llamaindex/core";
import { OpenAI } from "@llamaindex/openai";
import { rateLimited } from "./rateLimit";

async function main() {
  const llm = new OpenAI({
    model: "gpt-4o-mini",
    apiKey: process.env.OPENAI_API_KEY,
  });

  const docs = [
    new Document({ text: "LlamaIndex helps connect data to LLMs." }),
    new Document({ text: "Rate limiting protects API quotas and reduces throttling." }),
  ];

  const index = await VectorStoreIndex.fromDocuments(docs, { llm });
  const queryEngine = index.asQueryEngine();

  // Every query goes through the limiter from the previous step.
  const result = await rateLimited(() => queryEngine.query("Why use rate limiting?"));
  console.log(result.toString());
}

main();
```
- If you have multiple requests, queue them instead of firing them all at once. This is the part beginners usually miss: even if each request is valid, sending ten in parallel can still trigger provider limits.
```typescript
import { Document, VectorStoreIndex } from "@llamaindex/core";
import { OpenAI } from "@llamaindex/openai";
import { rateLimited } from "./rateLimit";

async function main() {
  const llm = new OpenAI({ model: "gpt-4o-mini", apiKey: process.env.OPENAI_API_KEY });

  const index = await VectorStoreIndex.fromDocuments(
    [new Document({ text: "A short knowledge base for testing." })],
    { llm }
  );
  const queryEngine = index.asQueryEngine();

  const questions = ["What is this?", "Why limit calls?", "How does it help?"];

  // Promise.all fires all three at once, but the limiter inside
  // rateLimited() ensures only two queries actually run at a time.
  const answers = await Promise.all(
    questions.map((q) => rateLimited(() => queryEngine.query(q)))
  );

  answers.forEach((answer) => console.log(answer.toString()));
}

main();
```
- Add basic retry handling for real-world API throttling. Rate limiting reduces pressure, but you still want to catch `429` responses and retry with backoff when the provider rejects a request anyway.
```typescript
// src/retry.ts
export async function withRetry<T>(
  fn: () => Promise<T>,
  retries = 3
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // For simplicity this retries on any error; in production, inspect the
      // error and only retry throttling responses such as HTTP 429.
      lastError = error;
      if (attempt < retries) {
        // Linear backoff: wait longer after each failed attempt.
        await new Promise((resolve) => setTimeout(resolve, 500 * (attempt + 1)));
      }
    }
  }
  throw lastError;
}
```
- Combine both patterns in production code. Use the limiter to shape traffic and retries to recover from transient throttling; that gives you predictable behavior under load without rewriting your LlamaIndex logic.
```typescript
import { rateLimited } from "./rateLimit";
import { withRetry } from "./retry";

// Retries run inside the concurrency slot, so backoff time also
// counts as quiet time toward the provider.
async function guardedQuery<T>(fn: () => Promise<T>): Promise<T> {
  return rateLimited(() => withRetry(fn));
}
```
## Testing It

Run the script with your API key set in the environment:

```shell
OPENAY_API_KEY=your_key_here npx ts-node src/index.ts
To verify it works, send several queries in parallel and watch the requests complete without spiking concurrency. If you lower the `p-limit` value to 1, you should see requests serialize instead of running together. If your provider's logs or dashboard show fewer throttled responses, the limiter is doing its job.
## Next Steps

- Add per-user or per-tenant quotas using Redis instead of in-memory limits
- Move from fixed delays to token-bucket or leaky-bucket rate limiting
- Wrap tool calls and retrieval calls too, not just final LLM queries
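As a starting point for the token-bucket idea, here is a minimal in-memory sketch. The class name, capacity, and refill rate are all illustrative choices, and this version simply rejects when the bucket is empty (a production version would queue or delay instead):

```typescript
// Minimal token bucket: refills at `refillPerSecond`, holds at most
// `capacity` tokens. Each request consumes one token if available.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity;
  }

  tryRemove(): boolean {
    // Refill based on how much time has passed since the last call.
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = now;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Allow bursts of up to 5 requests, refilling 2 tokens per second.
const bucket = new TokenBucket(5, 2);
const results = Array.from({ length: 7 }, () => bucket.tryRemove());
// Typically: five `true` (the burst), then `false` once the bucket drains.
console.log(results);
```

Compared to the fixed 250 ms delay used earlier, a token bucket lets short bursts through at full speed while still enforcing a sustainable average rate.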
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.