# LlamaIndex Tutorial (TypeScript): rate limiting API calls for beginners
This tutorial shows you how to add rate limiting around LlamaIndex API calls in TypeScript so your app stops hammering OpenAI, Anthropic, or any other LLM provider. You need this when you have bursty traffic, background jobs, or multiple users sharing the same API quota.
## What You'll Need

- Node.js 18+ installed
- A TypeScript project set up with `ts-node` or a build step
- `@llamaindex/openai`
- `@llamaindex/core`
- `p-limit`
- An LLM API key, such as `OPENAI_API_KEY`
- Basic familiarity with creating an index and calling a query engine in LlamaIndex
## Step-by-Step
- Install the dependencies and set up your environment. I’m using OpenAI here because the TypeScript packages are straightforward, but the same pattern works for other providers.

```shell
npm install @llamaindex/core @llamaindex/openai p-limit
npm install -D typescript ts-node @types/node
```
- Create a small rate-limited wrapper around your LlamaIndex calls. The important parts are limiting concurrency and adding spacing between requests, which protects you against burst limits.
```typescript
// src/rateLimit.ts
import pLimit from "p-limit";

// Allow at most 2 LlamaIndex calls in flight at any time.
const limit = pLimit(2);

export async function rateLimited<T>(fn: () => Promise<T>): Promise<T> {
  return limit(async () => {
    // A small fixed delay spaces out requests and softens bursts.
    await new Promise((resolve) => setTimeout(resolve, 250));
    return fn();
  });
}
```
- Build a simple index and wrap the query call with the limiter. This example uses `VectorStoreIndex` and `OpenAI`, then sends all queries through `rateLimited()` so only a small number run at once.
```typescript
// src/index.ts
import { Document, VectorStoreIndex } from "@llamaindex/core";
import { OpenAI } from "@llamaindex/openai";
import { rateLimited } from "./rateLimit";

async function main() {
  const llm = new OpenAI({
    model: "gpt-4o-mini",
    apiKey: process.env.OPENAI_API_KEY,
  });

  const docs = [
    new Document({ text: "LlamaIndex helps connect data to LLMs." }),
    new Document({ text: "Rate limiting protects API quotas and reduces throttling." }),
  ];

  const index = await VectorStoreIndex.fromDocuments(docs, { llm });
  const queryEngine = index.asQueryEngine();

  // Every query goes through the limiter from the previous step.
  const result = await rateLimited(() => queryEngine.query("Why use rate limiting?"));
  console.log(result.toString());
}

main();
```
- If you have multiple requests, queue them instead of firing them all at once. This is the part beginners usually miss: even if each request is valid, sending ten in parallel can still trigger provider limits.
```typescript
import { Document, VectorStoreIndex } from "@llamaindex/core";
import { OpenAI } from "@llamaindex/openai";
import { rateLimited } from "./rateLimit";

async function main() {
  const llm = new OpenAI({ model: "gpt-4o-mini", apiKey: process.env.OPENAI_API_KEY });

  const index = await VectorStoreIndex.fromDocuments(
    [new Document({ text: "A short knowledge base for testing." })],
    { llm }
  );
  const queryEngine = index.asQueryEngine();

  const questions = ["What is this?", "Why limit calls?", "How does it help?"];

  // Promise.all fires all three at once, but the limiter inside
  // rateLimited() ensures only two queries actually run at a time.
  const answers = await Promise.all(
    questions.map((q) => rateLimited(() => queryEngine.query(q)))
  );

  answers.forEach((answer) => console.log(answer.toString()));
}

main();
```
- Add basic retry handling for real-world API throttling. Rate limiting reduces pressure, but you still want to catch `429` responses and retry with backoff when the provider rejects a request anyway.
```typescript
// src/retry.ts
export async function withRetry<T>(
  fn: () => Promise<T>,
  retries = 3
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // For simplicity this retries on any error; in production, inspect the
      // error and only retry throttling responses such as HTTP 429.
      lastError = error;
      if (attempt < retries) {
        // Linear backoff: wait longer after each failed attempt.
        await new Promise((resolve) => setTimeout(resolve, 500 * (attempt + 1)));
      }
    }
  }
  throw lastError;
}
```
- Combine both patterns in production code. Use the limiter to shape traffic and retries to recover from transient throttling; that gives you predictable behavior under load without rewriting your LlamaIndex logic.
```typescript
import { rateLimited } from "./rateLimit";
import { withRetry } from "./retry";

// Retries run inside the concurrency slot, so backoff time also
// counts as quiet time toward the provider.
async function guardedQuery<T>(fn: () => Promise<T>): Promise<T> {
  return rateLimited(() => withRetry(fn));
}
```
## Testing It

Run the script with your API key set in the environment:

```shell
OPENAY_API_KEY=your_key_here npx ts-node src/index.ts
To verify it works, send several queries in parallel and watch the requests complete without spiking concurrency. If you lower the `p-limit` value to 1, you should see requests serialize instead of running together. If your provider's logs or dashboard show fewer throttled responses, the limiter is doing its job.
## Next Steps

- Add per-user or per-tenant quotas using Redis instead of in-memory limits
- Move from fixed delays to token-bucket or leaky-bucket rate limiting
- Wrap tool calls and retrieval calls too, not just final LLM queries
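As a starting point for the token-bucket idea, here is a minimal in-memory sketch. The class name, capacity, and refill rate are all illustrative choices, and this version simply rejects when the bucket is empty (a production version would queue or delay instead):

```typescript
// Minimal token bucket: refills at `refillPerSecond`, holds at most
// `capacity` tokens. Each request consumes one token if available.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity;
  }

  tryRemove(): boolean {
    // Refill based on how much time has passed since the last call.
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = now;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Allow bursts of up to 5 requests, refilling 2 tokens per second.
const bucket = new TokenBucket(5, 2);
const results = Array.from({ length: 7 }, () => bucket.tryRemove());
// Typically: five `true` (the burst), then `false` once the bucket drains.
console.log(results);
```

Compared to the fixed 250 ms delay used earlier, a token bucket lets short bursts through at full speed while still enforcing a sustainable average rate.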
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.