LlamaIndex Tutorial (TypeScript): rate limiting API calls for advanced developers
This tutorial shows you how to wrap LlamaIndex TypeScript API calls with a real rate limiter so your agent stops hammering OpenAI, Anthropic, or any other upstream service. You need this when you’re building multi-turn agents, batch processors, or retrieval pipelines that can easily trip provider limits and start returning 429s.
What You'll Need
- Node.js 18+ and npm
- A TypeScript project with `ts-node` or a build step
- `llamaindex` installed
- `bottleneck` installed for rate limiting
- An API key for your model provider, for example `OPENAI_API_KEY`
- Basic familiarity with:
  - LlamaIndex `Settings`
  - `OpenAIEmbedding`
  - The `OpenAI` LLM class
  - async/await in TypeScript
Install the packages:
```bash
npm install llamaindex bottleneck
npm install -D typescript ts-node @types/node
```
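Also make sure the key is actually visible to the process before you run anything. A minimal guard like the one below fails fast instead of letting the first API call error out (the `env.ts` file name is just a suggestion):

```ts
// env.ts: fail fast if the provider key is missing.
if (!process.env.OPENAI_API_KEY) {
  throw new Error(
    "OPENAI_API_KEY is not set. Export it before running the examples.",
  );
}
```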
Step-by-Step
- Start by creating a limiter that controls both concurrency and request spacing. For production systems, this is better than a naive `setTimeout` because it gives you queueing, retries, and burst control.

```ts
// rateLimiter.ts
import Bottleneck from "bottleneck";

// Allow at most 2 requests in flight, spaced at least 250 ms apart.
export const apiLimiter = new Bottleneck({
  maxConcurrent: 2,
  minTime: 250,
});

// Run any async function through the shared queue.
export async function withRateLimit<T>(fn: () => Promise<T>): Promise<T> {
  return apiLimiter.schedule(fn);
}
```
- Configure LlamaIndex to use the limiter through wrapper classes. The key idea is to keep your LlamaIndex code unchanged while forcing every model call through the same queue.

```ts
// Run this module once at startup so every later LlamaIndex call uses the wrappers.
import { Settings, OpenAI, OpenAIEmbedding } from "llamaindex";
import { withRateLimit } from "./rateLimiter";

// LLM wrapper: every chat and completion call waits its turn in the limiter.
class RateLimitedOpenAI extends OpenAI {
  async chat(params: Parameters<OpenAI["chat"]>[0]) {
    return withRateLimit(() => super.chat(params));
  }

  async complete(params: Parameters<OpenAI["complete"]>[0]) {
    return withRateLimit(() => super.complete(params));
  }
}

// Embedding wrapper: single and batched embedding calls share the same queue.
class RateLimitedOpenAIEmbedding extends OpenAIEmbedding {
  async getTextEmbedding(text: string) {
    return withRateLimit(() => super.getTextEmbedding(text));
  }

  async getTextEmbeddings(texts: string[]) {
    return withRateLimit(() => super.getTextEmbeddings(texts));
  }
}

Settings.llm = new RateLimitedOpenAI({
  model: "gpt-4o-mini",
});

Settings.embedModel = new RateLimitedOpenAIEmbedding({
  model: "text-embedding-3-small",
});
```
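Before building an index, it is worth a quick smoke test to confirm the wrappers are the ones receiving traffic. The snippet below is a sketch: it assumes the configuration above has already run, and it uses the standard LlamaIndex.TS `complete({ prompt })` and `getTextEmbedding` call shapes, so adjust if your version differs.

```ts
// Quick check that both model calls flow through the rate-limited wrappers.
import { Settings } from "llamaindex";

async function smokeTest() {
  const completion = await Settings.llm.complete({
    prompt: "Say hello in five words.",
  });
  console.log(completion.text);

  const embedding = await Settings.embedModel.getTextEmbedding("rate limiting");
  console.log(`embedding length: ${embedding.length}`);
}

smokeTest().catch(console.error);
```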
- If you’re running multiple requests in parallel, push them through the limiter instead of calling them directly. This pattern matters when you ingest documents, generate summaries, or run tool-heavy agents that fan out work (see the fan-out sketch after this example).

```ts
import { Document, VectorStoreIndex } from "llamaindex";
// Import the module from the previous step first so Settings.llm and
// Settings.embedModel already point at the rate-limited wrappers.

async function main() {
  const docs = [
    new Document({ text: "LlamaIndex helps build RAG systems." }),
    new Document({ text: "Rate limiting protects upstream APIs." }),
    new Document({ text: "TypeScript gives strong typing for agents." }),
  ];

  // Ingestion embeds every document through the wrapped embedding model.
  const index = await VectorStoreIndex.fromDocuments(docs);

  const queryEngine = index.asQueryEngine();
  const response = await queryEngine.query({
    query: "Why is rate limiting important?",
  });

  console.log(response.toString());
}

main().catch(console.error);
```
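Here is the fan-out variant mentioned above. It is a sketch that reuses the `index` built in `main()`, with a made-up question list: your code issues the queries in parallel, but every underlying chat and embedding request still flows through `apiLimiter` because of the wrappers registered on `Settings`.

```ts
import { VectorStoreIndex } from "llamaindex";

async function fanOut(index: VectorStoreIndex) {
  const queryEngine = index.asQueryEngine();
  const questions = [
    "Why is rate limiting important?",
    "What does minTime control in Bottleneck?",
    "How are documents embedded during ingestion?",
  ];

  // Fire all queries at once; the wrapped model calls they trigger are still
  // spaced out and capped by apiLimiter.
  const answers = await Promise.all(
    questions.map((question) => queryEngine.query({ query: question })),
  );

  answers.forEach((answer, i) => {
    console.log(`${questions[i]} -> ${answer.toString()}`);
  });
}
```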
- Add retry handling for real-world throttling. Rate limiting reduces pressure, but you still want backoff for occasional provider-side bursts or shared-tenant limits.

```ts
// Retry with exponential backoff: 500 ms, 1 s, 2 s between attempts.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  attempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      const delayMs = Math.pow(2, i) * 500;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```
- Combine retry logic with the limiter when calling LlamaIndex operations that may trigger multiple upstream requests. This gives you controlled throughput plus resilience under load.

```ts
import { QueryEngine } from "llamaindex";

async function safeQuery(queryEngine: QueryEngine, question: string) {
  // Note: the LLM and embedding calls made inside query() already go through
  // apiLimiter via the wrappers from step 2. Nesting the outer job on the same
  // limiter is fine for sequential use, but if you fan out many safeQuery calls
  // at once, give queries their own limiter so the outer jobs don't hold all
  // maxConcurrent slots while their inner calls wait.
  return retryWithBackoff(() =>
    withRateLimit(() =>
      queryEngine.query({
        query: question,
      }),
    ),
  );
}

async function run() {
  // assume an existing queryEngine from your index setup, e.g.:
  // const answer = await safeQuery(queryEngine, "Why is rate limiting important?");
  // console.log(answer.toString());
}
```
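As a usage sketch, you can run a batch of questions through `safeQuery` sequentially; it reuses the `QueryEngine` type and `safeQuery` from the block above, and keeping the batch sequential avoids the nested-limiter contention flagged in the comment there.

```ts
async function runBatch(queryEngine: QueryEngine, questions: string[]) {
  for (const question of questions) {
    // Each call gets retry + rate limiting; sequential iteration keeps the
    // limiter's slots free for the underlying model calls.
    const answer = await safeQuery(queryEngine, question);
    console.log(`${question} -> ${answer.toString()}`);
  }
}
```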
Testing It
Run several queries at once and watch the limiter queue them instead of firing all requests immediately. If you log timestamps around each call, you should see roughly one request every `minTime` milliseconds per limiter instance, with only `maxConcurrent` active at once.
A good test is to lower `minTime` to something obvious like 1000, then fire five parallel queries and confirm the total runtime increases predictably. If your provider starts returning fewer 429 errors under load, the setup is doing its job.
For deeper verification, temporarily wrap the scheduled function with logging so you can see when jobs enter and leave the queue. That gives you a clear signal that LlamaIndex calls are being serialized through Bottleneck instead of bypassing it.
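One way to do that logging, assuming the `withRateLimit` helper from earlier: a temporary, instrumented variant (the `label` argument is new and purely illustrative) that reports queue wait time and run time for each job.

```ts
import { apiLimiter } from "./rateLimiter";

// Drop-in replacement for withRateLimit while debugging.
export async function withRateLimitLogged<T>(
  fn: () => Promise<T>,
  label = "job",
): Promise<T> {
  const queuedAt = Date.now();
  return apiLimiter.schedule(async () => {
    const startedAt = Date.now();
    console.log(`${label}: waited ${startedAt - queuedAt} ms in queue`);
    try {
      return await fn();
    } finally {
      console.log(`${label}: ran for ${Date.now() - startedAt} ms`);
    }
  });
}
```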
Next Steps
- Add per-route limiters so embeddings, chat completions, and tool calls each have their own budget (see the sketch after this list).
- Persist queue metrics to Prometheus or OpenTelemetry so you can alert on backlog growth.
- Extend this pattern to streaming responses and background ingestion workers.
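A minimal sketch of the per-route idea from the first bullet, assuming you split the single `apiLimiter` into one Bottleneck instance per call type (the limits shown are placeholders):

```ts
import Bottleneck from "bottleneck";

// Separate budgets: chat completions are the expensive calls; embeddings are
// cheaper and can tolerate more concurrency.
export const chatLimiter = new Bottleneck({ maxConcurrent: 2, minTime: 500 });
export const embeddingLimiter = new Bottleneck({ maxConcurrent: 4, minTime: 100 });

export const withChatLimit = <T>(fn: () => Promise<T>) => chatLimiter.schedule(fn);
export const withEmbeddingLimit = <T>(fn: () => Promise<T>) => embeddingLimiter.schedule(fn);
```

The wrapper classes from step 2 would then call `withChatLimit` in `RateLimitedOpenAI` and `withEmbeddingLimit` in `RateLimitedOpenAIEmbedding` instead of the shared `withRateLimit`.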
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit