LlamaIndex Tutorial (TypeScript): rate limiting API calls for intermediate developers
This tutorial shows how to put a hard rate limit around LlamaIndex API calls in TypeScript, so your agent does not exceed provider quotas or trigger burst-related failures. You need this when multiple requests hit the same index from a web app, queue worker, or multi-user assistant.
What You'll Need
- Node.js 18+
- A TypeScript project with ts-node or a build step
- llamaindex installed
- An OpenAI API key in OPENAI_API_KEY
- Optional: a Redis instance if you want distributed rate limiting later
- Basic familiarity with VectorStoreIndex, QueryEngine, and async/await
Install the packages:
npm install llamaindex bottleneck dotenv
npm install -D typescript ts-node @types/node
Step-by-Step
- Start by creating a small LlamaIndex setup that can answer queries from local text. The important part is not the index itself, but the fact that every .query() call will go through the limiter you add next.
import "dotenv/config";
import { Document, VectorStoreIndex } from "llamaindex";

// No models are configured yet, so llamaindex falls back to its default
// OpenAI models and reads OPENAI_API_KEY; a later step sets them explicitly.
async function main() {
  const docs = [
    new Document({ text: "Topiax handles insurance claims workflows." }),
    new Document({ text: "LlamaIndex can wrap retrieval and response generation." }),
  ];

  const index = await VectorStoreIndex.fromDocuments(docs);
  const queryEngine = index.asQueryEngine();

  const response = await queryEngine.query({
    query: "What does Topiax handle?",
  });

  console.log(response.toString());
}

main();
- Add a limiter with bottleneck. This gives you a clean place to control concurrency and request rate without changing your LlamaIndex code everywhere.
// limiter.ts
import Bottleneck from "bottleneck";

// At most one request in flight, and at least 1200 ms between request starts.
export const limiter = new Bottleneck({
  maxConcurrent: 1,
  minTime: 1200,
});

export async function limited<T>(fn: () => Promise<T>): Promise<T> {
  return limiter.schedule(fn);
}
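If you would rather derive minTime from a provider quota than hard-code 1200 ms, you can compute it from a requests-per-minute budget. A minimal sketch; the 50-requests-per-minute figure is an assumption to replace with your own plan's limit:

import Bottleneck from "bottleneck";

// Assumed quota for illustration; use your plan's real requests-per-minute limit.
const REQUESTS_PER_MINUTE = 50;

export const limiter = new Bottleneck({
  maxConcurrent: 1,
  // Spread 60,000 ms evenly across the allowed requests per minute.
  minTime: Math.ceil(60_000 / REQUESTS_PER_MINUTE),
});

At 50 requests per minute this works out to the same 1200 ms spacing used above.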
- Wrap every LlamaIndex call that can hit an external model. In practice, that means query(), chat(), and any custom retrieval pipeline you expose to your app.
import "dotenv/config";
import { Document, VectorStoreIndex } from "llamaindex";
import { limited } from "./limiter";

async function main() {
  const docs = [
    new Document({ text: "Policy servicing includes endorsements and renewals." }),
    new Document({ text: "Claims triage routes cases based on severity." }),
  ];

  const index = await VectorStoreIndex.fromDocuments(docs);
  const queryEngine = index.asQueryEngine();

  // Every call that can reach the model goes through the limiter.
  const response = await limited(() =>
    queryEngine.query({ query: "What is claims triage?" })
  );

  console.log(response.toString());
}

main();
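The same wrapper works for chat-style calls. A minimal sketch, assuming a chat engine created from the index above; the exact chat() parameter shape can vary between llamaindex versions, so check the one you have installed:

import { VectorStoreIndex } from "llamaindex";
import { limited } from "./limiter";

// Wrap a chat call the same way as query(); `index` is the index built above.
export async function askChat(index: VectorStoreIndex, message: string) {
  const chatEngine = index.asChatEngine();
  const response = await limited(() => chatEngine.chat({ message }));
  return response.toString();
}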
- If you want to protect the underlying LLM directly, set the model once in Settings and keep the limiter around your higher-level calls. This is useful when multiple services share the same model settings and you want one consistent throttle point.
import "dotenv/config";
import { OpenAI, OpenAIEmbedding, Settings } from "llamaindex";

// Configure the chat model and embedding model once, for every service
// that imports this module.
Settings.llm = new OpenAI({
  model: "gpt-4o-mini",
});

Settings.embedModel = new OpenAIEmbedding({
  model: "text-embedding-3-small",
});
- For bursty traffic, add retry behavior only after rate limiting is in place. Rate limiting prevents overload; retries handle transient failures like HTTP 429s or network hiccups.
// retry.ts
import { limited } from "./limiter";

// Rate-limit first, then retry with a simple linear backoff (500 ms, 1000 ms, ...).
export async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await limited(fn);
    } catch (error) {
      lastError = error;
      await new Promise((r) => setTimeout(r, 500 * (i + 1)));
    }
  }
  throw lastError;
}
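If you want to avoid retrying errors that will never succeed, such as bad requests or auth failures, you can gate the retry on the error shape. The status property below is an assumption about how your HTTP or OpenAI client surfaces errors; adapt the check to what you actually see in your logs:

import { limited } from "./limiter";

// Treat 429 (rate limited) and 5xx responses as transient and worth retrying.
function isRetryable(error: unknown): boolean {
  const status = (error as { status?: number })?.status;
  return status === 429 || (typeof status === "number" && status >= 500);
}

export async function withSelectiveRetry<T>(
  fn: () => Promise<T>,
  attempts = 3
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await limited(fn);
    } catch (error) {
      if (!isRetryable(error)) throw error; // fail fast on permanent errors
      lastError = error;
      await new Promise((r) => setTimeout(r, 500 * (i + 1)));
    }
  }
  throw lastError;
}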
- Use the wrapper everywhere your app calls LlamaIndex. Keep it boring and centralized so future changes like per-user quotas or Redis-backed distributed limits are easy to drop in.
import "dotenv/config";
import { Document, VectorStoreIndex } from "llamaindex";
import { withRetry } from "./retry";

async function main() {
  const index = await VectorStoreIndex.fromDocuments([
    new Document({ text: "Fraud review requires manual escalation for high-risk cases." }),
    new Document({ text: "Customer service bots should stay within provider limits." }),
  ]);

  const engine = index.asQueryEngine();

  const result = await withRetry(() =>
    engine.query({ query: "When should fraud review escalate?" })
  );

  console.log(result.toString());
}

main();
Testing It
Run the script several times in parallel and watch the spacing between requests. With minTime: 1200, you should see calls serialized instead of firing in a burst.
Even on a low provider quota or a slow network connection, the app should fail less often, because requests are paced before they reach OpenAI. If you still see throttling errors, your limiter is probably too permissive for your actual request volume and token usage.
A good test is to log a timestamp when each scheduled call starts and when it finishes. You should see each request start roughly 1.2 seconds apart (matching minTime), even when multiple promises are created at once.
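A minimal test script along those lines, assuming the limiter.ts and retry.ts files from the steps above; the document text and query are placeholders:

import "dotenv/config";
import { Document, VectorStoreIndex } from "llamaindex";
import { withRetry } from "./retry";

async function main() {
  const index = await VectorStoreIndex.fromDocuments([
    new Document({ text: "Claims triage routes cases based on severity." }),
  ]);
  const engine = index.asQueryEngine();

  // Queue three queries at once; the limiter should space out their starts.
  await Promise.all(
    [1, 2, 3].map((n) =>
      withRetry(() => {
        console.log(`request ${n} started at ${new Date().toISOString()}`);
        return engine.query({ query: "What does claims triage do?" });
      }).then(() =>
        console.log(`request ${n} finished at ${new Date().toISOString()}`)
      )
    )
  );
}

main();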
Next Steps
- Add per-user rate limits by creating one Bottleneck instance per tenant or account ID (see the sketch after this list)
- Move the limiter behind a shared Redis store so multiple Node processes respect the same quota
- Add token-based budgeting so large queries count more than small ones
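For the per-user idea, Bottleneck's Group gives you one limiter per key without managing a map yourself. A minimal sketch, carrying over the 1200 ms spacing as an assumption to tune per tenant:

import Bottleneck from "bottleneck";

// One limiter per tenant/account ID, created lazily on first use.
const perTenant = new Bottleneck.Group({
  maxConcurrent: 1,
  minTime: 1200,
});

export function limitedForTenant<T>(
  tenantId: string,
  fn: () => Promise<T>
): Promise<T> {
  return perTenant.key(tenantId).schedule(fn);
}

Bottleneck also supports a Redis-backed datastore, which is one way to approach the shared-quota item above.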
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.