LlamaIndex Tutorial (Python): rate limiting API calls for intermediate developers
This tutorial shows you how to add rate limiting around LlamaIndex API calls in Python so your app stops hammering OpenAI, Anthropic, or any other model provider when traffic spikes. You need this when you’re running multi-user apps, background jobs, or agent workflows that can trigger bursts of requests and get you throttled.
What You'll Need
- Python 3.10+
- `llama-index`
- A model provider package, for example:
  - `openai`
  - `anthropic`
- An API key for the provider you want to call
- Basic familiarity with LlamaIndex `Settings`, `QueryEngine`, and `ChatEngine`
- A place to store environment variables, such as a `.env` file
Install the packages:
```bash
pip install llama-index openai
```
Step-by-Step
- Start by setting up a simple LlamaIndex app that can make LLM calls. We’ll use a small local index so the rate limiter is easy to test without needing a full production dataset.

```python
import os

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

# Fail fast if the key is missing rather than sending requests with an empty key.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY in your environment or .env file.")

Settings.llm = OpenAI(model="gpt-4o-mini")

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("Summarize the main topic of these documents.")
print(response)
```
- Add a reusable token-bucket rate limiter. This version is process-local and thread-safe, which is enough for a typical single-process app, and it smooths out request bursts before they hit the provider.

```python
import time
from threading import Lock


class TokenBucketRateLimiter:
    """Refills tokens at rate_per_second and allows bursts up to capacity."""

    def __init__(self, rate_per_second: float, capacity: int):
        self.rate_per_second = rate_per_second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated_at = time.monotonic()
        self.lock = Lock()

    def acquire(self) -> None:
        while True:
            with self.lock:
                # Refill the bucket based on how much time has passed.
                now = time.monotonic()
                elapsed = now - self.updated_at
                self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)
                self.updated_at = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait_time = (1 - self.tokens) / self.rate_per_second
            # Sleep outside the lock so other threads can refill and acquire.
            time.sleep(wait_time)
```
- Wrap your LlamaIndex query calls with the limiter. This keeps the limiter outside LlamaIndex internals, which is usually the cleanest place to control traffic in production.

```python
from typing import Any


class RateLimitedQueryEngine:
    """Blocks on the limiter before delegating to the wrapped engine."""

    def __init__(self, query_engine: Any, limiter: TokenBucketRateLimiter):
        self.query_engine = query_engine
        self.limiter = limiter

    def query(self, prompt: str):
        self.limiter.acquire()
        return self.query_engine.query(prompt)


limiter = TokenBucketRateLimiter(rate_per_second=2.0, capacity=2)
rate_limited_query_engine = RateLimitedQueryEngine(query_engine, limiter)

for i in range(5):
    result = rate_limited_query_engine.query(f"What is document {i} about?")
    print(f"Request {i + 1}: {result}")
```
- If you want better control over retries and backoff, combine rate limiting with explicit retry handling for provider-side throttling errors. This matters because client-side limiting reduces pressure, but it does not eliminate every 429 from upstream services.

```python
import time

from openai import RateLimitError


def safe_query(prompt: str, retries: int = 3):
    for attempt in range(retries):
        try:
            limiter.acquire()
            return query_engine.query(prompt)
        except RateLimitError:
            if attempt == retries - 1:
                raise  # Out of retries; surface the error to the caller.
            sleep_for = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s, ...
            time.sleep(sleep_for)


print(safe_query("Give me a concise summary of the documents."))
```
- For multi-user systems, move from one global limiter to per-user or per-tenant limiters. That prevents one noisy user from consuming all of the available request budget.

```python
from collections import defaultdict

# Each tenant lazily gets its own bucket on first use.
tenant_limiters = defaultdict(lambda: TokenBucketRateLimiter(rate_per_second=1.0, capacity=2))


def tenant_query(tenant_id: str, prompt: str):
    tenant_limiters[tenant_id].acquire()
    return query_engine.query(prompt)


print(tenant_query("tenant-a", "What are the key points?"))
print(tenant_query("tenant-b", "What are the key points?"))
```
Testing It
Run the script and fire several queries back-to-back. With the limiter set to `rate_per_second=2.0` and `capacity=2`, the first two requests should go through immediately, and later ones should pause before continuing.
If you want proof beyond timing by eye, print timestamps before and after each call and confirm the spacing between requests matches your configured rate. Also test with a lower limit like `rate_per_second=0.5` so the delay is obvious.
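Here is one way to do that; a minimal sketch that reuses the limiter and wrapper from the earlier steps and assumes `query_engine` is already built:

```python
import time

slow_limiter = TokenBucketRateLimiter(rate_per_second=0.5, capacity=1)
slow_engine = RateLimitedQueryEngine(query_engine, slow_limiter)

start = time.monotonic()
for i in range(4):
    began = time.monotonic() - start
    slow_engine.query("What are the key points?")
    print(f"Request {i + 1} started at {began:.2f}s")
# At rate_per_second=0.5 you should see roughly 2 seconds between starts
# (plus whatever latency the model itself adds).
```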
If you’re using retries, temporarily reduce your provider quota or simulate throttling by forcing exceptions in your wrapper. You want to verify two things: your code waits before calling the model, and it backs off cleanly when the provider still returns a 429.
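One way to simulate throttling without touching your quota is a test double in place of the real engine. This sketch assumes the v1 `openai` SDK, where `RateLimitError` is constructed from an `httpx` response; `ThrottlingStub` is a made-up name for illustration, not part of LlamaIndex:

```python
import httpx
from openai import RateLimitError


class ThrottlingStub:
    """Test double: raises RateLimitError for the first two calls, then succeeds."""

    def __init__(self):
        self.calls = 0

    def query(self, prompt: str):
        self.calls += 1
        if self.calls <= 2:
            request = httpx.Request("POST", "https://api.openai.com/v1/chat/completions")
            raise RateLimitError(
                "simulated 429",
                response=httpx.Response(429, request=request),
                body=None,
            )
        return f"ok after {self.calls} attempts"


query_engine = ThrottlingStub()  # Temporarily shadow the real engine.
print(safe_query("test prompt"))  # Expect two backoffs, then "ok after 3 attempts".
```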
Next Steps
- Add distributed rate limiting with Redis if you run multiple app instances (a starting-point sketch follows this list).
- Wire the limiter into an agent workflow so tool calls and LLM calls share one budget.
- Add observability: log wait times, retry counts, and per-tenant request volume.
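For the Redis route, a minimal fixed-window sketch with `redis-py` could look like the following. The key naming, the 1-second window, and the example limit are assumptions; a production version would want an atomic Lua script or a proper distributed token bucket:

```python
import time

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, db=0)


def acquire_distributed(name: str, limit_per_second: int) -> None:
    """Block until `name` is under its shared per-second budget across all instances."""
    while True:
        window = f"ratelimit:{name}:{int(time.time())}"
        pipe = r.pipeline()
        pipe.incr(window)       # Count this request in the current 1-second window.
        pipe.expire(window, 2)  # Old windows clean themselves up.
        count, _ = pipe.execute()
        if count <= limit_per_second:
            return
        time.sleep(0.05)  # Window is full; retry shortly.


acquire_distributed("openai", limit_per_second=5)
```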
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit