LlamaIndex Tutorial (Python): rate limiting API calls for beginners

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to rate limit LlamaIndex API calls in Python so your app stops hammering an LLM provider when traffic spikes. You’ll build a simple token-bucket limiter around LlamaIndex’s OpenAI integration, which is useful when you need to control cost, avoid 429 errors, and keep request volume predictable.

What You'll Need

  • Python 3.10+
  • A working LlamaIndex install
  • An OpenAI API key
  • pip access to install dependencies
  • Basic familiarity with VectorStoreIndex, QueryEngine, and Settings
  • Optional: a FastAPI or worker-based app where you want to enforce limits

Install the packages:

pip install llama-index llama-index-llms-openai openai

Set your API key:

export OPENAI_API_KEY="your-key-here"

Step-by-Step

  1. Start with a plain LlamaIndex setup so you can see where the limiter fits. We’ll use a tiny document set and an OpenAI-backed query engine.
from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)

docs = [
    Document(text="LlamaIndex helps connect data sources to LLMs."),
    Document(text="Rate limiting protects APIs from bursts and quota overruns."),
]

index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
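
If you want to confirm the baseline setup works before adding any limiting, a single unthrottled query is enough (this makes one real OpenAI call):

print(query_engine.query("What does LlamaIndex do?"))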
  2. Add a reusable token-bucket rate limiter. This version allows a fixed number of requests per minute and sleeps when the bucket is empty.
import time
import threading


class TokenBucketRateLimiter:
    def __init__(self, rate_per_minute: int, capacity: int | None = None):
        # Tokens refill continuously at this rate; capacity caps how big a
        # burst can pass without waiting (defaults to one minute's worth).
        self.rate_per_second = rate_per_minute / 60.0
        self.capacity = capacity or rate_per_minute
        self.tokens = float(self.capacity)
        self.updated_at = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        while True:
            with self.lock:
                # Refill the bucket based on how much time has passed.
                now = time.monotonic()
                elapsed = now - self.updated_at
                self.updated_at = now
                self.tokens = min(
                    self.capacity,
                    self.tokens + elapsed * self.rate_per_second,
                )

                if self.tokens >= 1:
                    # A token is available: spend it and let the request through.
                    self.tokens -= 1
                    return

                # Bucket is empty: work out how long until one token refills.
                wait_time = (1 - self.tokens) / self.rate_per_second

            # Sleep outside the lock so other threads are not blocked.
            time.sleep(wait_time)
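
Before wiring the limiter into LlamaIndex, you can check its pacing on its own; this snippet makes no API calls. At 2 requests per minute the bucket starts with 2 tokens, so the third acquire should block for roughly 30 seconds:

check_limiter = TokenBucketRateLimiter(rate_per_minute=2)

for i in range(3):
    start = time.monotonic()
    check_limiter.acquire()
    print(f"acquire #{i + 1} waited {time.monotonic() - start:.1f}s")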
  3. Wrap your LlamaIndex query calls with the limiter. This keeps the rest of your application code clean and makes the limit easy to change later.
limiter = TokenBucketRateLimiter(rate_per_minute=6)

def limited_query(question: str) -> str:
    limiter.acquire()
    response = query_engine.query(question)
    return str(response)

questions = [
    "What does LlamaIndex do?",
    "Why use rate limiting?",
    "What happens when APIs get too many requests?",
]

for q in questions:
    print(f"Q: {q}")
    print(f"A: {limited_query(q)}\n")
  4. If you want more control, add retry handling for provider-side throttling too. Client-side limiting reduces pressure, but real systems still need to handle transient 429s from upstream services.
from openai import RateLimitError


def limited_query_with_retry(question: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        limiter.acquire()
        try:
            response = query_engine.query(question)
            return str(response)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)

print(limited_query_with_retry("Explain rate limiting in one sentence."))
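
One common refinement, not required here, is to add random jitter to that backoff so multiple workers throttled at the same moment don't all retry in lockstep. A hypothetical helper (backoff_sleep is not part of LlamaIndex) could replace the time.sleep(2 ** attempt) line:

import random
import time


def backoff_sleep(attempt: int) -> None:
    # Exponential backoff plus up to one second of random jitter.
    time.sleep(2 ** attempt + random.random())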
  5. Use the same pattern for any other LlamaIndex call path that hits an API. The important part is not the wrapper itself, but making sure every outbound request goes through one shared limiter instance.
def limited_chat(prompt: str) -> str:
    limiter.acquire()
    chat_llm = Settings.llm
    result = chat_llm.complete(prompt)
    return result.text

print(limited_chat("Give me a short definition of token bucket rate limiting."))
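
If your app is a FastAPI service (mentioned under What You'll Need), the same shared limiter can sit behind an endpoint. This is a minimal sketch, assuming fastapi is installed and that limiter and limited_query from above are in scope; the endpoint is a plain def, so FastAPI runs it in a worker thread and the blocking acquire() does not stall the event loop.

from fastapi import FastAPI

app = FastAPI()


@app.get("/ask")
def ask(question: str) -> dict:
    # limited_query() calls limiter.acquire() before hitting the LLM,
    # so every request through this endpoint respects the same budget.
    return {"answer": limited_query(question)}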

Testing It

Run the script and watch the spacing between requests when your limit is low, like 6 requests per minute. If you call limited_query() in a loop, it should pause once the bucket runs dry instead of firing all requests at once.

You can verify correctness by logging timestamps before and after each call. For a stricter test, temporarily lower rate_per_minute to 2 and confirm that the third request waits roughly 30 seconds before continuing.
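
For example, printing a timestamp before and after each call makes the pacing easy to read:

for q in questions:
    print(time.strftime("%H:%M:%S"), "sending:", q)
    answer = limited_query(q)
    print(time.strftime("%H:%M:%S"), "received:", answer[:60])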

If you see RateLimitError anyway, that usually means your local limit is set too high or another part of your app is calling the API outside this wrapper. Also remember that this limiter lives in memory, so it only applies inside one Python process; if you run multiple workers, each one gets its own bucket, and their combined rate can still exceed the provider's limit.
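
If you need one limit shared across processes or containers, the usual approach is to keep the counter in an external store. Below is a rough sketch of a fixed-window limiter on top of Redis; it assumes the redis package and a running Redis server, neither of which this tutorial sets up. It is simpler than a token bucket, but it shows the idea of moving the state out of the process.

import time

import redis

r = redis.Redis(host="localhost", port=6379)


def acquire_shared(key: str = "llm-rpm", limit: int = 60) -> None:
    # Fixed-window limiter: count requests in the current minute across
    # all processes, and wait for the next window once the budget is spent.
    while True:
        window = int(time.time() // 60)
        count = r.incr(f"{key}:{window}")
        r.expire(f"{key}:{window}", 120)
        if count <= limit:
            return
        time.sleep(60 - time.time() % 60)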

Next Steps

  • Move the limiter into a shared service if you run multiple workers or containers.
  • Add per-user or per-tenant quotas instead of one global limit (a minimal sketch follows this list).
  • Combine this with async queues so bursts get buffered instead of blocked inline.
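
For per-user quotas, one simple option is to keep a separate TokenBucketRateLimiter per user ID and look it up before each call. A minimal sketch, assuming a hypothetical user_id string supplied by your request context:

from collections import defaultdict

# One limiter per user, created on first use: 10 requests per minute each.
user_limiters: defaultdict[str, TokenBucketRateLimiter] = defaultdict(
    lambda: TokenBucketRateLimiter(rate_per_minute=10)
)


def limited_query_for_user(user_id: str, question: str) -> str:
    user_limiters[user_id].acquire()
    return str(query_engine.query(question))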

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
