How to Fix 'rate limit exceeded' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: rate-limit-exceeded, llamaindex, python

If you hit RateLimitError: Error code: 429 - {'error': {'message': 'rate limit exceeded'}} while using LlamaIndex in Python, you’re not dealing with a LlamaIndex bug. You’re hitting the upstream model provider’s quota or request throttling.

This usually shows up when you run ingestion, query loops, or agent workflows that fire too many LLM calls too quickly. In practice, it’s almost always caused by poor batching, uncontrolled parallelism, or a provider rate limit that is too small for your workload.

The Most Common Cause

The #1 cause is calling the LLM inside a tight loop without controlling concurrency or retries.

With LlamaIndex, this often happens during document ingestion or when you query each chunk one-by-one. The broken pattern looks harmless until it starts hammering the provider API.

Broken pattern → Fixed pattern

  • Calls the model for every item immediately → batches work and adds retry/backoff
  • No concurrency control → limits parallel requests
  • No caching, repeats identical prompts → caches and reuses repeated prompts
# BROKEN
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)

# This can trigger lots of rapid calls depending on your workflow
query_engine = index.as_query_engine(llm=llm)

for question in [
    "Summarize the policy",
    "What is excluded?",
    "What is the deductible?",
    "How do claims work?",
]:
    print(query_engine.query(question))
# FIXED
import time
from tenacity import retry, wait_exponential, stop_after_attempt
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")
docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(llm=llm)

@retry(wait=wait_exponential(min=1, max=20), stop=stop_after_attempt(5))
def safe_query(q: str):
    return query_engine.query(q)

questions = [
    "Summarize the policy",
    "What is excluded?",
    "What is the deductible?",
    "How do claims work?",
]

for question in questions:
    print(safe_query(question))
    time.sleep(1.5)  # simple client-side pacing
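
If you want the retry to trigger only on throttling rather than masking real bugs, tenacity can filter by exception type. A minimal sketch, assuming the provider's openai.RateLimitError propagates through LlamaIndex unchanged:

from openai import RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(RateLimitError),  # only retry 429s; other errors fail fast
    wait=wait_exponential(min=1, max=20),
    stop=stop_after_attempt(5),
)
def safe_query(q: str):
    return query_engine.query(q)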

If you’re using agents or workflows, the same issue shows up as repeated tool calls or nested LLM calls from a single user request. That multiplies request volume fast.
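
One way to keep that in check is to cap how many reasoning steps an agent may take per request. A rough sketch, assuming the ReActAgent API from llama_index.core.agent (newer workflow-based agents expose similar limits):

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool

policy_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="policy_docs",
    description="Answers questions about the policy documents.",
)

agent = ReActAgent.from_tools(
    [policy_tool],
    llm=llm,
    max_iterations=5,  # hard cap on tool/LLM round-trips per user request
)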

Other Possible Causes

1. Your provider quota is actually exhausted

This is common with OpenAI, Anthropic, Azure OpenAI, and similar providers. The error may be a real account-level limit, not just burst throttling.

# Example: inspect provider-side limits in your dashboard
# OpenAI / Anthropic / Azure: check RPM, TPM, daily spend caps

What to check (the sketch after this list shows how to read these limits from response headers):

  • Requests per minute (RPM)
  • Tokens per minute (TPM)
  • Daily spend limit
  • Model-specific quota
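
You can also read your current limits straight from the API instead of the dashboard. A sketch assuming the OpenAI Python SDK (v1.x); exact header names vary by provider:

from openai import OpenAI as OpenAIClient

client = OpenAIClient()
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)

# Rate-limit headers OpenAI returns alongside every response
for header in (
    "x-ratelimit-limit-requests",
    "x-ratelimit-remaining-requests",
    "x-ratelimit-limit-tokens",
    "x-ratelimit-remaining-tokens",
):
    print(header, raw.headers.get(header))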

2. Parallel ingestion is too aggressive

LlamaIndex ingestion pipelines can fan out requests if you process many nodes at once. If you set high concurrency in your own code, you can overwhelm the API quickly.

# Too aggressive
import asyncio

async def ingest_many(chunks):
    # Fires one request per chunk with no cap on concurrency
    await asyncio.gather(*[process_chunk(c) for c in chunks])

# Safer: limit concurrency with a semaphore
import asyncio

from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")
sem = asyncio.Semaphore(3)  # at most 3 LLM calls in flight at once

async def process_chunk(chunk):
    async with sem:
        return await llm.acomplete(f"Summarize: {chunk}")

async def ingest_many(chunks):
    return await asyncio.gather(*[process_chunk(c) for c in chunks])
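With a semaphore of 3, at most three summarization calls are in flight at any moment; tune that number against your provider's RPM and TPM limits rather than your CPU count.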

3. You’re re-indexing on every request

A common production mistake is rebuilding VectorStoreIndex inside an API endpoint. That creates extra embedding and generation traffic on every call.

# BROKEN: rebuilds index every request
def answer(question):
    docs = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(docs)
    return index.as_query_engine().query(question)
# FIXED: build once at startup and reuse
docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()

def answer(question):
    return query_engine.query(question)
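
If the process restarts often (serverless functions, autoscaled workers), you can also persist the index to disk so embeddings aren't recomputed on every cold start. A minimal sketch using LlamaIndex's default storage, assuming a local ./storage directory:

import os
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"

if os.path.exists(PERSIST_DIR):
    # Reload the previously built index instead of re-embedding everything
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
else:
    docs = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(docs)
    index.storage_context.persist(persist_dir=PERSIST_DIR)

query_engine = index.as_query_engine()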

4. Your prompt/input size is causing retries and token spikes

Large context windows can burn through TPM limits fast. If your chunks are too big or your retriever returns too many nodes, one query can become several expensive calls.

from llama_index.core import Settings

Settings.chunk_size = 512   # smaller chunks reduce prompt bloat
Settings.chunk_overlap = 50

Also check retriever settings:

query_engine = index.as_query_engine(similarity_top_k=3)  # avoid huge context injection
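
The response synthesizer mode matters too: "refine" issues roughly one LLM call per retrieved node, while "compact" packs nodes into as few prompts as possible. A sketch, assuming your version accepts response_mode here:

query_engine = index.as_query_engine(
    similarity_top_k=3,
    response_mode="compact",  # fewer, larger prompts instead of one call per node
)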

How to Debug It

  1. Read the exact exception. Look for RateLimitError, HTTP 429, and provider-specific text like:

    • rate limit exceeded
    • You exceeded your current quota
    • Please reduce your request rate
  2. Log every LLM call. Count how many times LlamaIndex hits the model per user action. If one request triggers 10+ completions unexpectedly, you’ve found your problem. A token-counting callback (see the sketch after this list) makes this easy to instrument.

  3. Check whether it happens only under load. Run a single request manually. Then run 10 concurrent requests. If failures only appear under concurrency, throttle your workers.

  4. Inspect your LlamaIndex flow. Look for:

    • index creation inside request handlers
    • asyncio.gather() without limits
    • agents looping over tools repeatedly
    • large similarity_top_k values
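
For step 2, LlamaIndex's token-counting callback gives you per-action call counts without touching provider dashboards. A sketch, assuming tiktoken knows the model's encoding (fall back to tiktoken.get_encoding if it doesn't):

import tiktoken
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4o-mini").encode
)
Settings.callback_manager = CallbackManager([token_counter])

# ...run one user action through your query engine or agent, then:
print("LLM calls:", len(token_counter.llm_token_counts))
print("Prompt tokens:", token_counter.prompt_llm_token_count)
print("Completion tokens:", token_counter.completion_llm_token_count)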

A quick test harness helps isolate it:

for i in range(20):
    try:
        print(query_engine.query(f"Question {i}"))
    except Exception as e:
        print(type(e).__name__, str(e))
        break

If the error appears after a predictable number of calls, it’s usually quota or burst throttling.

Prevention

  • Reuse indexes and query engines instead of rebuilding them per request.
  • Add exponential backoff retries around all LLM calls that go through LlamaIndex (see the client-level sketch after this list).
  • Cap concurrency for ingestion jobs and batch processing.
  • Keep chunk sizes and retrieved context small enough to stay under token limits.
  • Monitor provider RPM/TPM usage before shipping to production.
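
For the retry bullet, part of the work can live in the LLM constructor itself, since the underlying OpenAI client retries 429s with backoff. A sketch, assuming the llama-index OpenAI wrapper exposes max_retries and timeout and passes them to the SDK client:

from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",
    max_retries=5,   # SDK-level retries with backoff on 429s
    timeout=60.0,    # avoid hung requests piling up and re-firing
)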

If you want this to stay stable in production, treat rate limits as part of system design, not an edge case. The fix is usually less about changing LlamaIndex and more about controlling how often your app asks the model to do work.



By Cyprian Aarons, AI Consultant at Topiax.
