How to Fix 'rate limit exceeded in production' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: rate-limit-exceeded-in-production, llamaindex, python

What the error means

rate limit exceeded in production usually means your app is sending more LLM or embedding requests than the provider allows in a given time window. In LlamaIndex, this often shows up as a wrapped provider exception like openai.RateLimitError, anthropic.RateLimitError, or a generic 429 Too Many Requests during indexing, query-time retrieval, or agent loops.

You’ll hit this most often when you run batch ingestion, recursive query engines, or parallel workers without throttling.

The Most Common Cause

The #1 cause is uncontrolled concurrency. LlamaIndex makes it easy to fan out requests through VectorStoreIndex.from_documents(), async query engines, or custom loops, and that can overload the provider fast.

Here’s the broken pattern and the fixed pattern side by side:

Broken: sends too many requests at once. Fixed: limits concurrency and adds retries with backoff.
# BROKEN: unbounded parallelism during ingestion/querying
import asyncio
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import Document

docs = [Document(text=f"Doc {i}") for i in range(500)]

async def build_and_query():
    index = VectorStoreIndex.from_documents(docs)
    query_engine = index.as_query_engine()

    tasks = [
        query_engine.aquery(f"What is in doc {i}?")
        for i in range(100)
    ]
    results = await asyncio.gather(*tasks)  # can trigger 429s
    return results

# FIXED: throttle concurrency and retry transient 429s
import asyncio
from tenacity import retry, wait_exponential_jitter, stop_after_attempt
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import Document

docs = [Document(text=f"Doc {i}") for i in range(500)]
sem = asyncio.Semaphore(5)

# By default tenacity retries on any exception; in production, narrow this
# with retry=retry_if_exception_type(openai.RateLimitError) or similar
@retry(wait=wait_exponential_jitter(initial=1, max=30), stop=stop_after_attempt(5))
async def safe_query(query_engine, prompt: str):
    async with sem:
        return await query_engine.aquery(prompt)

async def build_and_query():
    index = VectorStoreIndex.from_documents(docs)
    query_engine = index.as_query_engine()

    tasks = [
        safe_query(query_engine, f"What is in doc {i}?")
        for i in range(100)
    ]
    return await asyncio.gather(*tasks)

If you’re using OpenAI through LlamaIndex, the underlying exception often looks like:

openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit exceeded', ...}}

For Anthropic you may see:

anthropic.RateLimitError: Error code: 429 - {'type': 'rate_limit_error', ...}
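Whatever the provider, the retry shape is the same. If you would rather not depend on tenacity, a minimal hand-rolled equivalent looks like this; note the 429 check is a string-match heuristic, so adapt it to your provider's actual exception types:

```python
import random
import time

def with_backoff(fn, max_attempts=5, base=1.0, cap=30.0):
    """Call fn(); on rate-limit errors, retry with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            msg = str(exc).lower()
            rate_limited = "429" in msg or "rate limit" in msg
            if not rate_limited or attempt == max_attempts - 1:
                raise
            # Exponential backoff capped at `cap`, with +/-50% jitter
            time.sleep(min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.5))
```

Wrap each provider call, e.g. `with_backoff(lambda: query_engine.query(prompt))`.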

Other Possible Causes

1. Embedding calls are the real bottleneck

People usually blame chat completions, but indexing large corpora hits embedding limits first.

# BAD: huge batch ingestion without chunking or pacing
index = VectorStoreIndex.from_documents(big_docs)

# BETTER: build the index once, then insert in small batches so
# embedding calls are paced instead of fired all at once
from llama_index.core import StorageContext, VectorStoreIndex

storage_context = StorageContext.from_defaults()
index = VectorStoreIndex.from_documents([], storage_context=storage_context)
for batch in batched(big_docs, size=50):  # batched: any chunking helper
    for doc in batch:
        index.insert(doc)

If your embedding model has strict RPM/TPM limits, also reduce chunk count by tuning splitting.
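The `batched` helper used above is not a built-in before Python 3.12 (where `itertools.batched` arrived); a minimal version:

```python
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of up to `size` items from iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk
```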


2. You’re recreating the client on every request

If you instantiate models inside request handlers, you can lose connection reuse and multiply traffic under load.

# BAD: new LLM object per request
def handle_request(user_input: str):
    from llama_index.llms.openai import OpenAI
    llm = OpenAI(model="gpt-4o-mini")
    return llm.complete(user_input)

# BETTER: create once and reuse
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

def handle_request(user_input: str):
    return llm.complete(user_input)
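The same pattern generalizes to any expensive client. If a module-level global is awkward (for example, you need lazy construction under test), a cached factory works just as well; `StubClient` below is a stand-in for your real LLM class:

```python
from functools import lru_cache

class StubClient:
    """Stand-in for a real LLM client, e.g. OpenAI(model="gpt-4o-mini")."""
    constructed = 0

    def __init__(self):
        StubClient.constructed += 1

@lru_cache(maxsize=None)
def get_client() -> StubClient:
    # Constructed once per process, no matter how many requests call this
    return StubClient()
```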

3. Your agent loop is making repeated tool calls

Agents can hammer the model if tools return weak signals or prompts cause retries.

# BAD: agent keeps re-querying because tool output is noisy/empty
agent.chat("Investigate customer issue and keep digging until certain.")

Fix it by capping iterations when you build the agent:

from llama_index.core.agent import ReActAgent

agent = ReActAgent.from_tools(
    tools,
    llm=llm,
    max_iterations=3,
)

Also make tools deterministic and return structured output where possible.
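If your agent framework doesn't expose an iteration cap, you can enforce one in your own driver loop. A generic sketch (not a LlamaIndex API):

```python
def run_capped(step, max_iterations=3):
    """Call step(i) until it returns a non-None answer or the cap is hit."""
    for i in range(max_iterations):
        answer = step(i)
        if answer is not None:
            return answer
    return None  # caller decides how to surface hitting the cap
```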


4. Your provider limits are lower than your traffic pattern

Sometimes the code is fine; your plan just cannot support production load.

# Example config review checklist
OPENAI_API_KEY=...
OPENAI_MODEL=gpt-4o-mini
# but traffic spikes to 20 RPS with multiple workers

If you run multiple replicas, each one gets its own idea of “safe” throughput. Three pods each doing 5 concurrent requests still means 15 concurrent requests against the same account.
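A practical rule: divide the account-wide concurrency budget by the replica count when sizing each pod's semaphore. The helper name here is illustrative:

```python
def per_replica_limit(account_limit: int, replicas: int) -> int:
    """Concurrency each replica may use so the fleet stays under the account cap."""
    return max(1, account_limit // replicas)
```

For example, `per_replica_limit(15, 3)` gives each of three pods a semaphore of 5.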

How to Debug It

  1. Find the exact exception source

    • Log the full stack trace.
    • Check whether it’s openai.RateLimitError, anthropic.RateLimitError, or a wrapper from llama_index.core.base.llms.types.
  2. Separate ingestion from query traffic

    • If it happens during startup or batch jobs, it’s likely embeddings.
    • If it happens under user load, inspect query concurrency and agent loops.
  3. Measure request volume per minute

    • Count outbound LLM calls.
    • Count embedding calls.
    • Compare that against your provider’s RPM/TPM limits.
  4. Disable parallelism temporarily

    • Run one worker.
    • Set concurrency to 1.
    • If the error disappears, you’ve confirmed a load problem rather than a bad prompt or model bug.
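Step 3 above can be done with a small sliding-window counter wrapped around your outbound calls. A sketch, not part of LlamaIndex:

```python
import time
from collections import deque

class CallCounter:
    """Count events within the trailing `window_s` seconds."""

    def __init__(self, window_s=60.0):
        self.window_s = window_s
        self._stamps = deque()

    def record(self, now=None):
        self._stamps.append(time.monotonic() if now is None else now)

    def count(self, now=None):
        now = time.monotonic() if now is None else now
        # Evict stamps that have aged out of the window
        while self._stamps and now - self._stamps[0] > self.window_s:
            self._stamps.popleft()
        return len(self._stamps)
```

Call `record()` next to every LLM or embedding request and log `count()` periodically to compare against your RPM limit.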

Prevention

  • Add exponential backoff with jitter around every LlamaIndex call that hits a provider API.
  • Cap concurrency with semaphores or worker queues; don’t rely on default async behavior.
  • Reuse model clients and persist indexes so you don’t rebuild embeddings on every process start.
  • Put hard limits on agent iterations and tool recursion depth.
  • Monitor request counts per route so you catch spikes before they turn into 429 Too Many Requests errors.
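The first two bullets combine naturally in a token-bucket limiter that enforces a requests-per-window budget process-wide. A sketch assuming a single process; for multi-replica deployments you would divide the budget per replica or use a shared store:

```python
import time

class TokenBucket:
    """Allow at most `rate` calls per `per` seconds; sleep when the bucket is empty."""

    def __init__(self, rate: int, per: float = 60.0):
        self.rate = rate
        self.per = per
        self.tokens = float(rate)
        self.updated = time.monotonic()

    def acquire(self) -> None:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the bucket size
        self.tokens = min(self.rate, self.tokens + (now - self.updated) * self.rate / self.per)
        self.updated = now
        if self.tokens < 1:
            time.sleep((1 - self.tokens) * self.per / self.rate)
            self.updated = time.monotonic()
            self.tokens = 1.0
        self.tokens -= 1.0
```

Call `bucket.acquire()` immediately before each provider request; combined with the retry wrapper, this keeps steady-state traffic under the limit and absorbs the occasional 429.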

By Cyprian Aarons, AI Consultant at Topiax.