# How to Fix 'rate limit exceeded in production' in LlamaIndex (Python)
## What the error means

`rate limit exceeded in production` usually means your app is sending more LLM or embedding requests than the provider allows in a given time window. In LlamaIndex, this often shows up as a wrapped provider exception like `openai.RateLimitError`, `anthropic.RateLimitError`, or a generic `429 Too Many Requests` during indexing, query-time retrieval, or agent loops.
You’ll hit this most often when you run batch ingestion, recursive query engines, or parallel workers without throttling.
## The Most Common Cause

The #1 cause is uncontrolled concurrency. LlamaIndex makes it easy to fan out requests through `VectorStoreIndex.from_documents()`, async query engines, or custom loops, and that can overload the provider fast.
Here’s the broken pattern and the fixed pattern side by side:
| Broken | Fixed |
|---|---|
| Sends too many requests at once | Limits concurrency and adds retries/backoff |
```python
# BROKEN: unbounded parallelism during ingestion/querying
import asyncio

from llama_index.core import VectorStoreIndex
from llama_index.core.schema import Document

docs = [Document(text=f"Doc {i}") for i in range(500)]

async def build_and_query():
    index = VectorStoreIndex.from_documents(docs)
    query_engine = index.as_query_engine()
    tasks = [
        query_engine.aquery(f"What is in doc {i}?")
        for i in range(100)
    ]
    results = await asyncio.gather(*tasks)  # can trigger 429s
    return results
```
```python
# FIXED: throttle concurrency and retry transient 429s
import asyncio

from tenacity import retry, stop_after_attempt, wait_exponential_jitter

from llama_index.core import VectorStoreIndex
from llama_index.core.schema import Document

docs = [Document(text=f"Doc {i}") for i in range(500)]
sem = asyncio.Semaphore(5)  # at most 5 in-flight LLM calls

@retry(wait=wait_exponential_jitter(initial=1, max=30), stop=stop_after_attempt(5))
async def safe_query(query_engine, prompt: str):
    async with sem:
        return await query_engine.aquery(prompt)

async def build_and_query():
    index = VectorStoreIndex.from_documents(docs)
    query_engine = index.as_query_engine()
    tasks = [
        safe_query(query_engine, f"What is in doc {i}?")
        for i in range(100)
    ]
    return await asyncio.gather(*tasks)
```
If you’re using OpenAI through LlamaIndex, the underlying exception often looks like:
```
openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit exceeded', ...}}
```
For Anthropic you may see:
```
anthropic.RateLimitError: Error code: 429 - {'type': 'rate_limit_error', ...}
```
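To confirm which provider error you’re actually getting, you can catch it explicitly around a single call (a minimal sketch, assuming the OpenAI SDK and the `query_engine` from the example above):

```python
# Minimal sketch: surface the provider's rate-limit error type directly.
import openai

try:
    response = query_engine.query("What is in doc 1?")
except openai.RateLimitError as exc:
    # Log it, then let your retry/backoff layer handle the re-attempt.
    print(f"Provider returned 429: {exc}")
    raise
```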
## Other Possible Causes

### 1. Embedding calls are the real bottleneck

People usually blame chat completions, but indexing large corpora hits embedding limits first.
```python
# BAD: huge batch ingestion without chunking or pacing
index = VectorStoreIndex.from_documents(big_docs)
```

```python
# BETTER: ingest in small batches with pacing, reusing one index
import time
from itertools import batched  # Python 3.12+; use a small helper on older versions

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents([])  # start empty, insert incrementally
for batch in batched(big_docs, 50):
    for doc in batch:
        index.insert(doc)
    time.sleep(1)  # pace embedding traffic between batches
```
If your embedding model has strict RPM/TPM limits, also reduce chunk count by tuning splitting.
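Larger chunks mean fewer nodes, and fewer nodes mean fewer embedding requests. A sketch with the built-in `SentenceSplitter` (the sizes are illustrative, not recommendations):

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Fewer, larger chunks => fewer embedding calls per document.
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=64)
index = VectorStoreIndex.from_documents(big_docs, transformations=[splitter])
```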
### 2. You’re recreating the client on every request

If you instantiate models inside request handlers, you can lose connection reuse and multiply traffic under load.
```python
# BAD: new LLM object per request
def handle_request(user_input: str):
    from llama_index.llms.openai import OpenAI
    llm = OpenAI(model="gpt-4o-mini")
    return llm.complete(user_input)
```

```python
# BETTER: create once and reuse
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

def handle_request(user_input: str):
    return llm.complete(user_input)
```
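If several modules need the same client, you can also set it once on LlamaIndex’s global `Settings` object so every index and query engine built afterwards reuses it:

```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# One shared client for the whole process; indexes and query engines
# created after this point pick it up automatically.
Settings.llm = OpenAI(model="gpt-4o-mini")
```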
### 3. Your agent loop is making repeated tool calls

Agents can hammer the model if tools return weak signals or prompts cause retries.

```python
# BAD: agent keeps re-querying because tool output is noisy/empty
agent.chat("Investigate customer issue and keep digging until certain.")
```

Fix it by capping iterations and tightening tool instructions; with the ReAct agent, for example:

```python
from llama_index.core.agent import ReActAgent

agent = ReActAgent.from_tools(
    tools,  # your list of tools
    max_iterations=3,  # hard cap on reasoning/tool-call loops
)
```
Also make tools deterministic and return structured output where possible.
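For example, a tool that returns a structured record gives the agent less reason to keep probing (a sketch; `lookup_ticket` and its fields are hypothetical):

```python
from llama_index.core.tools import FunctionTool

def lookup_ticket(ticket_id: str) -> dict:
    """Return a structured record instead of free text the agent must re-probe."""
    # Hypothetical lookup; in practice this would hit your ticketing system.
    return {"ticket_id": ticket_id, "status": "open", "priority": "high"}

ticket_tool = FunctionTool.from_defaults(fn=lookup_ticket)
```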
### 4. Your provider limits are lower than your traffic pattern

Sometimes the code is fine; your plan just cannot support production load.

```
# Example config review checklist
OPENAI_API_KEY=...
OPENAI_MODEL=gpt-4o-mini
# but traffic spikes to 20 RPS with multiple workers
```
If you run multiple replicas, each one gets its own idea of “safe” throughput. Three pods each doing 5 concurrent requests still means 15 concurrent requests against the same account.
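One hedged way to keep a fleet under the account limit is to derive each pod’s concurrency from a shared budget instead of hard-coding it per replica (a sketch; `REPLICA_COUNT` and the budget value are assumptions you would wire up yourself):

```python
import asyncio
import os

# Illustrative: split one account-wide concurrency budget across replicas.
ACCOUNT_CONCURRENCY = 15  # assumption: what the whole account can sustain
replicas = int(os.environ.get("REPLICA_COUNT", "3"))  # hypothetical env var
sem = asyncio.Semaphore(max(1, ACCOUNT_CONCURRENCY // replicas))
```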
## How to Debug It

1. Find the exact exception source.
   - Log the full stack trace.
   - Check whether it’s `openai.RateLimitError`, `anthropic.RateLimitError`, or a wrapper from `llama_index.core.base.llms.types`.
2. Separate ingestion from query traffic.
   - If it happens during startup or batch jobs, it’s likely embeddings.
   - If it happens under user load, inspect query concurrency and agent loops.
3. Measure request volume per minute (see the token-counting sketch after this list).
   - Count outbound LLM calls.
   - Count embedding calls.
   - Compare that against your provider’s RPM/TPM limits.
4. Disable parallelism temporarily.
   - Run one worker.
   - Set concurrency to 1.
   - If the error disappears, you’ve confirmed a load problem rather than a bad prompt or model bug.
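For step 3, LlamaIndex’s built-in `TokenCountingHandler` callback gives you per-process call and token counts (a minimal sketch; it uses a tiktoken-based tokenizer by default, so pass your own for non-OpenAI models):

```python
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

token_counter = TokenCountingHandler()
Settings.callback_manager = CallbackManager([token_counter])

# ... run your ingestion or queries, then inspect the counters:
print("LLM calls:", len(token_counter.llm_token_counts))
print("LLM tokens:", token_counter.total_llm_token_count)
print("Embedding tokens:", token_counter.total_embedding_token_count)
```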
## Prevention

- Add exponential backoff with jitter around every LlamaIndex call that hits a provider API.
- Cap concurrency with semaphores or worker queues; don’t rely on default async behavior.
- Reuse model clients and persist indexes so you don’t rebuild embeddings on every process start (see the sketch below).
- Put hard limits on agent iterations and tool recursion depth.
- Monitor request counts per route so you catch spikes before they turn into `429 Too Many Requests` errors.
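The persistence piece looks like this with the default local storage (the `./storage` directory name is arbitrary):

```python
from llama_index.core import StorageContext, load_index_from_storage

# After the first build, write the index (including embeddings) to disk...
index.storage_context.persist(persist_dir="./storage")

# ...and on the next process start, reload it instead of re-embedding everything.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
```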
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies

By Cyprian Aarons, AI Consultant at Topiax.