How to Fix 'rate limit exceeded when scaling' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: rate-limit-exceeded-when-scaling, llamaindex, python

When you see rate limit exceeded when scaling in a LlamaIndex Python app, it usually means your code is generating more LLM or embedding requests than the upstream provider allows. This shows up when you move from a small local test to batch indexing, multi-document ingestion, parallel query handling, or an agent loop that fans out requests.

In practice, the failure is rarely “LlamaIndex is broken.” It’s usually your concurrency, chunking, retry policy, or model selection pushing OpenAI, Azure OpenAI, Anthropic, or another provider past its quota.

The Most Common Cause

The #1 cause is uncontrolled parallelism during ingestion or retrieval. A common pattern is creating many tasks at once with asyncio.gather() or running a large batch through VectorStoreIndex.from_documents() without throttling.

Here’s the broken pattern, which fires too many requests at once:
import asyncio
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import Document

docs = [Document(text=f"Doc {i}") for i in range(500)]

async def build_index():
    # Broken: this can trigger a burst of embedding + LLM calls
    index = VectorStoreIndex.from_documents(docs)
    return index

asyncio.run(build_index())

And here’s the fixed version, which limits concurrency and batches the work:

import asyncio
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import Document

docs = [Document(text=f"Doc {i}") for i in range(500)]

async def build_index():
    # Better: process documents in smaller batches so embedding
    # requests are spread out instead of fired in one burst
    batch_size = 25
    index = None

    for i in range(0, len(docs), batch_size):
        batch = docs[i : i + batch_size]
        if index is None:
            index = VectorStoreIndex.from_documents(batch)
        else:
            for doc in batch:
                index.insert(doc)

    return index

asyncio.run(build_index())

If you are using an async pipeline, also cap concurrency explicitly:

import asyncio

sem = asyncio.Semaphore(4)

async def bounded_embed(doc):
    # embed_document is whatever coroutine wraps your embedding call
    async with sem:
        return await embed_document(doc)

async def embed_all(docs):
    # gather still schedules everything, but the semaphore caps in-flight requests at 4
    return await asyncio.gather(*(bounded_embed(doc) for doc in docs))

If your stack is hitting OpenAI limits, the error often surfaces as:

  • openai.RateLimitError: Error code: 429
  • Rate limit reached for gpt-4o
  • Too many requests in 1 minute
  • Retry exhaustion raised from inside LlamaIndex query or transform code after repeated 429s

Other Possible Causes

1) Chunking is too aggressive

If your chunk size is tiny, LlamaIndex creates too many chunks and therefore too many embedding calls.

from llama_index.core.node_parser import SentenceSplitter

# Bad: too small chunks create excessive request volume
splitter = SentenceSplitter(chunk_size=128, chunk_overlap=20)

# Better: fewer chunks means fewer embeddings
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=100)

2) Query fan-out from agents or routers

Agentic workflows can multiply requests fast. A single user query may trigger tool calls, sub-queries, reranking, and synthesis passes.

# Example of high fan-out behavior
from llama_index.core.agent import ReActAgent

agent = ReActAgent.from_tools(tools)  # `tools` is a list of tools you defined earlier

# If each tool call triggers retrieval + synthesis,
# one user request can become 10+ provider calls.
response = agent.chat("Summarize all claims across these 50 policies")

Fix by reducing tool count, disabling unnecessary sub-question generation, or adding caching around repeated retrieval paths.
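
For the caching option, even a plain in-process dictionary keyed by the query string can absorb repeated retrieval calls. A minimal sketch, assuming `retriever` is an ordinary LlamaIndex retriever such as the one returned by index.as_retriever():

# Minimal sketch: memoize retrieval results by query string so repeated
# sub-queries from an agent do not hit the embedding/LLM provider again.
_retrieval_cache = {}

def cached_retrieve(retriever, query: str):
    if query not in _retrieval_cache:
        _retrieval_cache[query] = retriever.retrieve(query)
    return _retrieval_cache[query]

This only helps when the exact query text repeats; fuzzier reuse needs semantic caching, which is out of scope here.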

3) No retry/backoff policy

A transient 429 becomes a hard failure if you do not retry with exponential backoff.

import time
import random

def retry_with_backoff(fn, max_retries=5):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:
            # Only retry rate-limit errors; re-raise everything else immediately
            if "RateLimitError" not in type(e).__name__:
                raise
            # Exponential backoff with jitter: ~1s, 2s, 4s, 8s, 16s
            sleep_s = (2 ** attempt) + random.random()
            time.sleep(sleep_s)
    raise RuntimeError("Exceeded retries after rate limit errors")

If you are using LlamaIndex callbacks or custom wrappers around OpenAI, AzureOpenAI, or Anthropic, put the retry there instead of scattering it across business logic.
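
As a concrete example, the retry_with_backoff helper above can wrap the embedding call once at that boundary. A sketch, assuming the llama-index-embeddings-openai integration is installed; the model name is just an example:

from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-small")

def embed_text(text: str):
    # Single retry wrapper at the provider boundary, reused by all callers
    return retry_with_backoff(lambda: embed_model.get_text_embedding(text))

vector = embed_text("example chunk of text")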

4) Model mismatch for the workload

Using a large chat model for every step increases request cost and makes rate limits easier to hit.

from llama_index.llms.openai import OpenAI

# Bad: expensive model everywhere
llm = OpenAI(model="gpt-4o")

# Better: use a smaller model for routing/extraction tasks
llm = OpenAI(model="gpt-4o-mini")

For embeddings, the same rule applies. If you are embedding millions of chunks on a low-throughput provider tier, you will hit quotas faster than expected.
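
One way to enforce that split, assuming a recent llama-index-core where Settings holds the defaults and as_query_engine accepts an llm override (`index` is the VectorStoreIndex built earlier):

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Cheap default used by routing, extraction, and other intermediate steps
Settings.llm = OpenAI(model="gpt-4o-mini")

# Hand the larger model only to the component that writes the final answer
query_engine = index.as_query_engine(llm=OpenAI(model="gpt-4o"))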

How to Debug It

  1. Confirm where the 429 comes from

    • Check whether the exception is from embeddings, retrieval, synthesis, agent tool execution, or indexing.
    • Look for class names like openai.RateLimitError, RateLimitError, or provider-specific exceptions wrapped by LlamaIndex.
  2. Log request volume per phase

    • Count how many calls happen during ingestion vs querying.
    • Add logging around document loading, node parsing, embedding generation, and response synthesis (see the callback sketch after this list).
  3. Reduce concurrency to 1

    • Run the same workload serially.
    • If the error disappears, your issue is almost certainly parallelism or burst traffic.
  4. Inspect chunk counts

    • Print how many nodes are being created from each document.
    • If one PDF becomes thousands of nodes because of tiny chunks, that’s your bottleneck.
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=512)
nodes = splitter.get_nodes_from_documents(docs)
print(f"Created {len(nodes)} nodes")

Prevention

  • Use bounded concurrency everywhere you call LLMs or embeddings.
  • Tune chunk sizes so ingestion does not explode into thousands of small requests.
  • Add retries with exponential backoff at the provider boundary.
  • Cache repeated retrieval and synthesis results when workflows fan out.
  • Separate “cheap” models for routing/extraction from “expensive” models for final answers.

If this error only appears when scaling up, treat it as a capacity problem first. In LlamaIndex apps, rate limits are usually a symptom of request shape: too many calls, too fast, from too many places.

