How to Fix 'rate limit exceeded when scaling' in CrewAI (Python)

By Cyprian Aarons. Updated 2026-04-21.

What this error means

rate limit exceeded when scaling usually means your CrewAI app is creating too many LLM calls too quickly, and the provider is throttling you. It shows up most often when you increase max_iter, run more agents/tasks in parallel, or spin up multiple crews at once.

In practice, this is not a CrewAI bug. It’s usually a concurrency and retry problem between your agent workload and the model provider’s API limits.

The Most Common Cause

The #1 cause is uncontrolled parallelism: too many agents, tasks, or retries hitting the same model at the same time.

A common broken pattern is to scale out agents without limiting request pressure:

# broken.py
from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

agents = [
    Agent(role=f"Agent {i}", goal="Do research", backstory="You are helpful", llm=llm)
    for i in range(20)
]

tasks = [
    Task(
        description=f"Research topic {i}",
        expected_output="A short research summary",  # required by current CrewAI versions
        agent=agents[i],
    )
    for i in range(20)
]

crew = Crew(
    agents=agents,
    tasks=tasks,
    process=Process.sequential,  # still can hit limits if each task fans out internally
)

result = crew.kickoff()
print(result)

When this scales, you may see errors like:

  • openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit exceeded'}}
  • litellm.RateLimitError
  • CrewAIException: rate limit exceeded when scaling

The fix is to reduce concurrent pressure and add explicit throttling/retry control:

# fixed.py
import time
from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    max_retries=3,
)

def build_crew(batch):
    agents = [
        Agent(role=f"Agent {i}", goal="Do research", backstory="You are helpful", llm=llm)
        for i in batch
    ]

    tasks = [
        Task(
            description=f"Research topic {i}",
            expected_output="A short research summary",  # required by current CrewAI versions
            agent=agents[idx],
        )
        for idx, i in enumerate(batch)
    ]

    return Crew(
        agents=agents,
        tasks=tasks,
        process=Process.sequential,
    )

items = list(range(20))
batch_size = 5

for start in range(0, len(items), batch_size):
    batch = items[start:start + batch_size]
    crew = build_crew(batch)
    result = crew.kickoff()
    print(result)
    time.sleep(2)  # fixed pause between batches; use exponential backoff for spikier loads

The key differences:

  • Broken: all 20 tasks launch through one crew with no throttling or retry control
  • Fixed: work runs in batches of 5, client retries are capped (max_retries=3), and a pause separates batches

If you’re using Process.hierarchical, the issue gets worse because manager-agent planning can create extra model calls before task execution even starts.

Other Possible Causes

1) Your provider quota is actually too low

Sometimes the code is fine and your OpenAI/Anthropic/etc. account simply cannot handle the request volume.

# check your model/provider limits before scaling
llm = ChatOpenAI(model="gpt-4o-mini")  # cheap model, but still rate-limited by tier

What to check:

  • requests per minute (RPM)
  • tokens per minute (TPM)
  • concurrent requests allowed
  • project-level vs org-level limits
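Before scaling, it helps to sanity-check your expected request rate against those limits. A rough back-of-envelope sketch (all numbers here are illustrative assumptions, not CrewAI internals):

```python
def estimated_rpm(n_agents: int, calls_per_task: int, tasks_per_agent_per_min: float) -> float:
    """Rough requests-per-minute estimate for an agent workload."""
    return n_agents * calls_per_task * tasks_per_agent_per_min

# 20 agents, ~4 LLM calls per task (reasoning + tool use + final answer),
# each agent finishing about 2 tasks per minute:
print(estimated_rpm(20, 4, 2))  # 160 -- compare against your tier's RPM limit
```

If the estimate is anywhere near your RPM or TPM ceiling, throttle before you hit it rather than after.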

2) max_iter is too high on agents

Each extra iteration means more LLM calls. If an agent keeps looping on a task, you multiply traffic fast.

agent = Agent(
    role="Analyst",
    goal="Summarize data",
    backstory="You are precise",
    llm=llm,
    max_iter=10,   # risky at scale
)

Safer pattern:

agent = Agent(
    role="Analyst",
    goal="Summarize data",
    backstory="You are precise",
    llm=llm,
    max_iter=3,
)

3) Too many crews running in parallel from your app layer

Even if one crew is sequential, your FastAPI worker pool or Celery queue may be launching many crews at once.

# bad: an unbounded thread pool lets every crew call kickoff() at once
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda crew: crew.kickoff(), crews))

Use a queue or semaphore around kickoff calls:

from threading import Semaphore

limit = Semaphore(2)

def run_crew(crew):
    with limit:
        return crew.kickoff()
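The semaphore keeps in-flight kickoffs capped even when your app layer fans out with a thread pool. A self-contained sketch, where FakeCrew is a hypothetical stand-in for a real crewai.Crew so the example runs without API keys:

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Semaphore

limit = Semaphore(2)  # at most 2 crews hitting the provider at once

def run_crew(crew):
    with limit:
        return crew.kickoff()

class FakeCrew:
    """Stand-in for crewai.Crew so the sketch runs offline."""
    def __init__(self, i):
        self.i = i

    def kickoff(self):
        return f"result-{self.i}"

# Submit all crews at once; the semaphore still caps concurrency at 2.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_crew, [FakeCrew(i) for i in range(5)]))

print(results)  # ['result-0', 'result-1', 'result-2', 'result-3', 'result-4']
```

The same shape works under FastAPI or Celery: share one module-level semaphore so every worker in the process respects the same cap.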

4) Retries are amplifying the problem

If your SDK retries aggressively on 429, a small burst becomes a bigger burst.

llm = ChatOpenAI(
    model="gpt-4o-mini",
    max_retries=10,  # can make throttling worse during spikes
)

Prefer lower retry counts plus exponential backoff outside the model client.
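A minimal backoff wrapper along those lines, assuming your provider raises an exception whose message mentions 429 or "rate limit" (adapt the except clause to your SDK's typed error, e.g. openai.RateLimitError):

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=1.0):
    """Retry fn() on throttle errors with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            message = str(exc).lower()
            if "429" not in message and "rate limit" not in message:
                raise  # not a throttle error -- fail fast
            if attempt == max_attempts - 1:
                raise  # out of attempts
            # 1s, 2s, 4s, ... plus jitter so callers don't retry in lockstep
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))

# usage sketch: call_with_backoff(lambda: crew.kickoff())
```

Keep the client's own max_retries low (2-3) when you wrap calls this way, so the two retry layers don't multiply.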

How to Debug It

  1. Confirm where the 429 comes from

    • Check logs for openai.RateLimitError, anthropic.RateLimitError, or litellm.RateLimitError.
    • If it appears inside CrewAI after kickoff(), inspect the underlying provider exception.
  2. Measure how many LLM calls one task creates

    • Turn on debug logging.
    • Count calls per agent/task.
    • Look for loops caused by tool use, reflection steps, or high max_iter.
  3. Reduce concurrency to 1

    • Run one crew with one task.
    • If it works, increase batch size slowly.
    • If it fails immediately, your provider quota or prompt size is likely the issue.
  4. Inspect token usage and prompt size

    • Long context windows can hit TPM before RPM.
    • Large tool outputs are common culprits.
    • Trim memory, shorten system prompts, and summarize tool results before passing them forward.
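For step 2, you can count calls per run without provider-specific hooks by wrapping whatever function actually hits the model. A generic sketch, where fake_llm is a hypothetical stand-in for your real client call:

```python
import functools

def count_calls(fn):
    """Decorator that counts how many times an LLM call fires."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return fn(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@count_calls
def fake_llm(prompt):  # hypothetical stand-in for your real model call
    return f"echo: {prompt}"

for prompt in ["plan", "tool call", "final answer"]:
    fake_llm(prompt)

print(fake_llm.calls)  # 3 -- if one task produces far more, look for a loop
```

If one task shows an unexpectedly high count, that is usually where a tool loop or a high max_iter is multiplying traffic.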

Prevention

  • Keep agent settings conservative:

    • low max_iter
    • controlled tool usage
    • sequential processing unless parallelism is necessary
  • Add request shaping at the application layer:

    • semaphores for concurrent crews
    • batching for large workloads
    • exponential backoff on 429
  • Monitor provider limits before production rollout:

    • RPM/TPM dashboards
    • error rates by model name
    • per-task call counts so you know which workflow is expensive

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

