How to Fix 'rate limit exceeded when scaling' in CrewAI (Python)
What this error means
A "rate limit exceeded" error when scaling usually means your CrewAI app is issuing too many LLM calls too quickly, and the provider is throttling you. It shows up most often when you increase max_iter, run more agents or tasks in parallel, or spin up multiple crews at once.
In practice, this is not a CrewAI bug. It’s usually a concurrency and retry problem between your agent workload and the model provider’s API limits.
The Most Common Cause
The #1 cause is uncontrolled parallelism: too many agents, tasks, or retries hitting the same model at the same time.
A common broken pattern is to scale out agents without limiting request pressure:
# broken.py
from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

agents = [
    Agent(role=f"Agent {i}", goal="Do research", backstory="You are helpful", llm=llm)
    for i in range(20)
]

tasks = [
    Task(
        description=f"Research topic {i}",
        expected_output="A short research summary",  # required by recent CrewAI versions
        agent=agents[i],
    )
    for i in range(20)
]

crew = Crew(
    agents=agents,
    tasks=tasks,
    process=Process.sequential,  # still can hit limits if each task fans out internally
)

result = crew.kickoff()
print(result)
When this scales, you may see errors like:
- openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit exceeded'}}
- litellm.RateLimitError
- CrewAIException: rate limit exceeded when scaling
The fix is to reduce concurrent pressure and add explicit throttling/retry control:
# fixed.py
import time

from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    max_retries=3,  # keep client-side retries low; back off at the app layer
)

def build_crew(batch):
    agents = [
        Agent(role=f"Agent {i}", goal="Do research", backstory="You are helpful", llm=llm)
        for i in batch
    ]
    tasks = [
        Task(
            description=f"Research topic {i}",
            expected_output="A short research summary",  # required by recent CrewAI versions
            agent=agents[idx],
        )
        for idx, i in enumerate(batch)
    ]
    return Crew(
        agents=agents,
        tasks=tasks,
        process=Process.sequential,
    )

items = list(range(20))
batch_size = 5

for start in range(0, len(items), batch_size):
    batch = items[start:start + batch_size]
    crew = build_crew(batch)
    result = crew.kickoff()
    print(result)
    time.sleep(2)  # simple pause between batches to relieve rate-limit pressure
The key difference:
- Broken: all 20 tasks launch as one crew, with no cap on request pressure.
- Fixed: work runs in small batches, client retries are capped at 3, and a pause between batches relieves rate-limit pressure.
If you’re using Process.hierarchical, the issue gets worse because manager-agent planning can create extra model calls before task execution even starts.
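If you do need hierarchical mode, give the manager an explicit, throttled client so its planning calls are visible and counted against the same limits. A minimal sketch using Crew's manager_llm parameter (same toy agent as the examples above; in hierarchical mode the manager delegates tasks, so the task carries no fixed agent):

from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, max_retries=3)

worker = Agent(role="Researcher", goal="Do research", backstory="You are helpful", llm=llm)
task = Task(
    description="Research one topic",
    expected_output="A short research summary",
)

crew = Crew(
    agents=[worker],
    tasks=[task],
    process=Process.hierarchical,
    manager_llm=llm,  # manager planning calls hit the same provider quota as worker calls
)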
Other Possible Causes
1) Your provider quota is actually too low
Sometimes the code is fine and your OpenAI/Anthropic/etc. account simply cannot handle the request volume.
# check your model/provider limits before scaling
llm = ChatOpenAI(model="gpt-4o-mini") # cheap model, but still rate-limited by tier
What to check:
- requests per minute (RPM)
- tokens per minute (TPM)
- concurrent requests allowed
- project-level vs org-level limits
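If you're on OpenAI, you don't have to guess: the API reports your current limits in response headers. A quick check, assuming the openai v1 Python SDK and OPENAI_API_KEY set in your environment:

# check_limits.py - read rate-limit headroom from OpenAI response headers
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

resp = client.chat.completions.with_raw_response.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)

for header in (
    "x-ratelimit-limit-requests",
    "x-ratelimit-remaining-requests",
    "x-ratelimit-limit-tokens",
    "x-ratelimit-remaining-tokens",
):
    print(header, "=", resp.headers.get(header))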
2) max_iter is too high on agents
Each extra iteration means more LLM calls. If an agent keeps looping on a task, you multiply traffic fast.
agent = Agent(
    role="Analyst",
    goal="Summarize data",
    backstory="You are precise",
    llm=llm,
    max_iter=10,  # risky at scale
)
Safer pattern:
agent = Agent(
    role="Analyst",
    goal="Summarize data",
    backstory="You are precise",
    llm=llm,
    max_iter=3,
)
3) Too many crews running in parallel from your app layer
Even if one crew is sequential, your FastAPI worker pool or Celery queue may be launching many crews at once.
# bad: unbounded parallel kickoffs, no concurrency guard
from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda c: c.kickoff(), crews))
Use a queue or semaphore around kickoff calls:
from threading import Semaphore

limit = Semaphore(2)  # at most two crews talk to the provider at once

def run_crew(crew):
    with limit:
        return crew.kickoff()
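Then route all kickoffs through the guarded wrapper. The pool can stay wide, because the semaphore, not the worker count, caps provider pressure:

from concurrent.futures import ThreadPoolExecutor

# assumes `crews` from your app layer plus run_crew/limit from above
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_crew, crews))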
4) Retries are amplifying the problem
If your SDK retries aggressively on 429, a small burst becomes a bigger burst.
llm = ChatOpenAI(
    model="gpt-4o-mini",
    max_retries=10,  # can make throttling worse during spikes
)
Prefer lower retry counts plus exponential backoff outside the model client.
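One way to implement that is a backoff wrapper around kickoff() itself. A sketch using the tenacity library (an assumption, not a CrewAI dependency); swap in litellm.RateLimitError if that's the 429 type your stack surfaces:

import openai
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(openai.RateLimitError),  # match the 429 type in your logs
    wait=wait_exponential(multiplier=1, min=2, max=60),    # 2s, 4s, 8s, ... capped at 60s
    stop=stop_after_attempt(5),
)
def kickoff_with_backoff(crew):
    return crew.kickoff()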
How to Debug It
1) Confirm where the 429 comes from
- Check logs for openai.RateLimitError, anthropic.RateLimitError, or litellm.RateLimitError.
- If it appears inside CrewAI after kickoff(), inspect the underlying provider exception.
2) Measure how many LLM calls one task creates
- Turn on debug logging.
- Count calls per agent/task (see the counter sketch after this list).
- Look for loops caused by tool use, reflection steps, or high max_iter.
3) Reduce concurrency to 1
- Run one crew with one task.
- If it works, increase batch size slowly.
- If it fails immediately, your provider quota or prompt size is likely the issue.
4) Inspect token usage and prompt size
- Long context windows can hit TPM before RPM.
- Large tool outputs are common culprits.
- Trim memory, shorten system prompts, and summarize tool results before passing them forward.
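For step 2, one lightweight way to count calls is a LangChain callback handler attached to the shared client, since the examples above route every agent through one ChatOpenAI instance. A sketch; exact callback behavior can vary across langchain/CrewAI versions:

from langchain_core.callbacks import BaseCallbackHandler
from langchain_openai import ChatOpenAI

class CallCounter(BaseCallbackHandler):
    """Increment a counter every time the attached client starts an LLM request."""
    def __init__(self):
        self.calls = 0

    def on_llm_start(self, serialized, prompts, **kwargs):
        self.calls += 1

    def on_chat_model_start(self, serialized, messages, **kwargs):
        self.calls += 1  # chat models fire this hook instead of on_llm_start

counter = CallCounter()
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, callbacks=[counter])

# ... run a single crew/task with this llm, then:
print(f"LLM calls issued: {counter.calls}")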
Prevention
Keep agent settings conservative:
- low max_iter
- controlled tool usage
- sequential processing unless parallelism is necessary

Add request shaping at the application layer (see the rate-limiter sketch after this list):
- semaphores for concurrent crews
- batching for large workloads
- exponential backoff on 429

Monitor provider limits before production rollout:
- RPM/TPM dashboards
- error rates by model name
- per-task call counts so you know which workflow is expensive
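For the request-shaping piece, a simple requests-per-minute gate in front of kickoff() covers most cases. A minimal sketch; the 30 RPM budget is an assumption, so set it below your provider tier:

import threading
import time

class RpmGate:
    """Block callers so that at most `rpm` kickoffs start per rolling minute."""
    def __init__(self, rpm: int):
        self.min_interval = 60.0 / rpm
        self.lock = threading.Lock()
        self.next_slot = 0.0

    def wait(self):
        # reserve the next start slot, then sleep until it arrives
        with self.lock:
            now = time.monotonic()
            slot = max(now, self.next_slot)
            self.next_slot = slot + self.min_interval
        time.sleep(max(0.0, slot - time.monotonic()))

gate = RpmGate(rpm=30)  # assumption: keep this below your provider's RPM tier

def run_crew_throttled(crew):
    gate.wait()
    return crew.kickoff()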
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.