# How to Fix 'rate limit exceeded in production' in CrewAI (Python)

## What this error means
rate limit exceeded in production usually means CrewAI is calling an LLM provider faster than the provider allows. In practice, this shows up when you run multiple agents/tasks, retry too aggressively, or create a loop that keeps firing requests without backoff.
The failure often appears as a provider error wrapped by CrewAI, for example:

```
openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit exceeded'}}
```

or

```
litellm.RateLimitError: RateLimitError: OpenAIException - rate_limit_exceeded
```
## The Most Common Cause
The #1 cause is uncontrolled parallelism: too many agents, too many tasks, or too many retries hitting the same model at once.
A common mistake is spinning up several Agent/Task calls without controlling concurrency or request volume.
### Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Fires multiple LLM calls at once | Limits concurrency and adds retry/backoff |
| No pacing between tasks | Uses sequential execution or a queue |
| Recreates agents repeatedly | Reuses agents and shared config |
```python
# broken.py
from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

researcher = Agent(
    role="Researcher",
    goal="Research the customer issue",
    backstory="Senior analyst",
    llm=llm,
)
writer = Agent(
    role="Writer",
    goal="Write the report",
    backstory="Technical writer",
    llm=llm,
)

tasks = [
    # problem: async tasks fire their LLM calls in one burst
    Task(description="Summarize incident A", expected_output="Summary",
         agent=researcher, async_execution=True),
    Task(description="Summarize incident B", expected_output="Summary",
         agent=researcher, async_execution=True),
    Task(description="Draft final report", expected_output="Report",
         agent=writer),
]

crew = Crew(
    agents=[researcher, writer],
    tasks=tasks,
    process=Process.sequential,
)
result = crew.kickoff()
print(result)
```
```python
# fixed.py
from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    max_retries=5,  # let the OpenAI client retry 429s with backoff
)

researcher = Agent(
    role="Researcher",
    goal="Research the customer issue",
    backstory="Senior analyst",
    llm=llm,
)
writer = Agent(
    role="Writer",
    goal="Write the report",
    backstory="Technical writer",
    llm=llm,
)

tasks = [
    # no async_execution: tasks run one at a time
    Task(description="Summarize incident A", expected_output="Summary", agent=researcher),
    Task(description="Summarize incident B", expected_output="Summary", agent=researcher),
    Task(description="Draft final report", expected_output="Report", agent=writer),
]

crew = Crew(
    agents=[researcher, writer],
    tasks=tasks,
    process=Process.sequential,  # reduce burst traffic
)
result = crew.kickoff()
print(result)
```
If you need parallelism, add your own throttling around task submission. Don’t assume CrewAI will protect you from provider quotas.
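One simple way to add your own throttling is a semaphore around whatever submits work to the provider. A minimal sketch with a stand-in worker function — `run_job` and the limit of 2 are assumptions for illustration, not CrewAI APIs; in a real app the body would call `crew.kickoff()`:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 2  # assumed provider headroom; tune to your quota
_gate = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def run_job(job_id: int) -> str:
    """Stand-in for a crew.kickoff() call; replace the body with your real call."""
    with _gate:  # at most MAX_IN_FLIGHT provider-facing calls at once
        return f"result-{job_id}"

# submit 8 jobs, but only MAX_IN_FLIGHT ever run against the provider at once
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_job, range(8)))
```

The pool can stay large; the semaphore, not the pool size, is what caps provider traffic.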
## Other Possible Causes

### 1) Too many retries with no backoff
If your app retries instantly on every 429, you create a self-inflicted traffic spike.
```python
# bad: immediate retry loop hammers the provider
for _ in range(10):
    try:
        result = crew.kickoff()
        break
    except Exception:
        continue
```
Use exponential backoff and cap attempts:
```python
import time

delay = 1
for attempt in range(5):
    try:
        result = crew.kickoff()
        break
    except Exception as e:
        if "rate limit" not in str(e).lower():
            raise
        time.sleep(delay)
        delay *= 2  # exponential backoff: 1s, 2s, 4s, 8s, ...
```
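If you retry in several places, it helps to centralize the policy in one helper. A sketch with capped, jittered backoff — the function name and parameters are ours, not a CrewAI API:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Call fn(), retrying rate-limit errors with capped, jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as e:
            if "rate limit" not in str(e).lower():
                raise  # not a quota problem; surface it immediately
            if attempt == max_attempts - 1:
                raise  # out of attempts
            # full jitter: sleep a random amount up to the exponential cap
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))

# usage: result = retry_with_backoff(crew.kickoff)
```

Jitter matters when many workers back off at once: without it they all wake up and retry at the same instant, recreating the spike.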
### 2) Model choice is too small for your traffic

A low-tier model can hit quota faster under load. If your app moved from local testing to production traffic, the same code may start failing.

```python
llm = ChatOpenAI(model="gpt-4o-mini")  # fine for dev, but quota may be tight in prod
```
Fix by moving critical paths to a higher-capacity deployment or using separate keys/projects for production workloads.
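A lightweight way to keep dev and prod on separate quota pools is to select the model (and, by extension, the key/project you configure for it) from the environment. A sketch; the env var name and the mapping are assumptions, not a CrewAI convention:

```python
import os

# Assumed convention: APP_ENV selects which deployment/quota pool to use.
MODEL_BY_ENV = {
    "dev": "gpt-4o-mini",  # cheap, tight quota
    "prod": "gpt-4o",      # higher-capacity deployment
}

env = os.environ.get("APP_ENV", "dev")
model = MODEL_BY_ENV.get(env, "gpt-4o-mini")
# then: llm = ChatOpenAI(model=model)
```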
### 3) You are creating new client instances per request

Rebuilding `ChatOpenAI`, `Agent`, and `Crew` objects inside every web request increases overhead and can amplify bursts.
```python
# bad: new clients created inside a FastAPI route or Flask handler
def handle_request():
    llm = ChatOpenAI(model="gpt-4o-mini")
    agent = Agent(role="Analyst", goal="Analyze", backstory="...", llm=llm)
```
Prefer shared singletons at app startup:
```python
# good: create once at startup/module scope
llm = ChatOpenAI(model="gpt-4o-mini")
agent = Agent(role="Analyst", goal="Analyze", backstory="...", llm=llm)
```
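If module-level globals are awkward in your app, a cached factory gives the same effect. A sketch with a stand-in client class — `get_llm` and `FakeClient` are our names for illustration, not CrewAI helpers:

```python
from functools import lru_cache

class FakeClient:
    """Stand-in for ChatOpenAI; replace with your real client class."""
    def __init__(self, model: str):
        self.model = model

@lru_cache(maxsize=None)
def get_llm(model: str = "gpt-4o-mini") -> FakeClient:
    # built once per model name, then reused by every request
    return FakeClient(model)
```

Every handler that calls `get_llm()` shares one instance per model name, so connection pools and retry state are shared too.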
### 4) Your prompt is causing runaway tool usage
An agent stuck in a tool loop can generate dozens of hidden LLM calls. In CrewAI this often looks like normal execution until the provider starts returning 429s.
```python
agent = Agent(
    role="Support triage",
    goal="Keep searching until certain",  # open-ended goal invites tool loops
    backstory="...",
)
```
Tighten the objective and set explicit stop conditions in your task design; recent CrewAI versions also let you cap reasoning loops with `max_iter` and throttle request volume with `max_rpm` on `Agent`. If a tool keeps returning empty results, fail fast instead of looping forever.
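One way to fail fast is to wrap a tool so that repeated empty results raise instead of feeding the loop. A pure-Python sketch; the wrapper and its threshold are our invention, not a CrewAI feature:

```python
def fail_fast_on_empty(tool_fn, max_empty: int = 3):
    """Wrap a tool; raise after max_empty consecutive empty results."""
    empty_streak = 0

    def wrapper(query: str):
        nonlocal empty_streak
        result = tool_fn(query)
        if not result:
            empty_streak += 1
            if empty_streak >= max_empty:
                raise RuntimeError(
                    f"tool returned empty {max_empty} times in a row; "
                    "stopping instead of looping"
                )
        else:
            empty_streak = 0  # any hit resets the streak
        return result

    return wrapper
```

The raised error surfaces in your logs as a task failure you can inspect, instead of dozens of silent LLM calls.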
## How to Debug It

- **Check whether it's one task or many.** Run a single `Task` with one `Agent`. If that works but the full crew fails, you have a concurrency or volume problem.
- **Log the exact exception.** Look for `openai.RateLimitError`, `litellm.RateLimitError`, or HTTP `429`. The provider name tells you whether this is OpenAI, Azure OpenAI, Anthropic via LiteLLM, or another backend.
- **Disable parallel execution.** Turn off `async_execution` on tasks and use `Process.sequential`. If the error disappears, you were bursting requests too hard.
- **Inspect retries and tool loops.** Count how many times each task invokes the model. If one task makes repeated calls, add limits or simplify its prompt/tooling.
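Counting model calls per task can be as simple as a counting decorator around whatever provider-facing callable you control. A generic sketch — the decorator is ours; where you wire it in depends on which client layer you own:

```python
import functools

def count_calls(fn):
    """Wrap a provider-facing callable and count how often it runs."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return fn(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

# usage sketch: wrap your model-call helper, run one task,
# then log wrapper.calls to see how many requests that task made
```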
## Prevention

- Use `Process.sequential` unless you have measured headroom for parallel runs.
- Add exponential backoff for all provider-facing retries.
- Reuse LLM clients and avoid creating new agents/crews inside hot request paths.
- Track request counts per user/job so one workflow cannot burn through your entire quota.
If you’re running CrewAI in production behind an API server, treat LLM calls like any other constrained dependency. Control concurrency first; everything else is secondary.
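Per-user/job tracking can be a small token bucket checked before each kickoff. A minimal in-process sketch (the class and its parameters are ours; a multi-process deployment would back this with Redis or similar):

```python
import time

class TokenBucket:
    """Allow at most `rate` requests per `per` seconds, per key."""

    def __init__(self, rate: int, per: float):
        self.rate, self.per = rate, per
        self.state = {}  # key -> (tokens_remaining, last_check)

    def allow(self, key: str) -> bool:
        now = time.monotonic()
        tokens, last = self.state.get(key, (self.rate, now))
        # refill proportionally to elapsed time, capped at the bucket size
        tokens = min(self.rate, tokens + (now - last) * (self.rate / self.per))
        if tokens < 1:
            self.state[key] = (tokens, now)
            return False
        self.state[key] = (tokens - 1, now)
        return True

bucket = TokenBucket(rate=3, per=60)
# before crew.kickoff(): if not bucket.allow(user_id): reject or queue the job
```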
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.