# How to Fix "Chain Execution Stuck in Production" in CrewAI (Python)

## What this error usually means

If your CrewAI chain is "stuck in production," the process is usually not dead. It is waiting forever on a task that never completes, a tool call that never returns, or an agent loop with no exit condition. In practice, this shows up as a request hanging, a worker pinned at 100% CPU, or logs that stop right after `Task started` or `Running crew...`.

The common symptom is that your app never gets past `crew.kickoff()` and you never see a clean exception like `ValidationError` or `TimeoutError`. The chain just sits there until your web server times out or your job runner kills the process.
## The Most Common Cause

The #1 cause is an agent/task loop with no hard stop: infinite delegation, missing output constraints, or a tool that blocks forever. In CrewAI, this often happens when an `Agent` keeps reasoning without producing a final answer, especially if you allow delegation and don't set strong task boundaries.

Here's the broken pattern:
| Broken | Fixed |
|---|---|
| Agent can delegate forever | Agent has explicit stop conditions |
| Tool call has no timeout | Tool call has timeout |
| Task asks for “analyze” with no output format | Task requires exact deliverable |
```python
# BROKEN: can hang in production
from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool

search_tool = SerperDevTool()

researcher = Agent(
    role="Researcher",
    goal="Find everything about the topic",
    backstory="You are thorough and never stop until complete.",
    tools=[search_tool],
    allow_delegation=True,  # risky if not controlled
    verbose=True,
)

task = Task(
    description="Research the topic and analyze all relevant information.",
    expected_output="A detailed analysis.",
    agent=researcher,
)

crew = Crew(
    agents=[researcher],
    tasks=[task],
    process=Process.sequential,
    verbose=True,
)

result = crew.kickoff()
print(result)
```
```python
# FIXED: bounded execution
from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool

search_tool = SerperDevTool()

researcher = Agent(
    role="Researcher",
    goal="Find exactly 3 relevant sources and summarize them",
    backstory="You produce concise outputs and stop when requirements are met.",
    tools=[search_tool],
    allow_delegation=False,
    verbose=True,
)

task = Task(
    description=(
        "Find exactly 3 relevant sources about the topic. "
        "Return only: title, source URL, and 2-line summary for each."
    ),
    expected_output=(
        "A markdown list with exactly 3 items. "
        "No extra commentary."
    ),
    agent=researcher,
)

crew = Crew(
    agents=[researcher],
    tasks=[task],
    process=Process.sequential,
    verbose=True,
)

result = crew.kickoff()
print(result)
```
What changed:

- `allow_delegation=False` removes open-ended back-and-forth.
- The task now has an exact deliverable.
- The agent goal is bounded.
- The output format is strict enough for the model to terminate.

If you're using tools like search, browser automation, database queries, or HTTP clients, also make sure they have timeouts. A hanging tool call looks like a stuck CrewAI chain from the outside.
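One generic way to bound any blocking tool call is to run it on a worker thread with a deadline. This is a standard-library sketch, not a CrewAI API; the helper name `call_with_timeout` is my own:

```python
import concurrent.futures

# Shared pool: a timed-out call keeps running in the background,
# but the caller is unblocked and can fail fast instead of hanging.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, *args, timeout=10, **kwargs):
    """Run a blocking callable on a worker thread; raise TimeoutError after `timeout` seconds."""
    future = _pool.submit(fn, *args, **kwargs)
    # Future.result raises concurrent.futures.TimeoutError if fn has not finished in time.
    return future.result(timeout=timeout)
```

You would call your tool's slow inner function through this wrapper, e.g. `call_with_timeout(fetch_customer_data, "c-123", timeout=10)`. Note the underlying call is not killed on timeout; it just stops blocking your chain.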
## Other Possible Causes

### 1) A tool call blocks forever

This is common with custom Python tools that do network I/O without timeouts.
```python
# BAD: no timeout; a slow endpoint blocks the whole chain
import requests

def fetch_customer_data(customer_id):
    return requests.get(f"https://api.example.com/customers/{customer_id}").json()
```

```python
# GOOD: bounded wait, explicit error on failure
import requests

def fetch_customer_data(customer_id):
    resp = requests.get(
        f"https://api.example.com/customers/{customer_id}",
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```
### 2) Recursive delegation between agents

Two agents can bounce tasks back and forth indefinitely if both are allowed to delegate.

```python
# BAD: both sides can delegate, so they can loop forever
agent_a = Agent(..., allow_delegation=True)
agent_b = Agent(..., allow_delegation=True)
```

Fix it by allowing delegation on only one side, or disabling it entirely unless you really need it.

```python
# GOOD: only one side can delegate
agent_a = Agent(..., allow_delegation=False)
agent_b = Agent(..., allow_delegation=True)
```

Depending on your CrewAI version, `Agent` also accepts `max_iter` and `max_execution_time` to cap the reasoning loop even when delegation is enabled; check the docs for your installed version.
### 3) Your task prompt is too vague

A vague prompt like "do a full analysis" gives the model no stopping point.

```python
# BAD: no concrete deliverable, no stopping point
Task(
    description="Analyze customer churn risk.",
    expected_output="Good analysis.",
)
```

```python
# GOOD: bounded inputs and an exact deliverable
Task(
    description=(
        "Analyze customer churn risk using only these fields: "
        "tenure, plan_type, support_tickets_last_90d. "
        "Return 5 bullet points and a final risk score from 1 to 10."
    ),
    expected_output="5 bullets + one integer risk score.",
)
```
### 4) Model/provider latency or rate limiting looks like a hang

Sometimes the chain is not stuck at all; your LLM provider is slow or silently retrying behind the scenes.

```python
# BAD: no client-side timeout; a slow provider stalls the chain
llm_config = {
    "model": "gpt-4o-mini",
}
```

Use explicit retries and timeouts at the client layer where possible.

```python
# GOOD: bounded wait on every LLM call
llm_config = {
    "model": "gpt-4o-mini",
    "timeout": 30,
}
```

If you're wrapping an SDK yourself, add logging around every LLM call so you can see whether the stall happens before or after the request leaves your app.
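A minimal sketch of that instrumentation, assuming a `call_llm` function of your own that you wrap (the decorator and placeholder body are illustrative, not a CrewAI or OpenAI API):

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")

def log_llm_calls(fn):
    """Log entry, exit, and elapsed time for every LLM call so stalls are visible."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.info("LLM call starting: %s", fn.__name__)
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
            log.info("LLM call finished in %.2fs", time.monotonic() - start)
            return result
        except Exception:
            log.exception("LLM call failed after %.2fs", time.monotonic() - start)
            raise
    return wrapper

@log_llm_calls
def call_llm(prompt):
    # Placeholder: replace with your real SDK call.
    return f"echo: {prompt}"
```

If the "starting" line prints and the "finished" line never does, the stall is downstream of your app (provider latency, retries, or network), not in your agent logic.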
## How to Debug It

- **Turn on verbose logging**
  - Set `verbose=True` on both `Agent` and `Crew`.
  - Add logs before and after `crew.kickoff()`.
  - If logs stop inside a tool function, the bug is in that tool.
- **Remove tools first**
  - Run the same crew with no tools attached.
  - If it finishes cleanly, one of your tools is blocking.
  - Add tools back one by one until it hangs again.
- **Disable delegation**
  - Set `allow_delegation=False` on every agent.
  - If the hang disappears, you had an agent loop.
  - Re-enable delegation only where needed.
- **Make output deterministic**
  - Change vague tasks into strict deliverables.
  - Require counts, schemas, or bullet limits.
  - If needed, ask for JSON so you can validate completion quickly.
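If you go the JSON route, a small validation helper lets you fail loudly instead of silently accepting a half-finished answer. This is a sketch; the `sources` schema and field names are illustrative assumptions, not a CrewAI format:

```python
import json

def validate_research_output(raw, expected_items=3):
    """Parse crew output as JSON and check it contains the required items."""
    data = json.loads(raw)  # raises ValueError on malformed or truncated output
    items = data["sources"]
    if len(items) != expected_items:
        raise ValueError(f"expected {expected_items} sources, got {len(items)}")
    for item in items:
        # dict.keys() supports set comparison, so this checks required fields.
        if not {"title", "url", "summary"} <= item.keys():
            raise ValueError(f"missing fields in {item}")
    return items
```

Run it on `str(result)` right after `crew.kickoff()`; a `ValueError` here is a far better production signal than a hung worker.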
Example debug wrapper:

```python
print("Starting crew...")
result = crew.kickoff()
print("Crew finished:", result)
```

If `"Starting crew..."` prints but `"Crew finished"` never does, focus on:

- tool timeouts
- delegation loops
- provider latency
## Prevention

- **Set hard timeouts on every external dependency:**
  - HTTP requests
  - browser automation
  - database calls
  - internal APIs
- **Keep tasks narrow:**
  - one task should produce one artifact
  - avoid "analyze everything" prompts
- **Treat delegation as opt-in:**
  - default to `allow_delegation=False`
  - enable it only for workflows that truly need multi-agent handoffs
If you’re running CrewAI in production behind FastAPI, Celery, or Kubernetes jobs, add request-level deadlines too. A stuck chain should fail fast with a useful log line instead of tying up workers indefinitely.
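One way to enforce such a deadline with only the standard library: run the blocking kickoff off the event loop and cancel the wait after a budget. The `run_with_deadline` helper is my own sketch, not a CrewAI or FastAPI API:

```python
import asyncio

async def run_with_deadline(blocking_fn, *args, deadline_s=60):
    """Run a blocking crew invocation on a thread; stop waiting after deadline_s seconds."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(blocking_fn, *args),  # Python 3.9+
            timeout=deadline_s,
        )
    except asyncio.TimeoutError:
        # Fail fast with a useful signal instead of tying up the worker.
        raise TimeoutError(f"crew did not finish within {deadline_s}s")

# In a FastAPI handler you might await:
#     await run_with_deadline(crew.kickoff, deadline_s=120)
```

As with the thread-pool wrapper earlier, the timed-out call is not killed, only abandoned, so pair this with tool-level timeouts for true cleanup.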
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies

By Cyprian Aarons, AI Consultant at Topiax.