How to Fix 'chain execution stuck when scaling' in CrewAI (Python)
When a CrewAI chain appears stuck as you scale, it usually means your agent workflow is waiting on something that never resolves: a tool call, a task dependency, or a runaway loop in the chain. You’ll see it more often when you move from a single local run to multiple tasks, more agents, or async execution.
In practice, this is rarely a CrewAI bug. It’s usually a bad task graph, blocking code inside a tool, or an agent that keeps re-entering the same step.
The Most Common Cause
The #1 cause is a task or tool that blocks forever, usually because of one of these patterns:
- recursive tool calls
- an external API request with no timeout
- an agent waiting on another agent/task that never returns
- using `kickoff()` inside a tool or task instead of returning data
Here’s the broken pattern I see most often.
| Broken | Fixed |
|---|---|
| A tool calls another crew synchronously and waits forever | The tool returns data; orchestration happens outside the tool |
```python
# BROKEN: nested kickoff inside a tool can deadlock the chain
from crewai import Agent, Task, Crew, Process
from crewai.tools import BaseTool

class LookupAndResearchTool(BaseTool):
    name: str = "lookup_and_research"
    description: str = "Looks up a customer and runs research"

    def _run(self, customer_id: str) -> str:
        # Bad: starting another crew from inside a tool
        research_agent = Agent(
            role="Researcher",
            goal="Research customer context",
            backstory="You research records",
        )
        research_task = Task(
            description=f"Research customer {customer_id}",
            expected_output="A short research summary",
            agent=research_agent,
        )
        nested_crew = Crew(
            agents=[research_agent],
            tasks=[research_task],
            process=Process.sequential,
        )
        return nested_crew.kickoff()  # can hang when scaled

agent = Agent(
    role="Ops Agent",
    goal="Resolve customer issues",
    backstory="You handle support workflows",
    tools=[LookupAndResearchTool()],
)
```
```python
# FIXED: keep tools pure; orchestrate crews at the top level
from crewai import Agent, Task, Crew, Process
from crewai.tools import BaseTool

class LookupCustomerTool(BaseTool):
    name: str = "lookup_customer"
    description: str = "Looks up customer data"

    def _run(self, customer_id: str) -> str:
        # Return data only. No nested crews.
        return f"customer={customer_id}, status=active"

ops_agent = Agent(
    role="Ops Agent",
    goal="Resolve customer issues",
    backstory="You handle support workflows",
    tools=[LookupCustomerTool()],
)

research_agent = Agent(
    role="Researcher",
    goal="Research customer context",
    backstory="You research records",
)

task1 = Task(
    description="Look up the customer record for ID 12345.",
    expected_output="The raw customer record",
    agent=ops_agent,
)

task2 = Task(
    description="Summarize the customer context from the lookup result.",
    expected_output="A concise customer context summary",
    agent=research_agent,
    context=[task1],
)

crew = Crew(
    agents=[ops_agent, research_agent],
    tasks=[task1, task2],
    process=Process.sequential,
)

result = crew.kickoff()
```
Why this breaks under scale:
- nested crews multiply latency and stack depth
- one blocked call stalls every downstream task
- retries can create duplicate work and make the “stuck” behavior look random
If you’re using external APIs in tools, add explicit timeouts:
```python
import requests

def _run(self, customer_id: str) -> str:
    r = requests.get(
        f"https://api.example.com/customers/{customer_id}",
        timeout=10,  # seconds; fail fast instead of hanging the whole chain
    )
    r.raise_for_status()
    return r.text
```
Other Possible Causes
1) Circular task dependencies
If Task B depends on Task A, and Task A also depends on Task B, CrewAI can’t resolve the chain.
```python
task_a = Task(description="Summarize policy", agent=agent_a)
task_b = Task(description="Review summary", agent=agent_b, context=[task_a])
# BAD if you later wire task_a to depend on task_b via context or prompt logic
```
Fix it by making dependencies one-way only.
```python
task_a = Task(description="Summarize policy", agent=agent_a)
task_b = Task(description="Review summary", agent=agent_b, context=[task_a])
```
2) Overly large context causing model stalls
When you pass huge outputs between tasks, the LLM can slow down hard or stop producing useful output.
```python
# BAD: dumping an entire PDF extraction into context
task_b = Task(
    description="Analyze claim notes",
    agent=agent_b,
    context=[very_large_task_output],
)
```
Trim it before passing forward.
```python
# GOOD: summarize first, then pass compact context
summary_task = Task(
    description="Summarize claim notes into 10 bullet points",
    agent=summarizer_agent,
)
analysis_task = Task(
    description="Analyze claim risk from summary",
    agent=analyst_agent,
    context=[summary_task],
)
```
3) Async code without proper awaiting
If you mix async tools with sync orchestration incorrectly, tasks may appear stuck.
```python
# BAD: coroutine created but not awaited properly in your wrapper code
async def fetch_data():
    ...
```
Use one execution model consistently. If your app is sync, keep tools sync. If you go async, await everything end-to-end.
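As a contrast to the broken pattern, here is a minimal end-to-end async sketch (`fetch_data` and its payload are hypothetical stand-ins, not CrewAI APIs): every coroutine is awaited, and the entry point runs through `asyncio.run()`.

```python
import asyncio

async def fetch_data(customer_id: str) -> str:
    # Stands in for a real non-blocking I/O call (e.g. an async HTTP client).
    await asyncio.sleep(0.01)
    return f"customer={customer_id}, status=active"

async def main() -> str:
    # GOOD: the coroutine is awaited, so the event loop can actually run it.
    # Calling fetch_data("12345") without await would just create a coroutine
    # object that never executes -- which looks exactly like a stuck chain.
    return await fetch_data("12345")

result = asyncio.run(main())
```

If your wrapper code can't be made fully async, keep the tool synchronous instead; a half-converted pipeline is the worst of both worlds.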
4) Infinite reasoning loops from loose prompts
Agents with vague goals sometimes keep refining instead of finishing.
```python
agent = Agent(
    role="Analyst",
    goal="Keep improving the answer until perfect",  # BAD
)
```
Make the termination condition explicit.
```python
agent = Agent(
    role="Analyst",
    goal="Produce one final risk summary in under 200 words",
)
```
How to Debug It
1) Reduce to one agent and one task
- Remove all tools.
- Run `Crew(process=Process.sequential)` with a single `Task`.
- If it works there, the issue is in orchestration or tooling.
2) Add logging around every tool call
- Print before and after each `_run()`.
- If you see “before” but not “after”, that tool is blocking.
```python
class DebugTool(BaseTool):
    name: str = "debug_tool"
    description: str = "Logs entry and exit around the real work"

    def _run(self, query: str) -> str:
        print(f"DEBUG start query={query}")
        result = do_work(query)  # your actual tool logic
        print("DEBUG end")
        return result
```
3) Check for circular context
- Inspect every `Task(context=[...])`.
- Make sure no task depends on its own output indirectly.
- Remove any prompt text that tells an agent to “wait for” another step unless that step is actually wired in.
4) Set hard timeouts and inspect stack traces
- Wrap network calls with timeouts.
- Use Python’s `traceback` plus CrewAI logs to find where execution stops.
- If you see repeated retries or repeated LLM calls with no final output, you likely have a loop.
Prevention
- Keep tools pure: tools fetch/transform data; crews orchestrate work
- Add timeouts to every external request used by an agent or tool
- Keep task graphs acyclic and simple: one direction only
- Cap output size before passing results into downstream tasks
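The acyclic rule is easy to enforce mechanically before kickoff. A minimal sketch with a lightweight stand-in for tasks (`TaskSpec` and `assert_acyclic` are hypothetical helpers of mine, not CrewAI APIs; with real `Task` objects you would walk their `context` lists the same way):

```python
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    # Lightweight stand-in for a Task: a name plus its context dependencies.
    name: str
    context: list = field(default_factory=list)

def assert_acyclic(tasks: list) -> None:
    # Depth-first search over context edges; raises on the first cycle found.
    VISITING, DONE = 1, 2
    state: dict = {}

    def visit(task):
        if state.get(id(task)) == VISITING:
            raise ValueError(f"circular context involving task {task.name!r}")
        if state.get(id(task)) == DONE:
            return
        state[id(task)] = VISITING
        for dep in task.context:
            visit(dep)
        state[id(task)] = DONE

    for t in tasks:
        visit(t)

a = TaskSpec("summarize")
b = TaskSpec("review", context=[a])
assert_acyclic([a, b])   # one-way dependency: passes
a.context.append(b)      # wiring a back onto b makes it cyclic: now raises
```

Running this check in CI or at startup turns a mysterious hang into an immediate, named error.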
If you’re building this for production workflows in banking or insurance, treat every tool as untrusted latency. One blocked HTTP call or one recursive orchestration path is enough to freeze an entire chain once concurrency increases.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.