How to Fix 'chain execution stuck when scaling' in AutoGen (Python)

By Cyprian Aarons · Updated 2026-04-21

What this error usually means

chain execution stuck when scaling in AutoGen usually means your agent graph is waiting on a step that never completes, or the runtime can’t make progress because one of the agents keeps re-entering the same path. In practice, it shows up when you move from a single happy-path demo to multiple agents, nested chats, tool calls, or async execution.

The symptom is often not a hard crash. You’ll see the conversation freeze, repeated tool invocations, or logs that stop after something like AssistantAgent or GroupChatManager starts a turn and never finishes.

The Most Common Cause

The #1 cause is an infinite or self-reinforcing handoff loop between agents. In AutoGen, this usually happens when your termination condition is missing or your speaker selection logic keeps routing back to the same agent.

A common broken pattern is a group chat where agents can keep passing control forever:

# Broken: no real termination condition, agents can loop forever
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

llm_config = {"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}

assistant = AssistantAgent(
    name="assistant",
    llm_config=llm_config,
)

coder = AssistantAgent(
    name="coder",
    llm_config=llm_config,
)

user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
)

groupchat = GroupChat(
    agents=[user, assistant, coder],
    messages=[],
    max_round=50,
)

manager = GroupChatManager(groupchat=groupchat, llm_config=llm_config)

user.initiate_chat(manager, message="Build me a retry helper")

The problem here is simple: nothing tells the chat when the task is done. The max_round=50 cap is the only backstop, so if both assistants keep producing “next steps,” the manager keeps selecting speakers for all 50 rounds and the chain looks stuck.

The fixed version adds explicit termination and narrows who can speak:

# Fixed: explicit stop condition and tighter routing
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

llm_config = {"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}

assistant = AssistantAgent(name="assistant", llm_config=llm_config)
coder = AssistantAgent(name="coder", llm_config=llm_config)

user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    code_execution_config=False,  # this example doesn't need local code execution
)

def is_termination_msg(msg):
    content = msg.get("content") or ""  # content can be None (e.g. tool calls)
    return "TERMINATE" in content.upper()

groupchat = GroupChat(
    agents=[user, assistant, coder],
    messages=[],
    max_round=10,
    speaker_selection_method="round_robin",
)

manager = GroupChatManager(
    groupchat=groupchat,
    llm_config=llm_config,
    is_termination_msg=is_termination_msg,  # actually wire up the stop condition
)

user.initiate_chat(manager, message="Build me a retry helper. Reply TERMINATE when done.")

For production systems, I prefer one of these patterns:

  • Explicit termination token like TERMINATE
  • Hard max_round cap
  • Deterministic speaker selection instead of free-form auto-routing

If you’re using GroupChatManager with auto speaker selection and no stop rule, scaling just makes the loop easier to trigger.
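If you need routing smarter than round_robin but still deterministic, recent pyautogen versions also accept a callable as speaker_selection_method, with signature (last_speaker, groupchat). Here is a minimal sketch of a fixed pipeline, reusing the agent names from the example above; verify the return-value semantics (returning None ends the chat) against your installed version:

```python
# Sketch: deterministic pipeline routing via a custom speaker-selection
# callable, passed as speaker_selection_method to GroupChat.

def select_next_speaker(last_speaker, groupchat):
    """Route user -> assistant -> coder -> user, in a fixed cycle."""
    order = ["user", "assistant", "coder"]
    try:
        i = order.index(last_speaker.name)
    except ValueError:
        return None  # unknown speaker: end the chat rather than loop
    return groupchat.agent_by_name(order[(i + 1) % len(order)])

# groupchat = GroupChat(..., speaker_selection_method=select_next_speaker)
```

Because the routing is a plain function, you can unit-test it without ever starting a model call.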

Other Possible Causes

1. Blocking tool calls inside an async flow

If a tool blocks the event loop, AutoGen appears frozen even though it’s actually waiting on I/O.

# Broken: blocks the event loop for the entire sleep
import time

def slow_tool(query: str):
    time.sleep(30)
    return f"Result for {query}"

# Fixed: yields control back to the event loop while waiting
import asyncio

async def slow_tool(query: str):
    await asyncio.sleep(30)
    return f"Result for {query}"

If you’re using asyncio, every long-running tool should be async-friendly.
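If the slow work is a library call you cannot rewrite as async, run it in a worker thread instead. A minimal sketch using asyncio.to_thread (Python 3.9+); blocking_fetch is a stand-in for whatever blocking call you actually have:

```python
import asyncio
import time

def blocking_fetch(query: str) -> str:
    # Stand-in for a blocking call you can't rewrite (DB driver, requests, ...)
    time.sleep(0.1)
    return f"Result for {query}"

async def slow_tool(query: str) -> str:
    # Runs the blocking function in a worker thread, so the event loop
    # (and the rest of the agent chain) keeps making progress.
    return await asyncio.to_thread(blocking_fetch, query)
```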

2. Recursive agent callbacks

An agent that calls back into the same manager or triggers another chat from inside its own response handler can deadlock the chain.

# Broken pattern
def on_message(msg):
    return manager.run_chat(msg["content"])  # re-enters same flow

Instead, separate orchestration from response generation:

# Fixed pattern
def on_message(msg):
    return {"next_action": "summarize", "content": msg["content"]}
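One way to keep that separation honest is a small dispatcher that consumes the action dict outside the handler. A sketch, with illustrative action names:

```python
def on_message(msg: dict) -> dict:
    # The handler only describes what should happen next; it never
    # re-enters the manager or starts another chat itself.
    return {"next_action": "summarize", "content": msg["content"]}

def dispatch(action: dict) -> str:
    # Orchestration lives here, outside any agent callback.
    handlers = {
        "summarize": lambda text: f"summary of: {text}",
        "stop": lambda text: "done",
    }
    handler = handlers.get(action["next_action"])
    if handler is None:
        raise ValueError(f"unknown action: {action['next_action']!r}")
    return handler(action["content"])
```

If the handler ever needs to trigger another chat, it emits an action and the dispatcher decides; the call stack never re-enters the same flow.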

3. Misconfigured human_input_mode

If UserProxyAgent waits for input in a non-interactive environment, execution stalls with no obvious exception.

# Broken in CI / server runtime
user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="ALWAYS",
)

Use non-interactive mode for automation:

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
)
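If the same script runs both locally and in automation, pick the mode at startup instead of hard-coding it. A small sketch; the CI environment variable is an assumption about your platform:

```python
import os
import sys

def pick_input_mode() -> str:
    # Non-interactive if we're in CI or stdin isn't a terminal.
    if os.environ.get("CI") or not sys.stdin.isatty():
        return "NEVER"
    return "ALWAYS"

# user_proxy = UserProxyAgent(name="user_proxy", human_input_mode=pick_input_mode())
```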

4. Context explosion from oversized message history

When scaling to more agents, message history grows fast. Large histories can slow model calls until they look stuck.

| Problem | Bad setting | Better setting |
| --- | --- | --- |
| Unlimited context growth | full history passed every turn | trimmed context |
| Too many rounds | max_round=100 | max_round=10-20 |
| No summarization | raw transcript only | summarize between phases |

A practical fix is to summarize before handing off to another agent:

summary_prompt = "Summarize this conversation in 5 bullets."
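A minimal sketch of trimming before handoff: keep the recent turns verbatim and collapse everything older into one synthetic message. The summarizer below is a stub; in practice you would send summary_prompt plus the older messages to the model:

```python
def summarize(messages: list[dict]) -> str:
    # Stub: replace with a real model call using summary_prompt.
    return f"[{len(messages)} earlier messages summarized]"

def trim_history(messages: list[dict], keep_last: int = 6) -> list[dict]:
    # Short histories pass through untouched; long ones get a
    # summary message followed by the most recent turns.
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    return [{"role": "system", "content": summarize(older)}] + recent
```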

How to Debug It

  1. Turn on verbose logging

    • Look for where execution stops.
    • You want to identify whether it hangs during model inference, tool execution, or speaker selection.
    • In many cases you’ll see repeated lines around GroupChatManager or AssistantAgent.
  2. Reduce to two agents

    • Remove every extra agent except one assistant and one user proxy.
    • If the issue disappears, your bug is in routing or termination logic.
    • If it still hangs, inspect tools and async behavior next.
  3. Disable tools temporarily

    • Comment out all registered functions.
    • If the chain starts moving again, one tool is blocking or recursively calling back into AutoGen.
  4. Set hard limits

    • Add max_round, timeouts around API calls, and explicit stop tokens.
    • If a limit triggers instead of a freeze, you’ve confirmed an infinite loop rather than a network issue.

Example diagnostic config:

groupchat = GroupChat(
    agents=[user_proxy, assistant],
    messages=[],
    max_round=6,
)
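To turn a silent freeze into a loud failure, wrap suspect synchronous calls in a hard timeout. A sketch using only the standard library; note that a hung worker thread is abandoned, not killed:

```python
import concurrent.futures

def call_with_timeout(fn, *args, timeout: float = 30.0, **kwargs):
    # Raises concurrent.futures.TimeoutError if fn doesn't finish in time,
    # so a hang shows up in your logs instead of freezing the chain.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, *args, **kwargs).result(timeout=timeout)
    finally:
        pool.shutdown(wait=False)  # don't block on a hung worker
```

Wrap each model or tool call in this during debugging; if TimeoutError fires at a consistent step, you have found where the chain stalls.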

Prevention

  • Always define an explicit termination rule for multi-agent workflows.
  • Keep tools async-safe and avoid blocking calls like time.sleep() inside request paths.
  • Start with deterministic routing; only use dynamic speaker selection after you’ve proven the flow terminates.

If you’re building with AutoGen at scale, treat every agent handoff like production workflow orchestration. No termination rule plus loose routing is how you get “stuck” chains that only show up under load.



By Cyprian Aarons, AI Consultant at Topiax.
