How to Fix 'deployment crash when scaling' in AutoGen (Python)

By Cyprian Aarons · Updated 2026-04-21

When AutoGen crashes on deployment while scaling, it usually means one of your agent processes is starting fine in local mode but fails once multiple workers, containers, or replicas are involved. In practice, this shows up when stateful agent objects, non-serializable configs, or missing runtime dependencies get pushed into an environment that expects clean process startup.

The error often appears during container boot, autoscaling, or when a queue worker tries to instantiate AssistantAgent, UserProxyAgent, or GroupChatManager more than once. The fix is usually not in the model call itself — it’s in how you initialize and share state across processes.

The Most Common Cause

The #1 cause is creating AutoGen agents at import time or keeping them as global singletons in a scaled environment. That works in a single Python process, but breaks when Gunicorn, Kubernetes, Celery, or multiple Uvicorn workers try to fork or serialize the app.

Typical symptoms include:

  • RuntimeError: Event loop is closed
  • TypeError: cannot pickle '_thread.lock' object
  • ValueError: I/O operation on closed file
  • autogen.exception.InvalidConfigError
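The pickling symptom in particular is easy to reproduce without AutoGen at all: any object that holds a thread lock internally (as many agent and client objects do) cannot cross a fork or serialization boundary. A minimal sketch with a stand-in class, not a real AutoGen agent:

```python
import pickle
import threading

class AgentLike:
    """Stand-in for an agent/client object that holds a lock internally."""
    def __init__(self):
        self._lock = threading.Lock()

agent = AgentLike()

try:
    # This is what Celery / multiprocessing does with task payloads.
    pickle.dumps(agent)
except TypeError as err:
    print(err)  # cannot pickle '_thread.lock' object
```

Anything you keep at module scope in a scaled deployment should survive this round trip.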

Broken vs fixed pattern

Broken pattern                                      Fixed pattern
Instantiate agents globally at module import        Create agents inside a factory per request/job
Reuse one chat manager across workers               Build a fresh runtime per process
Store open sockets / file handles in agent config   Keep config JSON-serializable
# broken.py
from autogen import AssistantAgent, UserProxyAgent

llm_config = {
    "model": "gpt-4o-mini",
    "api_key": open("/etc/secrets/openai_key.txt").read().strip(),  # bad: unclosed file handle read at import time
}

assistant = AssistantAgent(
    name="assistant",
    llm_config=llm_config,
)

user_proxy = UserProxyAgent(
    name="user_proxy",
    code_execution_config={"work_dir": "tmp"},
)

# This gets imported once locally, but can explode under multi-worker scaling.

# fixed.py
import os

from autogen import AssistantAgent, UserProxyAgent

def build_agents():
    llm_config = {
        "model": "gpt-4o-mini",
        "api_key": os.environ["OPENAI_API_KEY"],
    }

    assistant = AssistantAgent(
        name="assistant",
        llm_config=llm_config,
    )

    user_proxy = UserProxyAgent(
        name="user_proxy",
        code_execution_config={"work_dir": "/tmp/autogen"},
    )

    return assistant, user_proxy

If you’re using FastAPI or a worker queue, call build_agents() inside the request handler or job function. Do not cache live agent instances across process boundaries.

Other Possible Causes

1) Non-serializable objects in config

AutoGen configs are often passed through JSON, env vars, or task payloads. If you stuff a Python object into llm_config or code_execution_config, scaling will fail when the worker tries to reload it.

# bad
llm_config = {
    "model": "gpt-4o-mini",
    "client": some_openai_client_object,
}

# good
llm_config = {
    "model": "gpt-4o-mini",
    "api_key": os.environ["OPENAI_API_KEY"],
}

2) Shared working directory collisions

If every replica writes to the same temp path, one worker can delete files another worker still needs. This shows up with code execution and tool use.

# risky
code_execution_config={"work_dir": "/tmp/autogen"}

Use a unique work dir per run:

import tempfile

code_execution_config={
    "work_dir": tempfile.mkdtemp(prefix="autogen_")
}
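Note that mkdtemp does not clean up after itself, so long-lived replicas can slowly fill the disk. If each run is scoped, tempfile.TemporaryDirectory handles removal too (a sketch; the actual agent conversation is elided):

```python
import os
import tempfile

with tempfile.TemporaryDirectory(prefix="autogen_") as work_dir:
    code_execution_config = {"work_dir": work_dir}
    # ... build agents and run the conversation here ...
    print(os.path.isdir(work_dir))  # True while the run is active
# The directory and its contents are removed automatically on exit.
```

Each replica gets its own directory, and nothing is left behind when the run ends.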

3) Missing system dependencies in the deployment image

A local machine may have Python packages installed that your container does not. AutoGen code execution can fail with messages like:

  • FileNotFoundError: [Errno 2] No such file or directory: 'python'
  • subprocess.CalledProcessError
  • docker.errors.DockerException

If your agents execute code, make sure the image includes what the runtime expects:

RUN apt-get update && apt-get install -y python3 python3-pip git \
    && pip3 install -U pyautogen docker

4) Model client initialization happening before env vars are loaded

A common deployment bug is reading secrets too early. If OPENAI_API_KEY is injected at runtime by the platform, importing the module before env injection gives you:

  • KeyError: 'OPENAI_API_KEY'
  • autogen.exception.InvalidConfigError: api_key must be provided

Bad:

API_KEY = os.environ["OPENAI_API_KEY"]  # import-time failure in some deploys

Better:

def get_llm_config():
    return {
        "model": os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
        "api_key": os.environ["OPENAI_API_KEY"],
    }
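Because get_llm_config() reads the environment at call time, it picks up secrets the platform injects after import. A quick sketch of that behavior (env var names as above; the injected value is a placeholder):

```python
import os

def get_llm_config():
    return {
        "model": os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
        "api_key": os.environ["OPENAI_API_KEY"],
    }

# Simulate the platform injecting the secret after the module was imported:
os.environ.pop("OPENAI_MODEL", None)
os.environ["OPENAI_API_KEY"] = "injected-at-runtime"

config = get_llm_config()
print(config["model"])  # gpt-4o-mini (the default, since OPENAI_MODEL is unset)
```

Had API_KEY been read at import time instead, the first line of the simulation would already be too late.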

How to Debug It

  1. Check whether the crash happens on import or on first request

    • If your app dies before serving traffic, look for global agent initialization.
    • If it dies only when scaled replicas receive traffic, suspect shared state or temp directory collisions.
  2. Read the exact traceback for serialization and subprocess clues

    • Search for:
      • cannot pickle
      • Event loop is closed
      • InvalidConfigError
      • CalledProcessError
    • These usually point directly to config shape or execution environment issues.
  3. Run one worker locally and then multiple workers

    • Compare:
      uvicorn app:app --workers 1
      uvicorn app:app --workers 4
      
    • If only multi-worker crashes, your issue is almost always global state or fork-safety.
  4. Log agent construction separately from task execution

    • Add logs around:
      • loading secrets
      • building AssistantAgent
      • building UserProxyAgent
      • creating chat managers
    • If construction fails before any message exchange, focus on startup and config validation.
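A rough logging harness for those construction phases might look like this (a sketch; the real AssistantAgent/UserProxyAgent calls go where the placeholders are):

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("startup")

@contextmanager
def phase(name: str):
    """Log entry, exit, and failure for one startup phase."""
    log.info("start: %s", name)
    t0 = time.monotonic()
    try:
        yield
    except Exception:
        log.exception("failed: %s", name)
        raise
    log.info("done: %s (%.3fs)", name, time.monotonic() - t0)

with phase("load secrets"):
    pass  # read env vars / secret files here
with phase("build AssistantAgent"):
    pass  # AssistantAgent(...) here
with phase("build UserProxyAgent"):
    pass  # UserProxyAgent(...) here
```

The last "start:" line in the logs tells you exactly which phase died, before any model call is involved.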

Prevention

  • Build AutoGen agents inside functions, not at module import time.
  • Keep all configs JSON-friendly: strings, numbers, booleans, lists, dicts.
  • Use unique temp directories and isolate per-request execution state.
  • Test both single-worker and multi-worker deployment modes before shipping.
  • Pin AutoGen and runtime dependencies in your container image so prod matches local behavior.

If you’re seeing this error specifically after scaling from one replica to many, assume shared state first. In AutoGen Python deployments, that’s where most of these crashes start.



By Cyprian Aarons, AI Consultant at Topiax.
