# How to Fix 'deployment crash in production' in AutoGen (Python)
When an AutoGen app crashes "in production," it usually means the agent process died during a model call or tool execution. In practice, this shows up when your Python app is running in a real environment and something in the agent runtime, model config, or tool layer is invalid.
The key thing: this is usually not an AutoGen “bug” by itself. It’s almost always a bad deployment config, a mismatched package version, or code that works locally but fails once it hits production constraints.
## The Most Common Cause
The #1 cause is a bad LLM configuration in AssistantAgent or config_list_from_json(). In production, the agent starts fine, then crashes when it tries to resolve the model client because the config is missing a valid provider, API key, or deployment name.
Here’s the broken pattern versus the fixed one.
| Broken | Fixed |
|---|---|
| Uses an empty or incomplete config list | Passes a valid model config with required fields |
| Assumes env vars exist in production | Loads and validates env vars before agent startup |
| Crashes during first LLM call | Fails fast with explicit validation |
```python
# BROKEN
from autogen import AssistantAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={
        "config_list": [
            {
                "model": "gpt-4o-mini"
                # missing api_key / base_url / api_type depending on provider
            }
        ]
    },
)

# This often dies at runtime with errors like:
#   openai.AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided'}}
# or:
#   ValueError: No config found for model client
```
```python
# FIXED
import os

from autogen import AssistantAgent

# Fail fast at startup instead of crashing on the first model call.
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set")

assistant = AssistantAgent(
    name="assistant",
    llm_config={
        "config_list": [
            {
                "model": "gpt-4o-mini",
                "api_key": api_key,
            }
        ]
    },
)
```
If you’re using Azure OpenAI, the failure mode changes slightly. You’ll often see:
- `openai.BadRequestError: Error code: 404 - Resource not found`
- `ValueError: Missing required field 'base_url'`
- `AuthenticationError` from a wrong `api_version` or deployment name
For Azure, make sure the deployment name matches what you created in the portal, not the base model name.
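One way to catch this early is to validate the Azure entry before any agent is created. Here's a minimal sketch; the deployment name, resource URL, and `validate_azure_entry` helper are placeholders of my own, not AutoGen APIs:

```python
import os

def validate_azure_entry(entry: dict) -> list[str]:
    """Return the required Azure fields that are missing or empty."""
    required = ["model", "api_type", "api_key", "base_url", "api_version"]
    return [f for f in required if not entry.get(f)]

azure_entry = {
    "model": "my-gpt4o-deployment",            # the Azure *deployment* name
    "api_type": "azure",
    "api_key": os.getenv("AZURE_OPENAI_API_KEY", ""),
    "base_url": "https://my-resource.openai.azure.com/",
    "api_version": "2024-06-01",
}

missing = validate_azure_entry(azure_entry)
if missing:
    # In a real app you would raise here; print keeps this sketch runnable.
    print(f"Refusing to start: missing Azure fields {missing}")
```

Running this check before `AssistantAgent(...)` turns a cryptic mid-conversation 404 into an explicit startup failure.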
## Other Possible Causes

### 1. Tool function throws an unhandled exception
If your agent calls a Python tool and that tool crashes, AutoGen can bubble it up as a runtime failure that looks like a deployment issue.
```python
# BROKEN
def lookup_policy(policy_id: str):
    return db["policies"][policy_id]  # raises KeyError if the ID is missing

# FIXED
def lookup_policy(policy_id: str):
    policy = db["policies"].get(policy_id)
    if not policy:
        return {"error": f"Policy {policy_id} not found"}
    return policy
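If you expose many tools, fixing each one by hand gets tedious. A small decorator can convert any unhandled exception into a structured error the agent can read instead of killing the process. This is a sketch of my own, not an AutoGen feature, and the in-memory `db` stands in for a real data store:

```python
import functools

def safe_tool(fn):
    """Wrap a tool so an unhandled exception becomes a structured
    error result instead of crashing the agent process."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            return {
                "error": f"{type(exc).__name__}: {exc}",
                "tool": fn.__name__,
            }
    return wrapper

@safe_tool
def lookup_policy(policy_id: str):
    # Hypothetical in-memory "db" standing in for your real data store.
    db = {"policies": {"P-1": {"id": "P-1", "limit": 100000}}}
    return db["policies"][policy_id]  # KeyError for unknown IDs
```

The error dict flows back to the model as the tool result, so the agent can recover or apologize instead of the deployment dying.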
### 2. Package version mismatch
AutoGen changes quickly. A local environment with one version and production with another can produce class/serialization errors.
Common symptoms:
- `TypeError: __init__() got an unexpected keyword argument ...`
- `ImportError: cannot import name 'AssistantAgent'`
- `AttributeError` on newer agent APIs
Pin versions explicitly. Note that the `from autogen import AssistantAgent` API used in this article is the 0.2 line, published on PyPI as `pyautogen`; the newer `autogen-agentchat` / `autogen-core` packages (0.4+) expose a different API and are not drop-in replacements. For example:

```text
pyautogen==0.2.37
openai==1.40.6
```
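If you want the process to fail loudly when the deployed image drifts from your pins, a small startup check helps. This is a sketch using only the standard library; the package names and versions in `EXPECTED` are examples, so use whatever you actually pin:

```python
from importlib.metadata import PackageNotFoundError, version

# Example pins; mirror the ones in your requirements/lock file.
EXPECTED = {"pyautogen": "0.2.37", "openai": "1.40.6"}

def check_pins(expected: dict) -> list[str]:
    """Return human-readable mismatches between installed and pinned versions."""
    problems = []
    for pkg, want in expected.items():
        try:
            got = version(pkg)
        except PackageNotFoundError:
            problems.append(f"{pkg} not installed (want {want})")
            continue
        if got != want:
            problems.append(f"{pkg}=={got}, expected {want}")
    return problems

for problem in check_pins(EXPECTED):
    print("version drift:", problem)
```

Call it once at process startup, before importing any agent code, and treat a non-empty result as a fatal error.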
### 3. Missing network access or blocked outbound traffic
In production, your container may not reach OpenAI/Azure endpoints even though everything works on your laptop.
Quick connectivity check:

```bash
curl https://api.openai.com/v1/models
```
If this fails in the pod or VM, AutoGen will eventually fail with connection-related exceptions like:
- `httpx.ConnectError`
- `openai.APIConnectionError`
- timeout errors during agent chat
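You can run the same preflight check from inside the Python process before any agent starts. A sketch using only the standard library; swap in your Azure resource host if that is your endpoint:

```python
import socket

def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Preflight check: can this container open a TCP connection to
    the model provider? Run it at startup, before creating agents."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False

print("api.openai.com reachable:", can_reach("api.openai.com"))
```

In a locked-down pod this fails in seconds with a clear answer, instead of AutoGen retrying and timing out mid-conversation.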
### 4. Context window overload
A long-running conversation can crash or fail once token usage gets too large, especially when agents keep appending full transcripts.
Bad pattern:

```python
messages = messages + [new_message]  # unbounded growth
```

Better pattern:

```python
messages = messages[-20:] + [new_message]  # keep only recent turns
```
You may also see provider-side errors like:
- `BadRequestError: This model's maximum context length is...`
- repeated retries followed by task failure
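Beyond keeping a fixed number of recent turns, you can trim by an approximate size budget while always preserving the system message. A rough sketch, using characters as a crude proxy for tokens; `trim_history` is a hypothetical helper of mine, not an AutoGen API:

```python
def trim_history(messages: list, max_chars: int = 24000) -> list:
    """Keep the system message plus the most recent turns that fit a
    rough character budget (roughly 4 chars per token for English)."""
    system = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    kept, used = [], 0
    for msg in reversed(rest):  # walk from newest to oldest
        size = len(msg.get("content") or "")
        if used + size > max_chars:
            break
        kept.append(msg)
        used += size
    return system + list(reversed(kept))
```

For precise budgets you would count real tokens with your provider's tokenizer, but even this crude cap prevents unbounded transcript growth.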
## How to Debug It

1. Check the exact exception stack trace.
   - Don't stop at "deployment crash in production."
   - Look for the first real error below AutoGen wrappers like `GroupChatManager`, `AssistantAgent`, or `OpenAIWrapper`.
2. Validate your model config before starting agents.
   - Print the resolved config at startup.
   - Confirm `api_key`, `model`, and provider-specific fields are present.
   - If using Azure, verify `base_url`, `api_version`, and the deployment name.
3. Run the same code path locally inside Docker.
   - Production-only failures often come from missing env vars or network restrictions.
   - Reproduce with the same image, same environment variables, and same Python version.
4. Isolate tools from LLM calls.
   - Temporarily disable tools and run only one assistant message.
   - If it works without tools, your crash is likely inside a function-call handler rather than in AutoGen itself.
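Printing the resolved config safely is easier with a tiny helper that redacts secrets first. This is a sketch of my own (`debug_print_config` is a hypothetical name, not part of AutoGen):

```python
def debug_print_config(llm_config: dict) -> list:
    """Print each resolved config entry with key-like fields redacted,
    so startup logs confirm what the agent will actually use without
    leaking credentials."""
    lines = []
    for i, entry in enumerate(llm_config.get("config_list", [])):
        shown = {
            k: ("***" if "key" in k.lower() and v else v)
            for k, v in entry.items()
        }
        lines.append(f"config[{i}]: {shown}")
    for line in lines:
        print(line)
    return lines
```

Run it right after building `llm_config` and before constructing any agents; an empty or incomplete printout is your crash explanation.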
## Prevention

- Pin all dependencies and deploy from a lockfile.
- Validate environment variables at process startup, before creating any AutoGen agents.
- Add defensive error handling around every tool function exposed to agents.
- Keep conversation history bounded; don't let transcripts grow forever.
- Test the exact container image you ship, not just local Python runs.
If you want a quick rule of thumb: when AutoGen crashes “in production,” assume configuration first, code second, infrastructure third. In most cases I’ve seen, fixing the model config or tool exception removes the failure immediately.
By Cyprian Aarons, AI Consultant at Topiax.