# How to Fix Intermittent 500 Errors in Production in AutoGen (Python)
Intermittent 500 errors in AutoGen usually mean your agent pipeline is failing on some requests but not all. In practice, this shows up when a tool call, model response, or message payload occasionally violates an assumption in your code or in the AutoGen runtime.
The key word is intermittent. That usually points to state leakage, nondeterministic tool behavior, rate limits, or malformed messages that only happen for certain inputs.
## The Most Common Cause — shared mutable state in agent or tool code
The #1 cause I see is a tool function or callback mutating shared state across requests. In AutoGen, that often surfaces as a generic server-side failure after an exception like:
- `TypeError: Object of type datetime is not JSON serializable`
- `KeyError: 'content'`
- `AttributeError: 'NoneType' object has no attribute ...`
- `autogen.exception.InvalidChatMessageError`
If the same agent instance is reused across requests and it keeps mutable globals, one bad turn can poison the next one.
### Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Reuses shared mutable state | Creates request-scoped state |
| Returns non-serializable objects | Returns plain JSON-safe primitives |
| Lets exceptions bubble into 500s | Catches and converts to explicit tool errors |
```python
# BROKEN
from datetime import datetime

from autogen import AssistantAgent

history = []  # shared across requests

def lookup_customer(customer_id):
    # returns a datetime object -> often breaks serialization later
    return {"customer_id": customer_id, "last_seen": datetime.utcnow()}

agent = AssistantAgent(
    name="support_agent",
    llm_config={"config_list": [{"model": "gpt-4o-mini", "api_key": "..."}]},
)

def handle_request(customer_id):
    result = lookup_customer(customer_id)
    history.append(result)  # shared mutation
    return agent.generate_reply(messages=history)
```
```python
# FIXED
from datetime import datetime

from autogen import AssistantAgent

def lookup_customer(customer_id):
    # return JSON-safe values only
    return {
        "customer_id": customer_id,
        "last_seen": datetime.utcnow().isoformat(),
    }

def handle_request(customer_id):
    local_history = []
    agent = AssistantAgent(
        name="support_agent",
        llm_config={"config_list": [{"model": "gpt-4o-mini", "api_key": "..."}]},
    )
    try:
        result = lookup_customer(customer_id)
        local_history.append({"role": "user", "content": f"Lookup {customer_id}"})
        local_history.append({"role": "tool", "content": str(result)})
        return agent.generate_reply(messages=local_history)
    except Exception as e:
        return {"error": str(e)}
```
If you’re running AutoGen behind FastAPI, Flask, or a job worker, do not reuse a single global message list or cache raw tool outputs unless you control the schema tightly.
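To see why request-scoped state matters in isolation, here is a minimal sketch in plain Python (no AutoGen or web framework imports; the handler names are hypothetical stand-ins for a request handler that drives an agent):

```python
# Demonstrates shared-state contamination vs. request-scoped state.
shared_history = []  # module-level, survives across requests

def handle_shared(message):
    shared_history.append(message)   # one request's data leaks into the next
    return list(shared_history)

def handle_scoped(message):
    local_history = [message]        # fresh list per call, nothing leaks
    return local_history

handle_shared("request A")
print(handle_shared("request B"))    # ['request A', 'request B'] -> contaminated
print(handle_scoped("request B"))    # ['request B'] -> isolated
```

The same contamination happens with a reused agent's internal message list; the fix is the same either way: build the state inside the handler.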
## Other Possible Causes

### 1) Invalid chat message shape
AutoGen expects messages with valid roles and content. If one request sends None, missing keys, or nested objects where strings are expected, you may see failures like:
- `InvalidChatMessageError`
- `KeyError: 'role'`
- `TypeError: string indices must be integers`
```python
# BAD
messages = [
    {"role": "user", "content": None},
]

# GOOD
messages = [
    {"role": "user", "content": "Check policy status for policy 12345"},
]
```
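A cheap pre-flight check can catch malformed messages before they reach the runtime. This `validate_messages` helper is a sketch of my own, not part of AutoGen's API:

```python
def validate_messages(messages):
    """Return a list of problems found in a chat message list.
    Hypothetical helper, not an AutoGen API."""
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: expected dict, got {type(msg).__name__}")
            continue
        if "role" not in msg:
            problems.append(f"message {i}: missing 'role'")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: 'content' is not a string")
    return problems

print(validate_messages([{"role": "user", "content": None}]))
# flags message 0: 'content' is not a string
```

Run it right before each call and log the problems instead of letting the bad payload turn into a 500.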
### 2) Tool function throws on edge cases
A tool that works for most inputs can still fail on one customer record, one empty field, or one unexpected enum value.
```python
# BAD
def get_claim_amount(claim):
    return float(claim["amount"])  # crashes if amount is "", None, or "$100"

# GOOD
def get_claim_amount(claim):
    raw = claim.get("amount")
    if raw in (None, ""):
        return {"error": "missing amount"}
    try:
        return {"amount": float(raw)}
    except ValueError:
        return {"error": f"invalid amount: {raw}"}
```
### 3) Model rate limiting or transient upstream failures
Intermittent 500 errors can be caused by upstream API instability. In AutoGen this may appear after retries exhaust and the app collapses into a generic server error.
```python
llm_config = {
    "config_list": [
        {
            "model": "gpt-4o-mini",
            "api_key": "...",
            "timeout": 60,
            "temperature": 0,
        }
    ],
    "cache_seed": 42,
}
```
Add retry logic around the request boundary if your version of AutoGen doesn’t already do enough retrying for your workload.
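If you do add your own retries, keep them generic and at the request boundary. A minimal sketch with exponential backoff and jitter, assuming a callable `fn` that stands in for whatever actually makes the model call (nothing here is AutoGen-specific):

```python
import random
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0):
    """Retry `fn` with exponential backoff plus jitter.
    `fn` is a hypothetical wrapper around the actual model call."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the real error
            # backoff doubles each attempt; jitter spreads out retry storms
            time.sleep(base_delay * (2 ** (attempt - 1) + random.random()))
```

In practice you would catch only transient errors (rate limits, timeouts) rather than bare `Exception`, so real bugs still fail fast.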
### 4) Token overflow from growing conversation history
Long-running conversations can blow past context limits and fail only after enough turns. That looks intermittent because it depends on session length.
```python
# BAD: unbounded history growth
conversation.append(new_message)

# GOOD: trim history before each call
conversation = conversation[-12:]
```
If you use GroupChat or nested agents, keep an eye on accumulated context from system messages, tool outputs, and summaries.
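A slightly smarter trim keeps any leading system message while dropping old turns. A sketch under the assumption that messages are role/content dicts; real budgeting should count tokens, not messages:

```python
def trim_history(messages, keep_last=12):
    """Keep the first system message (if any) plus the most recent turns.
    Hypothetical helper; counts messages, not tokens."""
    system = [m for m in messages if m.get("role") == "system"][:1]
    rest = [m for m in messages if m.get("role") != "system"]
    return system + rest[-keep_last:]

history = [{"role": "system", "content": "You are a support agent."}]
history += [{"role": "user", "content": f"turn {i}"} for i in range(50)]
trimmed = trim_history(history)
print(len(trimmed))  # 13: the system message plus the last 12 turns
```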
## How to Debug It

- **Capture the full stack trace**
  - Don’t stop at the HTTP 500.
  - Look for the real exception underneath: `InvalidChatMessageError`, `JSONDecodeError`, `RateLimitError`, or a tool-side exception.
- **Log the exact input payload**
  - Print the last user message, the full message list shape, and the tool output right before calling AutoGen.
  - Confirm every message has a `role`, a `content`, and only JSON-safe values.
- **Disable tools and test the LLM path only**
  - If the error disappears when tools are removed, your bug is in tool code or serialization.
  - If it still happens, inspect message formatting and model config.
- **Run with one request per fresh agent instance**
  - If fresh instances fix it, you have shared-state contamination.
  - That usually means globals, reused histories, cached mutable objects, or thread-safety issues.
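When logging the input payload, a quick check right before the AutoGen call can tell you exactly which message will break serialization. `find_unserializable` is a hypothetical helper:

```python
import json
from datetime import datetime

def find_unserializable(messages):
    """Return the indices of messages that fail json.dumps."""
    bad = []
    for i, msg in enumerate(messages):
        try:
            json.dumps(msg)
        except TypeError:
            bad.append(i)
    return bad

messages = [
    {"role": "user", "content": "hello"},
    {"role": "tool", "content": datetime.utcnow()},  # datetime is not JSON-safe
]
print(find_unserializable(messages))  # [1]
```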
## Prevention

- Keep all per-request state local to the handler.
- Return only JSON-safe data from tools: strings, numbers, booleans, lists, dicts.
- Trim conversation history aggressively before it grows beyond your prompt budget.
- Wrap every external dependency with explicit error handling and structured logs.
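To make the "JSON-safe data only" rule enforceable rather than aspirational, you can normalize tool outputs at the boundary. A sketch; the helper name and the set of conversions are my own choices:

```python
import datetime

def to_json_safe(value):
    """Recursively convert common non-JSON types into safe primitives.
    Hypothetical boundary helper for tool return values."""
    if isinstance(value, dict):
        return {str(k): to_json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [to_json_safe(v) for v in value]
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value

print(to_json_safe({"last_seen": datetime.datetime(2024, 1, 1, 12, 0)}))
# {'last_seen': '2024-01-01T12:00:00'}
```

Call it as the last line of every tool so nothing exotic ever enters the message history.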
If you’re building this in production, treat AutoGen agents like stateless workers. The moment you let request data leak across sessions, intermittent 500 errors become inevitable.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.