How to Fix 'state not updating when scaling' in AutoGen (Python)
What this error means
If you’re seeing 'state not updating when scaling' in AutoGen, it usually means your agent state changes in one process, but the scaled-out worker that handles the next step reads a different copy of that state. This shows up when you move from a single Python process to multiple workers, async tasks, or distributed execution.
In practice, it happens when you store conversation state in memory and then expect it to survive across agent calls, retries, or replicas.
The Most Common Cause
The #1 cause is mutable state living inside a local Python object that gets recreated or copied during scaling.
With AutoGen, people often keep chat history or session data on AssistantAgent, UserProxyAgent, or a custom wrapper class, then run it behind multiprocessing, Celery, FastAPI workers, or any scale-out setup. The code works locally, then breaks once requests land on different workers.
Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| State stored only in process memory | State persisted in shared storage |
| Worker A updates state; worker B reads a stale copy | Any worker can reconstruct the same state |
| GroupChatManager sees old messages | Rehydrate state before each turn |
# BROKEN: state is local to one Python process
from autogen import AssistantAgent, UserProxyAgent

class ChatService:
    def __init__(self):
        self.messages = []  # lost when another worker handles the next request
        self.assistant = AssistantAgent(
            name="assistant",
            llm_config={"config_list": [{"model": "gpt-4o-mini", "api_key": "..."}]},
        )
        self.user = UserProxyAgent(name="user")

    def handle(self, text: str):
        self.messages.append({"role": "user", "content": text})
        reply = self.assistant.generate_reply(messages=self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply
# FIXED: persist and reload state per session
from autogen import AssistantAgent
import json
from pathlib import Path

STATE_DIR = Path("./chat_state")
STATE_DIR.mkdir(exist_ok=True)

class ChatService:
    def __init__(self):
        self.assistant = AssistantAgent(
            name="assistant",
            llm_config={"config_list": [{"model": "gpt-4o-mini", "api_key": "..."}]},
        )

    def _state_path(self, session_id: str) -> Path:
        return STATE_DIR / f"{session_id}.json"

    def load_messages(self, session_id: str):
        # Return an empty history for sessions that have not been saved yet
        path = self._state_path(session_id)
        return json.loads(path.read_text()) if path.exists() else []

    def save_messages(self, session_id: str, messages):
        self._state_path(session_id).write_text(json.dumps(messages))

    def handle(self, session_id: str, text: str):
        # Rehydrate, run one turn, then persist before returning
        messages = self.load_messages(session_id)
        messages.append({"role": "user", "content": text})
        reply = self.assistant.generate_reply(messages=messages)
        messages.append({"role": "assistant", "content": reply})
        self.save_messages(session_id, messages)
        return reply
If you’re running multiple replicas, use Redis/Postgres/S3 instead of local files. The important part is that the next worker can reconstruct the same conversation state.
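As a rough sketch of what "shared storage" can look like behind a small interface, the class below keeps one JSON-encoded message list per session in SQLite from the standard library. SQLite is used here only as a stand-in so the example runs without extra services; a file on local disk is still not shared across machines, so in production the same `load`/`save` shape would sit in front of Redis or Postgres. The `SessionStore` name, the table layout, and the method names are assumptions for illustration, not part of AutoGen.

```python
import json
import sqlite3
from contextlib import closing


class SessionStore:
    """Hypothetical store: one JSON-encoded message list per session ID."""

    def __init__(self, db_path: str):
        self.db_path = db_path
        with closing(self._connect()) as conn, conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS sessions ("
                "session_id TEXT PRIMARY KEY, messages TEXT NOT NULL)"
            )

    def _connect(self) -> sqlite3.Connection:
        # Open a new connection per call so no connection object is shared
        return sqlite3.connect(self.db_path)

    def load(self, session_id: str) -> list:
        with closing(self._connect()) as conn:
            row = conn.execute(
                "SELECT messages FROM sessions WHERE session_id = ?",
                (session_id,),
            ).fetchone()
        return json.loads(row[0]) if row else []

    def save(self, session_id: str, messages: list) -> None:
        # The inner `conn` context commits on success and rolls back on error
        with closing(self._connect()) as conn, conn:
            conn.execute(
                "INSERT OR REPLACE INTO sessions (session_id, messages) "
                "VALUES (?, ?)",
                (session_id, json.dumps(messages)),
            )
```

Two `SessionStore` instances pointed at the same database behave like two workers: whatever one saves, the other can load on the next turn.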
Other Possible Causes
1) You’re mutating a copied dict instead of the original object
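This happens when a step works on a copy.deepcopy of the state, or on a serialize/deserialize round trip, and never writes the result back. A shallow dict.copy() still shares nested lists, but top-level keys reassigned on the copy never reach the original.
# BAD
import copy
state = {"turns": []}
worker_state = copy.deepcopy(state)
worker_state["turns"].append("hello")  # state["turns"] is still []
# GOOD
state["turns"].append("hello")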
If your AutoGen orchestration uses message dicts directly, make sure each step writes back to the canonical store.
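The difference between the copy styles is easy to miss, so here is a small standalone sketch. It uses only plain dicts and the standard `copy` module; the `state`, `shallow`, and `deep` names are illustrative.

```python
import copy

state = {"turns": []}

# Shallow copy: the nested list is shared, so this append is visible in `state`
shallow = state.copy()
shallow["turns"].append("hello")

# Deep copy: the nested list is independent, so `state` never sees this append
deep = copy.deepcopy(state)
deep["turns"].append("world")

# A top-level key set on any copy does not reach the original either
shallow["step"] = 1

print(state)  # {'turns': ['hello']}

# Writing back to the canonical object is what makes the update stick
state["turns"] = deep["turns"]
print(state)  # {'turns': ['hello', 'world']}
```

A JSON round trip behaves like the deep copy here: the deserialized object is detached, so the step that changed it has to save it again.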
2) You’re using async tasks without awaiting the write
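In async pipelines, a write that is not awaited may not have run by the time the read executes. If save_async is a coroutine function, calling it without await only creates a coroutine object and nothing is written at all.
# BAD
async def run_turn(store):
    store.save_async("session-1", {"step": 1})  # not awaited
    state = await store.load("session-1")  # stale read
    return state

# GOOD
async def run_turn(store):
    await store.save_async("session-1", {"step": 1})
    state = await store.load("session-1")
    return state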
With AutoGen agents wrapped in FastAPI endpoints or background tasks, this is a common race.
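The race can be reproduced without AutoGen or a web framework. The sketch below assumes a hypothetical `InMemoryStore` whose `save` sleeps briefly to stand in for network latency. It schedules the write with `asyncio.create_task`, which is another common way to end up with a write that has not been awaited before the read.

```python
import asyncio


class InMemoryStore:
    """Hypothetical async store; the sleep stands in for network latency."""

    def __init__(self):
        self._data = {}

    async def save(self, key, value):
        await asyncio.sleep(0.01)
        self._data[key] = value

    async def load(self, key):
        return self._data.get(key)


async def racy_turn(store):
    # Scheduled but not awaited: the write has not run when load() executes
    task = asyncio.create_task(store.save("session-1", {"step": 1}))
    state = await store.load("session-1")
    await task  # let the write finish so the task is not left pending
    return state


async def safe_turn(store):
    # Awaiting the write guarantees it completed before the read
    await store.save("session-1", {"step": 1})
    return await store.load("session-1")


async def main():
    return await racy_turn(InMemoryStore()), await safe_turn(InMemoryStore())


stale, fresh = asyncio.run(main())
print(stale, fresh)  # None {'step': 1}
```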
3) Your group chat manager is rebuilt every request
If you create a new GroupChat or GroupChatManager on each call without reloading prior messages, the manager starts from zero every time.
from autogen import GroupChat, GroupChatManager
# BAD: fresh manager every request
groupchat = GroupChat(agents=[...], messages=[])
manager = GroupChatManager(groupchat=groupchat)
# GOOD: restore previous messages for the session
groupchat = GroupChat(agents=[...], messages=load_session_messages(session_id))
manager = GroupChatManager(groupchat=groupchat)
4) Your cache key is wrong across workers
A lot of “state not updating” bugs are actually bad session routing. If worker A saves under one key and worker B loads under another key, it looks like AutoGen lost state.
# BAD
session_key = request.headers.get("X-Session") # sometimes missing / inconsistent
# GOOD
session_key = f"{tenant_id}:{user_id}:{conversation_id}"
Use a stable key derived from authenticated identity and conversation ID. Don’t rely on ephemeral headers unless they’re guaranteed by your gateway.
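One way to keep the key consistent is to build it in a single helper that every read and write goes through. The function below is a hypothetical sketch; the `build_session_key` name and the rule that rejects empty parts or parts containing the separator are assumptions, not something AutoGen provides.

```python
def build_session_key(tenant_id: str, user_id: str, conversation_id: str) -> str:
    """Hypothetical helper: the same inputs yield the same key on every worker."""
    parts = [tenant_id, user_id, conversation_id]
    for part in parts:
        # Reject empty parts and parts that would make the key ambiguous
        if not part or ":" in part:
            raise ValueError(f"invalid session key part: {part!r}")
    return ":".join(parts)


print(build_session_key("acme", "u42", "c7"))  # acme:u42:c7
```

Failing loudly on a missing part is deliberate: a request without a conversation ID then surfaces as an error instead of silently reading and writing under a different key.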
How to Debug It
- Print the worker identity
  - Log hostname, PID, container ID, or thread ID on every request.
  - If consecutive turns hit different workers and you use memory-only storage, you found the issue.
- Log state before and after each agent call
  - Dump message count and last message role/content.
  - Compare what AutoGen receives versus what your app thinks it saved.
- Check whether persistence is real
  - Write a value.
  - Restart the process.
  - Read it back.
  - If it disappears after restart, it was never persistent storage.
- Reproduce with one worker
  - Run with a single Uvicorn worker or one container replica.
  - If the bug vanishes, your problem is almost certainly shared-state handling rather than AutoGen itself.
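The first two checks can be covered with a few standard-library calls. This is a minimal sketch; the `worker_identity`, `summarize`, and `log_turn` helpers and the `chat_debug` logger name are hypothetical, and it logs only what Python itself can see (a container ID would have to come from your platform).

```python
import logging
import os
import socket
import threading

logger = logging.getLogger("chat_debug")


def worker_identity() -> dict:
    """Identify which host, process, and thread handled this turn."""
    return {
        "host": socket.gethostname(),
        "pid": os.getpid(),
        "thread": threading.get_ident(),
    }


def summarize(messages: list) -> dict:
    """Message count plus the role and a short preview of the last message."""
    last = messages[-1] if messages else {}
    return {
        "count": len(messages),
        "last_role": last.get("role"),
        "last_content": str(last.get("content") or "")[:80],
    }


def log_turn(session_id: str, stage: str, messages: list) -> None:
    # Call once with stage="before" and once with stage="after" the agent call
    logger.info(
        "session=%s stage=%s worker=%s state=%s",
        session_id, stage, worker_identity(), summarize(messages),
    )
```

If two consecutive turns for the same session log different PIDs or hosts and the message count drops back to zero, the state was living in process memory.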
Prevention
- Keep conversation state outside process memory if requests can land on different workers.
- Use one canonical session key for all reads and writes.
- Treat AssistantAgent, UserProxyAgent, GroupChat, and GroupChatManager as runtime objects; persist only the data they need to rebuild context.
- Add logs for session ID, worker ID, message count, and storage backend on every turn.
If you want this to stop showing up in production incidents, design for rehydration from day one. In AutoGen systems that scale horizontally, “state” is not an object attribute — it’s data in shared storage.
By Cyprian Aarons, AI Consultant at Topiax.