How to Fix 'OOM error during inference in production' in AutoGen (Python)
An OOM error during inference in AutoGen usually means your process ran out of memory while building prompts, storing chat history, or loading a model. In practice, this shows up when an agent conversation grows too large, when you keep too many messages in memory, or when the model backend itself needs more RAM/VRAM than the machine has.
The Most Common Cause
The #1 cause is unbounded conversation history. In AutoGen, AssistantAgent, UserProxyAgent, and multi-agent chats will keep appending messages unless you trim them. If you keep passing the full transcript back into inference, token count grows until the backend throws an OOM-style failure.
Here’s the broken pattern:
```python
# Broken: keeps reusing the same growing chat history
from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={"config_list": [{"model": "gpt-4o-mini", "api_key": "..."}]},
)
user = UserProxyAgent(name="user")

for i in range(100):
    user.initiate_chat(
        assistant,
        message=f"Round {i}: summarize this data",
    )
```

And the fix:

```python
# Fixed: reset or scope each run
from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={"config_list": [{"model": "gpt-4o-mini", "api_key": "..."}]},
)
user = UserProxyAgent(name="user")

for i in range(100):
    user.initiate_chat(
        assistant,
        message=f"Round {i}: summarize this data",
        clear_history=True,
    )
```
If you are using lower-level chat APIs, the same rule applies: do not keep appending every prior message forever.
```python
# Broken: unbounded message list
messages.append({"role": "user", "content": payload})
reply = assistant.generate_reply(messages=messages)

# Fixed: trim history before inference
messages = messages[-10:]
reply = assistant.generate_reply(messages=messages)
```
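Trimming by message count is crude: ten huge messages can still overflow the context. A token-budget trim is safer. Here is a minimal sketch using tiktoken for a rough count; the budget, encoding choice, and helper name are assumptions, not an AutoGen API.

```python
import tiktoken

# Rough token budget for the history we are willing to resend.
# The right number depends on your model's context window.
TOKEN_BUDGET = 4000
enc = tiktoken.get_encoding("cl100k_base")  # approximation; pick the encoding that matches your model

def trim_to_budget(messages, budget=TOKEN_BUDGET):
    """Keep the most recent messages whose combined size fits the budget."""
    kept, total = [], 0
    for msg in reversed(messages):
        tokens = len(enc.encode(msg.get("content") or ""))
        if total + tokens > budget:
            break
        kept.append(msg)
        total += tokens
    return list(reversed(kept))

messages = trim_to_budget(messages)
reply = assistant.generate_reply(messages=messages)
```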
If your backend is local, you may see errors like:

- `RuntimeError: CUDA out of memory`
- `torch.cuda.OutOfMemoryError`
- `MemoryError`
- `OOM during inference`

Those are symptoms. The root cause is usually prompt growth or oversized context.
Other Possible Causes
1) Model too large for the host
If you run a local model through AutoGen, the model may simply not fit in RAM/VRAM.
```python
llm_config = {
    "config_list": [
        {
            "model": "meta-llama/Llama-3.1-70B-Instruct",
            "base_url": "http://localhost:8000",
            "api_key": "local",
        }
    ]
}
```
On a small GPU box, that can fail immediately with:
- `CUDA out of memory`
- `failed to allocate tensor`
- `out of memory allocating`
Fix by choosing a smaller model or quantized variant.
```python
llm_config = {
    "config_list": [
        {
            "model": "meta-llama/Llama-3.1-8B-Instruct-GGUF",
            "base_url": "http://localhost:8000",
            "api_key": "local",
        }
    ]
}
```
2) Too many agents running at once
AutoGen group chats can multiply memory use because each agent keeps state and messages.
```python
from autogen import GroupChat, GroupChatManager

groupchat = GroupChat(
    agents=[agent1, agent2, agent3, agent4, agent5],
    messages=[],
    max_round=50,
)
manager = GroupChatManager(groupchat=groupchat)
```
Reduce the number of agents or cap rounds aggressively.
```python
groupchat = GroupChat(
    agents=[agent1, agent2],
    messages=[],
    max_round=10,
)
```
3) Tool outputs are too large
A tool returning a giant JSON blob or dataframe preview gets fed back into the next prompt.
```python
def fetch_records():
    return huge_json_blob  # thousands of lines

assistant.register_for_llm(
    name="fetch_records", description="Fetch all records"
)(fetch_records)
```
Trim tool output before returning it to the model.
```python
def fetch_records():
    rows = query_db()
    return rows[:20]  # return only what the model needs
```
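For tools that return large JSON or free-form text, a generic size cap helps too. This is a plain-Python sketch; the character limit and helper names are illustrative, not an AutoGen feature.

```python
import json

MAX_TOOL_CHARS = 4000  # arbitrary cap; tune it to your model's context window

def truncate_tool_output(result, limit=MAX_TOOL_CHARS):
    """Serialize a tool result and hard-cap its size before it re-enters the chat."""
    text = result if isinstance(result, str) else json.dumps(result, default=str)
    if len(text) > limit:
        return text[:limit] + f"\n...[truncated {len(text) - limit} characters]"
    return text

def fetch_records_safe():
    return truncate_tool_output(query_db())
```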
4) Long system prompts and verbose instructions
A huge system_message eats context before the conversation even starts.
```python
assistant = AssistantAgent(
    name="assistant",
    system_message=open("policy_manual.txt").read(),
    llm_config=llm_config,
)
```
Keep system prompts tight and move long policy text into retrieval or external lookup.
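One way to do that is to keep the system prompt to a few lines and expose the manual through a lookup tool the agent calls on demand. The sketch below is illustrative: the file layout, section splitting, and tool name are assumptions, and you would register the function like any other AutoGen tool.

```python
# Short system prompt; the manual stays out of the context window.
assistant = AssistantAgent(
    name="assistant",
    system_message="You are a claims assistant. Use the lookup_policy tool for policy details.",
    llm_config=llm_config,
)

# Hypothetical lookup over the policy manual, split into sections on blank lines.
POLICY_SECTIONS = open("policy_manual.txt").read().split("\n\n")

def lookup_policy(keyword: str) -> str:
    """Return only the policy sections mentioning the keyword."""
    hits = [s for s in POLICY_SECTIONS if keyword.lower() in s.lower()]
    return "\n\n".join(hits[:3]) or "No matching policy section found."

# Register for the LLM (so it can call the tool) and for execution on the user proxy.
assistant.register_for_llm(
    name="lookup_policy", description="Look up policy sections by keyword"
)(lookup_policy)
user.register_for_execution(name="lookup_policy")(lookup_policy)
```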
How to Debug It
- Check whether memory grows with each turn
  - Watch RSS/VRAM while repeating one chat loop (see the sketch after this list).
  - If usage climbs every round, you have unbounded history or large tool payloads.
- Inspect message length before inference
  - Log token estimates and message count.
  - In AutoGen flows, print `len(messages)` or inspect the conversation object before calling the LLM.
- Test with a tiny prompt and one turn
  - Run a single `AssistantAgent` reply with no tools.
  - If that works but multi-turn fails, the issue is accumulation rather than model size.
- Swap to a smaller model
  - If a smaller model works under identical prompts, your original model is too large for available memory.
  - This is common with local deployments behind vLLM, Ollama, TGI, or custom OpenAI-compatible servers.
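Here is a minimal sketch of the first two checks: it repeats one chat loop while logging process RSS and a rough message/character count each round. It assumes the assistant and user pair from earlier and the 0.2-style `chat_messages` attribute; psutil is an extra dependency, and the character count is only a proxy for tokens.

```python
import os
import psutil

process = psutil.Process(os.getpid())

for i in range(20):
    user.initiate_chat(
        assistant,
        message=f"Round {i}: summarize this data",
        clear_history=False,  # deliberately keep history to expose growth
    )
    history = assistant.chat_messages[user]  # messages exchanged with this peer
    chars = sum(len(str(m.get("content", ""))) for m in history)
    rss_mb = process.memory_info().rss / 1e6
    print(f"round={i} messages={len(history)} approx_chars={chars} rss_mb={rss_mb:.1f}")
```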
Prevention
- Cap conversation growth:
  - Use `clear_history=True`
  - Trim old messages
  - Summarize instead of replaying full transcripts (see the sketch after this list)
- Keep tool outputs small:
  - Return top-k rows only
  - Strip raw logs and giant JSON payloads
  - Don't pass entire files back into chat unless necessary
- Match model size to hardware:
  - Use quantized models on smaller GPUs
  - Set realistic context lengths
  - Load-test before production traffic hits it
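A minimal version of the summarize-instead-of-replay idea, assuming the same assistant and messages list as earlier; the threshold and prompt wording are placeholders, not AutoGen defaults.

```python
KEEP_RECENT = 6  # how many recent messages to keep verbatim (arbitrary)

def compact_history(messages, keep=KEEP_RECENT):
    """Replace older messages with one summary message; keep the recent tail verbatim."""
    if len(messages) <= keep:
        return messages
    old, recent = messages[:-keep], messages[-keep:]
    transcript = "\n".join(f"{m['role']}: {m.get('content', '')}" for m in old)
    summary = assistant.generate_reply(
        messages=[{"role": "user", "content": "Summarize this conversation in under 150 words:\n" + transcript}]
    )
    if isinstance(summary, dict):  # generate_reply may return a dict depending on configuration
        summary = summary.get("content", "")
    return [{"role": "assistant", "content": f"Summary of earlier conversation: {summary}"}] + recent

messages = compact_history(messages)
reply = assistant.generate_reply(messages=messages)
```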
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.