# How to Fix 'OOM error during inference' in AutoGen (Python)
When AutoGen throws an 'OOM error during inference', it means the model backend ran out of memory while processing a prompt, tool call, or response generation. In practice, this usually shows up when you send too much conversation history, use a model with a small context window, or let agents accumulate state across many turns.
The failure is often not in AutoGen itself. It’s usually the combination of AssistantAgent, long chat history, large tool outputs, and an undersized model/runtime.
## The Most Common Cause
The #1 cause is unbounded conversation growth. In AutoGen, people often keep appending messages to the same messages list or reuse the same agent chat state across multiple runs, which causes inference payloads to balloon until the backend crashes.
Here’s the broken pattern versus the fixed pattern:
| Broken pattern | Fixed pattern |
|---|---|
| Reuse full history forever | Trim history or summarize before inference |
| Pass large tool output directly into chat | Store it externally and pass only a reference |
| Let AssistantAgent see every prior turn | Reset or cap context per task |
```python
# BROKEN: full history keeps growing
import os

from autogen import AssistantAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={
        "config_list": [{"model": "gpt-4o-mini", "api_key": os.environ["OPENAI_API_KEY"]}]
    },
)

messages = []
for i in range(100):
    messages.append({"role": "user", "content": f"Turn {i}: analyze this report..."})
    reply = assistant.generate_reply(messages=messages)
    messages.append({"role": "assistant", "content": reply})
```
```python
# FIXED: cap history and keep only recent context
import os

from autogen import AssistantAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={
        "config_list": [{"model": "gpt-4o-mini", "api_key": os.environ["OPENAI_API_KEY"]}]
    },
)

def trim_messages(messages, keep_last=8):
    # only the most recent turns reach the backend
    return messages[-keep_last:]

messages = []
for i in range(100):
    messages.append({"role": "user", "content": f"Turn {i}: analyze this report..."})
    trimmed = trim_messages(messages)
    reply = assistant.generate_reply(messages=trimmed)
    messages.append({"role": "assistant", "content": reply})
```
If you’re using GroupChat or ConversableAgent, the same issue applies. The agent framework will happily carry forward everything unless you explicitly limit it.
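That per-task cap can be enforced with a bounded buffer. Here is a minimal sketch in plain Python (no AutoGen dependency; the cap size and message shape are illustrative assumptions):

```python
from collections import deque

def make_capped_history(max_messages=8):
    # a deque with maxlen silently drops the oldest entries, so the
    # inference payload can never grow past the cap
    return deque(maxlen=max_messages)

history = make_capped_history(max_messages=4)
for i in range(10):
    history.append({"role": "user", "content": f"Turn {i}"})

# only the 4 most recent turns survive
assert [m["content"] for m in history] == ["Turn 6", "Turn 7", "Turn 8", "Turn 9"]
```

Pass `list(history)` wherever you would normally pass the full `messages` list; older turns fall off automatically instead of accumulating.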
## Other Possible Causes
### 1) Tool output is too large
A common failure mode is returning raw PDFs, logs, HTML pages, or giant JSON blobs from a tool and feeding them straight back into AssistantAgent.
```python
# BAD: massive tool result goes directly into context
def fetch_claim_history(claim_id):
    with open(f"/tmp/{claim_id}.json") as f:
        return f.read()

tool_result = fetch_claim_history("CLM-123")
messages.append({"role": "tool", "content": tool_result})
```
Fix it by summarizing or storing the payload elsewhere.
```python
# GOOD: store raw data externally, pass a summary
tool_result = fetch_claim_history("CLM-123")
summary = summarize_claim_history(tool_result)  # your own summarizer
messages.append({"role": "tool", "content": summary})
```
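If the raw payload might be needed later, persist it outside the chat and pass only a pointer. A sketch assuming local temp storage is acceptable; `store_artifact` is a hypothetical helper, not an AutoGen API:

```python
import json
import tempfile
from pathlib import Path

def store_artifact(name, payload):
    # hypothetical helper: persist the raw blob and return its location
    root = Path(tempfile.mkdtemp(prefix="artifacts-"))
    path = root / f"{name}.json"
    path.write_text(json.dumps(payload))
    return path

raw = {"claim_id": "CLM-123", "events": ["filed", "reviewed"] * 500}
path = store_artifact("CLM-123", raw)

# the model sees a short reference, not the 1000-event blob
messages = []
messages.append({
    "role": "tool",
    "content": f"Claim history stored at {path} ({len(raw['events'])} events).",
})
assert len(messages[-1]["content"]) < 200
```

In production you would swap the temp directory for object storage or a vector store, but the shape is the same: big artifact out of band, short reference in band.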
### 2) Model context window is too small
If you’re using a smaller model, even moderate prompts can trigger memory pressure. This often shows up as backend errors like:
- `OOM error during inference`
- `CUDA out of memory`
- `RuntimeError: out of memory`
- provider-specific 500/503 failures during generation
Example config that can be too tight:
```python
llm_config = {
    "config_list": [
        {"model": "llama3-8b", "api_key": "..."}
    ],
    "temperature": 0.2,
}
```
Use a model with more headroom or reduce prompt size.
```python
llm_config = {
    "config_list": [
        {"model": "gpt-4o-mini", "api_key": "..."}
    ],
    "temperature": 0.2,
}
```
### 3) You are sending oversized system prompts
Some teams pack policy docs, product manuals, and workflow rules into one giant system message. That works until it doesn’t.
```python
system_prompt = open("all_company_policies.txt").read()
assistant = AssistantAgent(
    name="assistant",
    system_message=system_prompt,
    llm_config=llm_config,
)
```
Break it up and only inject what’s needed for the task.
```python
assistant = AssistantAgent(
    name="assistant",
    system_message="You are an insurance claims triage assistant.",
    llm_config=llm_config,
)
```
### 4) Parallel agents are competing for memory
With GroupChat, multiple agents can generate long responses at once. If each one has large context plus tool output, you multiply memory usage fast.
```python
from autogen import GroupChat, GroupChatManager

groupchat = GroupChat(agents=[agent1, agent2, agent3], messages=[], max_round=20)
manager = GroupChatManager(groupchat=groupchat, llm_config=llm_config)
```
Reduce rounds and constrain each agent’s prompt size.
```python
groupchat = GroupChat(agents=[agent1, agent2], messages=[], max_round=6)
manager = GroupChatManager(groupchat=groupchat, llm_config=llm_config)
```
## How to Debug It

- Reproduce with minimal input.
  - Start with one user message and no tools.
  - If it works, add history back incrementally.
  - If it fails immediately, your model/backend config is likely the problem.
- Log message sizes.
  - Print token counts or approximate character lengths before every inference call.
  - Watch for one huge tool response or a runaway conversation buffer.

  ```python
  for m in messages:
      print(m["role"], len(m["content"]))
  ```

- Disable tools and multi-agent flow.
  - Run the same prompt through a single `AssistantAgent`.
  - If the OOM disappears, the issue is probably tool output size or group chat amplification.
- Check backend limits.
  - Verify the model's context window.
  - Verify GPU/CPU memory on local inference servers.
  - Check whether your provider is returning actual OOMs versus rate-limit or timeout errors masked as inference failures.
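The "log message sizes" step can be pushed a little further with a rough token estimate. The ~4-characters-per-token ratio below is a common heuristic for English text, not an exact count; swap in your provider's tokenizer if you need precision:

```python
def approx_tokens(text):
    # rough heuristic: ~4 characters per token for English prose
    return max(1, len(text) // 4)

def audit_messages(messages, warn_at=2000):
    # returns (total_estimate, indexes of oversized messages) so you can
    # spot one huge tool response before it hits the backend
    flagged = []
    total = 0
    for i, m in enumerate(messages):
        t = approx_tokens(m["content"])
        total += t
        if t > warn_at:
            flagged.append(i)
    return total, flagged

msgs = [
    {"role": "user", "content": "short question"},
    {"role": "tool", "content": "x" * 50_000},  # runaway tool output
]
total, flagged = audit_messages(msgs)
assert flagged == [1]
```

Run the audit right before every inference call and log the result; a single flagged index almost always points at the offending tool.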
## Prevention

- Cap chat history aggressively: keep the last few turns and summarize older context.
- Never pass raw large artifacts into the prompt: store documents in object storage or a vector store, and pass references or summaries.
- Match model size to workload: small models for short tasks, larger-context models for document-heavy flows.
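The first two prevention rules compose: keep the last N turns verbatim and collapse everything older into one synthetic summary message. A sketch where `summarize` is a stand-in for your own summarizer (a cheap LLM call, keyword extraction, whatever fits your stack):

```python
def summarize(messages):
    # stand-in: a real implementation would call a cheap model here
    return f"[summary of {len(messages)} earlier messages]"

def compact_history(messages, keep_last=6):
    # nothing to compact while the history is still short
    if len(messages) <= keep_last:
        return list(messages)
    older, recent = messages[:-keep_last], messages[-keep_last:]
    # one synthetic message replaces the entire older tail
    return [{"role": "system", "content": summarize(older)}] + recent

msgs = [{"role": "user", "content": f"Turn {i}"} for i in range(20)]
compacted = compact_history(msgs)
assert len(compacted) == 7  # 1 summary + 6 recent turns
assert compacted[0]["content"] == "[summary of 14 earlier messages]"
```

Call `compact_history` on every turn instead of only when things blow up; the cost is bounded and the payload size becomes predictable.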
If you’re building production AutoGen workflows for banking or insurance, treat prompt size like any other resource limit. Add hard caps early, because once agents start chaining tools and preserving state, OOM bugs become intermittent and painful to reproduce.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.