How to Fix 'OOM error during inference' in AutoGen (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: oom-error-during-inference, autogen, python

When AutoGen throws an OOM error during inference, the model backend ran out of memory while processing a prompt, handling a tool call, or generating a response. In practice, this usually shows up when you send too much conversation history, use a model with a small context window, or let agents accumulate state across many turns.

The failure is often not in AutoGen itself. It’s usually the combination of AssistantAgent, long chat history, large tool outputs, and an undersized model/runtime.

The Most Common Cause

The #1 cause is unbounded conversation growth. In AutoGen, people often keep appending messages to the same messages list or reuse the same agent chat state across multiple runs, which causes inference payloads to balloon until the backend crashes.

Here’s the broken pattern versus the fixed pattern:

Broken pattern                            | Fixed pattern
Reuse full history forever                | Trim history or summarize before inference
Pass large tool output directly into chat | Store it externally and pass only a reference
Let AssistantAgent see every prior turn   | Reset or cap context per task

# BROKEN: full history keeps growing
import os

from autogen import AssistantAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={
        "config_list": [{"model": "gpt-4o-mini", "api_key": os.environ["OPENAI_API_KEY"]}]
    },
)

messages = []

for i in range(100):
    messages.append({"role": "user", "content": f"Turn {i}: analyze this report..."})
    reply = assistant.generate_reply(messages=messages)
    messages.append({"role": "assistant", "content": reply})

# FIXED: cap history and keep only recent context
import os

from autogen import AssistantAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={
        "config_list": [{"model": "gpt-4o-mini", "api_key": os.environ["OPENAI_API_KEY"]}]
    },
)

def trim_messages(messages, keep_last=8):
    return messages[-keep_last:]

messages = []

for i in range(100):
    messages.append({"role": "user", "content": f"Turn {i}: analyze this report..."})
    trimmed = trim_messages(messages)
    reply = assistant.generate_reply(messages=trimmed)
    messages.append({"role": "assistant", "content": reply})

If you’re using GroupChat or ConversableAgent, the same issue applies. The agent framework will happily carry forward everything unless you explicitly limit it.
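
If you drive agents through initiate_chat instead of generate_reply, reset their stored history between unrelated tasks. This is a minimal sketch assuming the pyautogen 0.2-style API, where ConversableAgent exposes clear_history() and initiate_chat accepts max_turns; verify both against your installed version.

# SKETCH: clear per-agent chat state between unrelated tasks
from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(name="assistant", llm_config=llm_config)
user = UserProxyAgent(name="user", human_input_mode="NEVER", code_execution_config=False)

for task in ["Triage claim CLM-101", "Triage claim CLM-102"]:
    user.initiate_chat(assistant, message=task, max_turns=2)
    # explicitly drop stored history so the next task starts from a clean slate
    assistant.clear_history()
    user.clear_history()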

Other Possible Causes

1) Tool output is too large

A common failure mode is returning raw PDFs, logs, HTML pages, or giant JSON blobs from a tool and feeding them straight back into AssistantAgent.

# BAD: massive tool result goes directly into context
def fetch_claim_history(claim_id):
    return open(f"/tmp/{claim_id}.json").read()

tool_result = fetch_claim_history("CLM-123")
messages.append({"role": "tool", "content": tool_result})

Fix it by summarizing or storing the payload elsewhere.

# GOOD: store raw data externally, pass a summary
tool_result = fetch_claim_history("CLM-123")
summary = summarize_claim_history(tool_result)  # your own summarizer
messages.append({"role": "tool", "content": summary})
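
If the agent may need the raw data later, the other option from the table above is to persist it outside the chat and pass only a reference. store_blob below is a hypothetical stand-in for object storage, a database, or a vector store, not an AutoGen API.

# SKETCH: keep the raw payload out of the prompt, pass a reference instead
import uuid

def store_blob(data):
    blob_id = f"blob-{uuid.uuid4().hex[:8]}"
    with open(f"/tmp/{blob_id}.json", "w") as f:  # stand-in for real storage
        f.write(data)
    return blob_id

tool_result = fetch_claim_history("CLM-123")
blob_id = store_blob(tool_result)
messages.append({
    "role": "tool",
    "content": f"Claim history for CLM-123 stored as {blob_id}; request specific fields as needed.",
})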

2) Model context window is too small

If you’re running a smaller model, especially self-hosted, even moderate prompts can exceed its context window or exhaust GPU memory. This often shows up as backend errors like:

  • OOM error during inference
  • CUDA out of memory
  • RuntimeError: out of memory
  • provider-specific 500/503 failures during generation

Example config that can be too tight:

llm_config = {
    "config_list": [
        {"model": "llama3-8b", "api_key": "..."}
    ],
    "temperature": 0.2,
}

Use a model with more headroom or reduce prompt size.

llm_config = {
    "config_list": [
        {"model": "gpt-4o-mini", "api_key": "..."}
    ],
    "temperature": 0.2,
}

3) You are sending oversized system prompts

Some teams pack policy docs, product manuals, and workflow rules into one giant system message. That works until it doesn’t.

system_prompt = open("all_company_policies.txt").read()
assistant = AssistantAgent(
    name="assistant",
    system_message=system_prompt,
    llm_config=llm_config,
)

Break it up and only inject what’s needed for the task.

assistant = AssistantAgent(
    name="assistant",
    system_message="You are an insurance claims triage assistant.",
    llm_config=llm_config,
)
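
One way to break it up is to keep the system message short and inject only the policy section the current task needs. load_policy_section below is a hypothetical helper, not an AutoGen API; in practice it might be a file lookup keyed by task type or a retrieval call against a vector store.

# SKETCH: small system prompt, task-specific policy injected per request
def load_policy_section(name):
    # hypothetical helper: read one named section instead of the whole manual
    with open(f"policies/{name}.txt") as f:
        return f.read()

task_policy = load_policy_section("claims_triage")
messages = [{
    "role": "user",
    "content": f"Relevant policy excerpt:\n{task_policy}\n\nTriage claim CLM-123.",
}]
reply = assistant.generate_reply(messages=messages)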

4) Parallel agents are competing for memory

With GroupChat, every agent’s responses and tool output land in a shared transcript, and each round feeds that growing transcript back to the next speaker. With several agents, large contexts, and big tool outputs, memory usage multiplies fast.

from autogen import GroupChat, GroupChatManager

groupchat = GroupChat(agents=[agent1, agent2, agent3], messages=[], max_round=20)
manager = GroupChatManager(groupchat=groupchat, llm_config=llm_config)

Reduce rounds and constrain each agent’s prompt size.

groupchat = GroupChat(agents=[agent1, agent2], messages=[], max_round=6)
manager = GroupChatManager(groupchat=groupchat, llm_config=llm_config)
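
Another lever is capping each agent’s completion length so every round adds a bounded amount of text to the shared transcript. Passing max_tokens through llm_config is a sketch, not a guarantee; whether it is honored depends on the backend and provider.

# SKETCH: bound per-turn output so the transcript grows slowly
# (max_tokens is forwarded to the provider; support varies by backend)
llm_config = {
    "config_list": [
        {"model": "gpt-4o-mini", "api_key": "..."}
    ],
    "temperature": 0.2,
    "max_tokens": 512,
}

groupchat = GroupChat(agents=[agent1, agent2], messages=[], max_round=6)
manager = GroupChatManager(groupchat=groupchat, llm_config=llm_config)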

How to Debug It

  1. Reproduce with minimal input

    • Start with one user message and no tools.
    • If it works, add history back incrementally.
    • If it fails immediately, your model/backend config is likely the problem.
  2. Log message sizes

    • Print token counts or approximate character lengths before every inference call (a token-count sketch follows this list).
    • Watch for one huge tool response or a runaway conversation buffer.
# rough size check: role and character length of each message
for m in messages:
    print(m["role"], len(m["content"]))
  3. Disable tools and multi-agent flow

    • Run the same prompt through a single AssistantAgent.
    • If OOM disappears, the issue is probably tool output size or group chat amplification.
  4. Check backend limits

    • Verify model context window.
    • Verify GPU/CPU memory on local inference servers.
    • Check whether your provider is returning actual OOMs versus rate-limit or timeout errors masked as inference failures.
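
To measure token counts rather than character lengths (step 2 above), a tokenizer library such as tiktoken works for OpenAI-family models; other backends need their own tokenizer, so treat the counts as estimates. A minimal sketch, assuming tiktoken is installed:

# SKETCH: approximate prompt size in tokens before calling the model
import tiktoken

try:
    enc = tiktoken.encoding_for_model("gpt-4o-mini")
except KeyError:
    enc = tiktoken.get_encoding("cl100k_base")  # fallback for unknown model names

def count_tokens(messages):
    # per-message overhead varies by provider; content length dominates
    return sum(len(enc.encode(m["content"])) for m in messages)

print("approx prompt tokens:", count_tokens(messages))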

Prevention

  • Cap chat history aggressively.
    • Keep the last few turns and summarize older context (see the sketch after this list).
  • Never pass raw large artifacts into the prompt.
    • Store documents in object storage or a vector store; pass references or summaries.
  • Match model size to workload.
    • Small models for short tasks; larger context models for document-heavy flows.
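
A minimal sketch of the "keep recent turns, summarize the rest" pattern from the first bullet; summarize_turns is a hypothetical helper, typically a single call to a small, cheap model.

# SKETCH: replace old turns with one summary message, keep recent turns verbatim
# summarize_turns() is a hypothetical helper (e.g. one call to a small model)
def compact_history(messages, keep_last=6):
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = summarize_turns(older)
    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + recent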

If you’re building production AutoGen workflows for banking or insurance, treat prompt size like any other resource limit. Add hard caps early, because once agents start chaining tools and preserving state, OOM bugs become intermittent and painful to reproduce.

