# How to Fix 'OOM error during inference in production' in AutoGen (TypeScript)
## What this error means
OOM error during inference in production usually means your agent process ran out of memory while building prompts, storing chat history, or handling model responses. In AutoGen TypeScript, this shows up most often after a few turns in a long-running conversation, or when you fan out many agents and keep every message in memory.
The failure is rarely “the model is too big” by itself. It’s usually your orchestration code retaining too much context, or sending oversized payloads into AssistantAgent, UserProxyAgent, or your model client.
## The Most Common Cause
The #1 cause is unbounded conversation growth. In AutoGen, every turn can get appended to the message history, and if you keep reusing the same AssistantAgent instance without trimming context, token count and memory both climb until inference blows up.
Here’s the broken pattern:
```ts
import { AssistantAgent } from "@autogen/agentchat";
import { OpenAIChatCompletionClient } from "@autogen/openai";

const modelClient = new OpenAIChatCompletionClient({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_API_KEY!,
});

// Broken: one module-level agent shared by every request,
// so its history grows for the life of the process
const assistant = new AssistantAgent({
  name: "support_agent",
  modelClient,
});

async function handleRequest(userInput: string) {
  const result = await assistant.run([
    { role: "user", content: userInput },
  ]);
  return result;
}
```
And here's the fixed pattern. First the key changes, then the code:
| Broken | Fixed |
|---|---|
| Reuse one long-lived agent with growing history | Create per-request state or explicitly trim history |
| Let messages accumulate forever | Cap context window and summarize old turns |
| Keep raw documents in chat messages | Store docs externally and pass only references |
```ts
import { AssistantAgent } from "@autogen/agentchat";
import { OpenAIChatCompletionClient } from "@autogen/openai";

const modelClient = new OpenAIChatCompletionClient({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_API_KEY!,
});

async function handleRequest(userInput: string) {
  // Fixed: the agent and its history live only as long as this request
  const assistant = new AssistantAgent({
    name: "support_agent",
    modelClient,
  });

  const result = await assistant.run([
    // Cap input size so one oversized request can't blow up the prompt
    { role: "user", content: userInput.slice(0, 4000) },
  ]);
  return result;
}
```
If you need multi-turn memory, don’t keep everything. Summarize older turns and only send the last few messages plus a compact summary.
```ts
// `recentMessages` is your stored history; send a compact summary plus the tail
const trimmedHistory = [
  { role: "system", content: "Conversation summary: user wants refund status." },
  ...recentMessages.slice(-6),
];
```
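If you want that summary generated automatically, one approach is to run the older turns through a summarizer before each call. A minimal sketch, reusing the `AssistantAgent.run` shape from the examples above; the `summarizer` agent and the way the summary text is extracted are assumptions, so adapt them to your client's actual response type:

```ts
// Sketch: compact everything except the last few turns into one summary line.
// Assumes the same AssistantAgent.run() shape as the examples above.
type ChatMessage = { role: string; content: string };

const KEEP_RECENT = 6;

async function compactHistory(messages: ChatMessage[]): Promise<ChatMessage[]> {
  if (messages.length <= KEEP_RECENT) return messages;

  const older = messages.slice(0, -KEEP_RECENT);
  const summarizer = new AssistantAgent({ name: "summarizer", modelClient });
  const result = await summarizer.run([
    {
      role: "user",
      content:
        "Summarize this conversation in under 100 words:\n" +
        older.map((m) => `${m.role}: ${m.content}`).join("\n"),
    },
  ]);

  // How you extract the summary text depends on your client's result type
  return [
    { role: "system", content: `Conversation summary: ${String(result)}` },
    ...messages.slice(-KEEP_RECENT),
  ];
}
```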
## Other Possible Causes
### 1) You are sending huge tool outputs back into the model
A common failure mode is dumping full database rows, PDFs, logs, or JSON blobs into an agent message. The model client then allocates memory for a giant prompt payload before inference even starts.
```ts
// Bad: serializes the entire result set into one message
await assistant.run([
  {
    role: "tool",
    content: JSON.stringify(bigResultSet), // can explode memory
  },
]);

// Better: send only a sample of the rows
await assistant.run([
  {
    role: "tool",
    content: JSON.stringify(bigResultSet.slice(0, 20)),
  },
]);
```
If you need full data, store it in S3, Blob Storage, or a database and pass a pointer:
```ts
{
  role: "tool",
  content: `Query result stored at s3://case-files/12345.json`,
}
```
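To make this systematic, gate every tool result through a size check before it enters the chat. A minimal sketch; the 8,000-character threshold and the `storeArtifact()` helper are assumptions, so wire the helper to whatever store you actually use:

```ts
// Sketch: cap tool output size before it reaches the model.
// storeArtifact() is a hypothetical helper for S3 / Blob Storage / a database.
declare function storeArtifact(data: string): Promise<string>;

const MAX_TOOL_OUTPUT_CHARS = 8_000; // assumption: tune for your prompt budget

async function capToolOutput(payload: unknown): Promise<string> {
  const text = JSON.stringify(payload);
  if (text.length <= MAX_TOOL_OUTPUT_CHARS) return text;

  // Persist the full payload and hand the model a pointer instead
  const ref = await storeArtifact(text);
  return `Result too large for chat (${text.length} chars). Stored at ${ref}.`;
}
```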
### 2) Your maxTokens setting is too high for production traffic
A large `maxTokens` value raises per-request output buffers and can push your worker over the edge when traffic spikes.
```ts
// Risky: every request can allocate room for a 12k-token response
const modelClient = new OpenAIChatCompletionClient({
  model: "gpt-4o",
  apiKey: process.env.OPENAI_API_KEY!,
  maxTokens: 12000,
});
```
Use tighter limits unless you truly need long generation:
```ts
const modelClient = new OpenAIChatCompletionClient({
  model: "gpt-4o",
  apiKey: process.env.OPENAI_API_KEY!,
  maxTokens: 1024,
});
```
### 3) You are running too many concurrent agents
AutoGen makes it easy to spin up parallel conversations. If each request creates multiple AssistantAgent instances and they all hit the same worker at once, memory pressure climbs fast.
```ts
// Bad: every request runs at once, so peak memory scales with queue depth
await Promise.all(requests.map((r) => processRequest(r)));
```
Throttle concurrency:
```ts
import pLimit from "p-limit";

// Allow at most 4 requests in flight per worker
const limit = pLimit(4);

await Promise.all(
  requests.map((r) => limit(() => processRequest(r)))
);
```
### 4) You are keeping large objects inside agent state
Don’t attach raw files, parsed documents, or giant arrays to custom agent wrappers. In Node.js services, this often looks harmless until the worker starts swapping or crashes with OOM.
```ts
// Bad: pins the full parsed document set in memory for the agent's lifetime
class MyAgentState {
  documents = hugeParsedPdfArray;
}
```
Keep state small:
```ts
class MyAgentState {
  documentIds = ["doc_123", "doc_456"];
}
```
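When a tool actually needs the content, resolve the ID against external storage so the data lives only for the duration of that call. A minimal sketch; `fetchDocument()` is a hypothetical loader for your store:

```ts
// Sketch: load a document on demand instead of pinning it in agent state.
// fetchDocument() is a hypothetical helper for your external store.
declare function fetchDocument(id: string): Promise<string>;

async function loadForTool(state: MyAgentState, id: string): Promise<string> {
  if (!state.documentIds.includes(id)) {
    throw new Error(`Unknown document: ${id}`);
  }
  // The document is loaded, used, and then garbage-collected
  return fetchDocument(id);
}
```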
## How to Debug It
1. Check whether memory grows per request
   - Watch RSS and heap usage in production (see the logging sketch after this list).
   - If memory rises steadily with each conversation, you have retained history or leaked objects.
2. Log prompt size before calling the model
   - Measure message count and total characters.
   - If one request is sending hundreds of KB or MB of text, that's your problem.

   ```ts
   const promptSize = messages.reduce((sum, m) => sum + m.content.length, 0);
   console.log({ messageCount: messages.length, promptSize });
   ```

3. Disable tools and retry
   - If the OOM disappears when tools are off, one tool is returning oversized payloads.
   - Start with file loaders, search tools, SQL tools, and web fetchers.
4. Reduce concurrency to one
   - If single-request mode works but production fails under load, this is capacity pressure.
   - Add a queue or rate limiter before scaling worker count.
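For the first check, Node.js exposes process memory directly via the standard `process.memoryUsage()` API. A minimal logging sketch:

```ts
// Log RSS and heap so you can see whether memory grows per conversation
function logMemory(label: string) {
  const { rss, heapUsed } = process.memoryUsage();
  console.log({
    label,
    rssMb: Math.round(rss / 1024 / 1024),
    heapUsedMb: Math.round(heapUsed / 1024 / 1024),
  });
}

// Usage: call this before and after each request handler
logMemory("before handleRequest");
```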
## Prevention
- Keep agent context short.
  - Use summaries for old turns.
  - Cap recent messages to a small window.
- Treat tool output size as untrusted.
  - Truncate large responses.
  - Persist big artifacts outside chat memory.
- Put hard limits in code.

  ```ts
  if (messages.length > 12) {
    throw new Error("Conversation too long for this worker");
  }
  ```
If you're seeing an OOM error during inference in production with AutoGen TypeScript, look at context growth first. In most real systems I've seen, it's the root cause more often than anything else.
## Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.