How to Fix 'OOM error during inference' in AutoGen (TypeScript)

By Cyprian Aarons · Updated 2026-04-21
Tags: oom-error-during-inference, autogen, typescript

What the error means

OOM error during inference usually means your process ran out of memory while AutoGen was generating a response or streaming tokens, typically because too much conversation state was being held in RAM. In TypeScript projects, it often shows up after a few turns in a chat loop, when you pass huge messages into an agent, or when you keep every tool result and model output forever.

The important part: this is usually not an AutoGen bug. It’s almost always a state-management problem in your app.

The Most Common Cause

The #1 cause is unbounded chat history. With AssistantAgent, UserProxyAgent, or any custom agent loop, developers often keep appending every message, tool output, and intermediate step, then pass the entire array into the next inference call.

That works for 2–3 turns. Then the prompt grows until the model provider or your runtime throws something like:

  • OOM error during inference
  • Error: Request too large
  • RangeError: Invalid string length
  • JavaScript heap out of memory

Broken vs fixed pattern

Broken pattern                                   Fixed pattern
Reuse the full message array forever             Trim history before each inference
Store raw tool outputs in chat                   Summarize or extract only needed fields
Pass large documents directly into every turn    Chunk and retrieve only relevant context

// BROKEN: chat history grows without bound
import { AssistantAgent } from "@autogen/agent";

const agent = new AssistantAgent({
  name: "support-agent",
  modelClient,
});

const messages: any[] = [];

async function runTurn(userText: string) {
  messages.push({ role: "user", content: userText });

  const result = await agent.run(messages);

  // This keeps every assistant/tool message forever
  messages.push(...result.messages);
}

// FIXED: keep only the last N messages or summarize older context
import { AssistantAgent } from "@autogen/agent";

const agent = new AssistantAgent({
  name: "support-agent",
  modelClient,
});

type ChatMessage = { role: "user" | "assistant" | "tool"; content: string };

let messages: ChatMessage[] = [];
const MAX_TURNS = 8;

function trimHistory(history: ChatMessage[]) {
  return history.slice(-MAX_TURNS * 2); // rough cap for user+assistant pairs
}

async function runTurn(userText: string) {
  messages.push({ role: "user", content: userText });

  const result = await agent.run(trimHistory(messages));

  messages.push(
    ...result.messages.map((m) => ({
      role: m.role as ChatMessage["role"],
      content: String(m.content),
    }))
  );

  messages = trimHistory(messages);
}

If your workflow needs long context, don’t stuff it all into chat history. Summarize old turns, store details externally, and retrieve only what matters for the current request.
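
For example, instead of dropping older turns entirely, you can fold them into a single summary message. This is a minimal sketch that reuses the ChatMessage type and MAX_TURNS cap from the fixed example above; summarize is a hypothetical helper that condenses the old turns with one extra model call.

// SKETCH: compact older turns into one summary message instead of dropping them.
// Assumes the ChatMessage type and MAX_TURNS cap from the fixed example above.
// `summarize` is a hypothetical helper that condenses old turns via one model call.
async function compactHistory(history: ChatMessage[]): Promise<ChatMessage[]> {
  const recent = history.slice(-MAX_TURNS * 2);
  const older = history.slice(0, -MAX_TURNS * 2);

  if (older.length === 0) return recent;

  const summary = await summarize(older); // e.g. "User reported an OOM; agent suggested trimming history."

  return [
    { role: "assistant", content: `Summary of earlier conversation: ${summary}` },
    ...recent,
  ];
}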

Other Possible Causes

1) Huge tool outputs

A tool returning a full database dump, PDF text, or API payload can blow up memory fast.

// BAD
const toolResult = await fetch("https://api.example.com/export").then(r => r.text());
messages.push({ role: "tool", content: toolResult });

Fix it by truncating or extracting fields before sending back to the agent.

// GOOD
const raw = await fetch("https://api.example.com/export").then(r => r.json());

messages.push({
  role: "tool",
  content: JSON.stringify({
    itemCount: raw.items.length,
    topItems: raw.items.slice(0, 10),
  }),
});
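
When you don't control the payload shape, a blunt character cap is still better than forwarding whatever the tool returned. Here is a rough sketch; the 4,000-character limit is an arbitrary placeholder, not a recommended value.

// SKETCH: hard-cap tool output before it enters chat history.
// The 4,000-character default is an arbitrary placeholder; tune it for your model.
function capToolOutput(output: unknown, maxChars = 4_000): string {
  const text = typeof output === "string" ? output : JSON.stringify(output);
  if (text.length <= maxChars) return text;
  return text.slice(0, maxChars) + ` …[truncated ${text.length - maxChars} chars]`;
}

messages.push({ role: "tool", content: capToolOutput(raw) });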

2) Large files loaded into memory at once

If you read PDFs, logs, or transcripts with fs.readFileSync, you can spike heap usage before inference even starts.

// BAD
const transcript = fs.readFileSync("./big-transcript.txt", "utf8");

Prefer streaming, chunking, or pre-indexing.

// GOOD
// loadTextInChunks and splitIntoChunks are app-specific helpers that read and
// slice the file incrementally instead of building one giant string.
const chunks = splitIntoChunks(await loadTextInChunks("./big-transcript.txt"));
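
If you'd rather not write those helpers yourself, Node's stream API already lets you process a file piece by piece without holding it all in memory. A minimal sketch; indexChunk stands in for whatever per-chunk handling your pipeline needs.

// SKETCH: stream the transcript in ~64 KB pieces instead of one readFileSync call.
// `indexChunk` is a placeholder for your own handler (summarize, embed, store, ...).
import { createReadStream } from "node:fs";

const stream = createReadStream("./big-transcript.txt", {
  encoding: "utf8",
  highWaterMark: 64 * 1024,
});

for await (const chunk of stream) {
  await indexChunk(chunk);
}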

3) Too many concurrent runs

Running many agent.run() calls in parallel can exhaust memory even if each request is small.

// BAD
await Promise.all(jobs.map(job => agent.run(job.messages)));

Throttle concurrency.

// GOOD
for (const job of jobs) {
  await agent.run(job.messages);
}

Or use a queue with a hard concurrency limit like p-limit.
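
If fully sequential processing is too slow, p-limit gives you parallelism with a hard ceiling. A sketch; the limit of 3 is arbitrary and should match your memory budget and provider rate limits.

// SKETCH: run at most 3 agent calls at a time instead of an unbounded Promise.all.
import pLimit from "p-limit";

const limit = pLimit(3);

const results = await Promise.all(
  jobs.map((job) => limit(() => agent.run(job.messages)))
);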

4) Model context window mismatch

Sometimes the issue is not RAM but token explosion. If you send more tokens than the model can accept, providers may fail in ways that look like memory issues.

const agent = new AssistantAgent({
  name: "legal-agent",
  modelClient,
  // Use a model with enough context for your workload
});

Check your provider’s context window and keep prompts below it with margin.
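
A crude pre-flight check catches most of these before the request leaves your process. Roughly 4 characters per token is a common rule of thumb for English text; the sketch below reuses the ChatMessage type, messages array, and trimHistory helper from the fixed example above, and the context limit is a placeholder, use whatever your provider documents for your model.

// SKETCH: rough token budget check before calling the model.
// MODEL_CONTEXT_TOKENS is a placeholder; use your provider's documented limit.
const MODEL_CONTEXT_TOKENS = 128_000;
const SAFETY_MARGIN = 0.8; // leave ~20% headroom for the system prompt and the response

function estimateTokens(history: ChatMessage[]): number {
  const chars = history.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4); // ~4 chars per token is a crude heuristic
}

if (estimateTokens(messages) > MODEL_CONTEXT_TOKENS * SAFETY_MARGIN) {
  messages = trimHistory(messages); // or summarize older turns, as shown above
}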

How to Debug It

  1. Log prompt size before every call

    • Print message count and approximate character length.
    • If size keeps climbing across turns, you found the bug.
  2. Isolate the last successful payload

    • Remove tools first.
    • Then remove long documents.
    • Then reduce history to one turn.
    • The smallest failing payload tells you which input is exploding memory.
  3. Watch process memory

    • In Node.js, log process.memoryUsage() before and after agent.run().
    • If heap jumps sharply on one request, inspect that request’s inputs and tool outputs.
  4. Test with capped history

    • Hard-limit to the last 4–8 turns.
    • If the OOM disappears immediately, your problem is conversation growth, not the model client.
// Example: log memory and prompt size around each agent.run() call (steps 1 and 3)
const mem = process.memoryUsage();

console.log({
  rssMB: Math.round(mem.rss / 1024 / 1024),
  heapUsedMB: Math.round(mem.heapUsed / 1024 / 1024),
  messageCount: messages.length,
  approxPromptChars: messages.reduce((sum, m) => sum + String(m.content).length, 0),
});

Prevention

  • Cap chat history from day one.
  • Summarize old turns instead of keeping raw transcripts.
  • Treat tool outputs as untrusted payloads; return only what the next step needs.
  • Put concurrency limits on batch agent runs.
  • Add memory logging around every AssistantAgent.run() call in staging before shipping to production.

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

