How to Fix 'OOM error during inference when scaling' in AutoGen (TypeScript)
When AutoGen throws an "OOM error during inference when scaling," it usually means your agent graph is creating more concurrent model work than the runtime can hold in memory. In TypeScript projects, this shows up most often when you scale from a single conversation to many parallel chats, tool calls, or nested agents.
The failure is rarely “the model is too big” by itself. It’s usually a concurrency, context growth, or response buffering problem inside your AutoGen setup.
The Most Common Cause
The #1 cause is unbounded parallelism: you spin up too many AssistantAgent runs at once, and each one holds prompt state, tool results, and streamed tokens in memory.
Here’s the broken pattern:
import { AssistantAgent } from "@autogen/core";

const agent = new AssistantAgent({
  name: "support-agent",
  modelClient,
});

const tickets = await getOpenTickets();

// Broken: fires all requests at once
const results = await Promise.all(
  tickets.map((ticket) =>
    agent.run(`Resolve ticket ${ticket.id}: ${ticket.summary}`)
  )
);
And here’s the fixed pattern:
import pLimit from "p-limit";
import { AssistantAgent } from "@autogen/core";

const agent = new AssistantAgent({
  name: "support-agent",
  modelClient,
});

const limit = pLimit(3); // keep concurrency bounded
const tickets = await getOpenTickets();

const results = await Promise.all(
  tickets.map((ticket) =>
    limit(() => agent.run(`Resolve ticket ${ticket.id}: ${ticket.summary}`))
  )
);
What changed:
- Promise.all() no longer floods the runtime with dozens of active inference calls.
- You cap concurrent model executions with p-limit.
- Memory spikes stay predictable instead of growing with input size.
If you are using multiple agents, the same rule applies to GroupChatManager, nested AssistantAgent calls, and any orchestration layer that fans out work.
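As a minimal sketch, one way to enforce this across layers is to route every inference-triggering call through a single shared limiter. The pLimit and AssistantAgent usage mirrors the fixed pattern above; runBounded is a hypothetical helper name:

import pLimit from "p-limit";
import { AssistantAgent } from "@autogen/core";

// One shared limiter for the whole process, so nested agents and
// orchestration layers cannot multiply concurrency behind your back.
const globalLimit = pLimit(3);

// Hypothetical helper: wrap every agent call, top-level or nested.
async function runBounded(agent: AssistantAgent, prompt: string) {
  return globalLimit(() => agent.run(prompt));
}

If any orchestration path bypasses the helper, the cap no longer holds, so route all fan-out through it.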
Other Possible Causes
1. Conversation history is growing without truncation
AutoGen keeps message history unless you explicitly trim it. Long-running sessions can blow up token usage and memory at the same time.
// Bad: unlimited history
await agent.run("Continue helping the user");
Fix it by trimming or summarizing older messages before each run:
const trimmedHistory = history.slice(-10);

await agent.run({
  messages: trimmedHistory,
});
If your version supports it, use a summary step every N turns instead of keeping full transcripts forever.
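A minimal sketch of that summary step, assuming a plain { role, content } message shape; summarizerAgent is a hypothetical second agent (reusing the modelClient from the snippets above) used only for compaction:

import { AssistantAgent } from "@autogen/core";

type Message = { role: string; content: string };

// Hypothetical compaction agent; any cheap model call works here.
const summarizerAgent = new AssistantAgent({
  name: "summarizer",
  modelClient,
});

const SUMMARIZE_AFTER = 8; // compact once history grows past this

async function compactHistory(history: Message[]): Promise<Message[]> {
  if (history.length <= SUMMARIZE_AFTER) return history;

  const older = history.slice(0, -4); // everything but the last few turns
  const recent = history.slice(-4);

  const summary = await summarizerAgent.run(
    "Summarize this conversation in under 100 words:\n" +
      older.map((m) => `${m.role}: ${m.content}`).join("\n")
  );

  // Replace the old turns with one short summary message.
  return [{ role: "system", content: `Summary so far: ${summary}` }, ...recent];
}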
2. Large tool outputs are being injected into the prompt
A common mistake is returning huge JSON payloads from tools and passing them straight back into the model.
// Bad: returns a massive object to the LLM
tools: [
  {
    name: "getPolicyData",
    execute: async () => fetchAllPolicyRecords(),
  },
];
Fix it by returning only what the agent needs:
tools: [
  {
    name: "getPolicyData",
    execute: async () => {
      const data = await fetchAllPolicyRecords();
      return data.slice(0, 20).map(({ id, status }) => ({ id, status }));
    },
  },
];
Keep tool responses compact. If you need full records, store them outside the prompt and pass references.
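A minimal sketch of that reference-passing approach, keeping the fragment style of the tool snippets above, with a hypothetical in-memory recordStore standing in for Redis or a database:

import { randomUUID } from "node:crypto";

// Hypothetical store: full payloads live here, never in the prompt.
const recordStore = new Map<string, unknown>();

tools: [
  {
    name: "getPolicyData",
    execute: async () => {
      const data = await fetchAllPolicyRecords();
      const ref = randomUUID();
      recordStore.set(ref, data); // heavy payload stays out of the context
      return {
        ref, // a follow-up tool can look this up in recordStore
        preview: data.slice(0, 5).map(({ id, status }) => ({ id, status })),
      };
    },
  },
];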
3. Streaming responses are buffered in memory
Some setups accumulate streamed chunks before processing them. That defeats streaming and can trigger OOM under load.
// Bad: buffers everything
let fullText = "";
for await (const chunk of stream) {
  fullText += chunk;
}
Prefer incremental handling:
for await (const chunk of stream) {
  process.stdout.write(chunk);
}
If your wrapper library collects chunks internally, check whether it exposes an event-driven callback instead of returning one giant string.
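If you control the wrapper yourself, a minimal sketch of a callback-based API looks like this. It assumes modelClient.stream() yields string chunks as an async iterable, which is an assumption to verify against your client:

// Assumption: modelClient.stream() returns an async iterable of strings.
async function streamWithCallback(
  prompt: string,
  onChunk: (chunk: string) => void
): Promise<void> {
  for await (const chunk of modelClient.stream(prompt)) {
    onChunk(chunk); // handled and released immediately; nothing accumulates
  }
}

// Usage: write chunks straight through instead of collecting a giant string.
await streamWithCallback("Summarize the claim", (c) => process.stdout.write(c));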
4. Your model context window is too large for your workload
A large context window does not mean free memory. Bigger prompts cost more RAM and more KV cache during inference.
const agent = new AssistantAgent({
  name: "claims-agent",
  modelClient,
  systemMessage: longPolicyManual,
});
Reduce prompt size:
const agent = new AssistantAgent({
  name: "claims-agent",
  modelClient,
  systemMessage: "You are a claims assistant. Follow company policy.",
});
Move static policy docs into retrieval instead of pasting them into every prompt.
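A minimal sketch of that retrieval step, where searchPolicyIndex is a hypothetical helper over whatever index you use (vector store, keyword search, etc.):

// Hypothetical retrieval helper: returns the top-k relevant excerpts.
declare function searchPolicyIndex(
  query: string,
  opts: { topK: number }
): Promise<string[]>;

async function runClaimsQuery(question: string) {
  const snippets = await searchPolicyIndex(question, { topK: 3 });
  // Only the relevant slices of the manual enter the prompt.
  return agent.run(
    `Relevant policy excerpts:\n${snippets.join("\n---\n")}\n\nQuestion: ${question}`
  );
}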
How to Debug It
- Check whether OOM appears only under concurrency
  - Run one request.
  - Then run five.
  - Then run fifty.
  - If failure starts with parallel load, you have a fan-out problem, not a single-request bug.
- Log prompt size and tool output size (see the logging sketch after this list)
  - Print message count.
  - Print serialized tool result length.
  - Print token estimates if your client exposes them.
  - If one request gets huge before crashing, history or tool payloads are the likely culprit.
- Disable streaming temporarily
  - If memory stabilizes without streaming, your buffering layer is holding too much state.
  - Check wrappers around modelClient.stream() or custom adapters.
- Profile Node.js heap usage
  - Run with: node --inspect --max-old-space-size=4096 dist/index.js
  - Watch heap growth during bursts.
  - If heap climbs linearly with requests, something is retaining messages or responses.
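Here is the logging sketch referenced above. history and toolResult are whatever your orchestration layer already holds; the token estimate is a rough chars-divided-by-four heuristic, not a real tokenizer:

function logRequestSize(
  history: { content: string }[],
  toolResult: unknown
): void {
  const serialized = JSON.stringify(toolResult) ?? "";
  const promptChars = history.reduce((n, m) => n + m.content.length, 0);
  console.log({
    messageCount: history.length,
    toolResultBytes: serialized.length,
    approxPromptTokens: Math.round(promptChars / 4), // crude estimate
  });
}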
Prevention
- Cap concurrency everywhere you fan out AutoGen work:
  - p-limit
  - queue workers
  - per-user request throttles
- Keep prompts small:
  - summarize old turns
  - truncate tool output
  - move documents to retrieval instead of inline context
- Treat every agent run as bounded work (a hedged config sketch follows this list):
  - set max turns
  - set max tokens where supported
  - avoid recursive agent spawning without limits
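As a sketch only: if your version of @autogen/core exposes per-run bounds, they would typically be set at construction or call time. The option names below are assumptions, not confirmed API; check your package's typings for the real ones.

const boundedAgent = new AssistantAgent({
  name: "support-agent",
  modelClient,
  // Assumed option names; verify against your @autogen/core version.
  maxTurns: 6, // stop multi-step runs from looping indefinitely
  maxTokens: 1024, // cap per-response generation where supported
});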
If you're seeing an "OOM error during inference when scaling" in AutoGen with TypeScript, start by hunting for Promise.all() over agent calls. In practice, that's the root cause more often than anything else.
Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist + starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.