How to Fix 'OOM error during inference when scaling' in CrewAI (TypeScript)
What this error means
OOM error during inference when scaling usually means your CrewAI process is asking the model runtime for more memory than the container, node, or local machine can give it. In TypeScript projects, this shows up most often when you scale from one agent/task to many and keep loading large prompts, chat history, or tool outputs into every inference call.
The failure is usually not “CrewAI is broken.” It’s a memory pressure problem caused by prompt growth, parallel execution, or oversized context being sent to the model.
The Most Common Cause
The #1 cause is uncontrolled context accumulation inside agent/task loops.
A common anti-pattern is reusing one Agent instance with a growing message history, then fanning out many tasks at once. Each inference call gets bigger until the runtime throws something like:
- Error: OOM error during inference when scaling
- RuntimeError: CUDA out of memory
- Failed to create completion: context length exceeded
- AgentExecutionError: LLM inference failed
Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Reuses one agent with full history for every task | Creates bounded context per task |
| Sends raw tool output back into the next prompt | Summarizes or truncates tool output |
| Runs too many tasks concurrently | Caps concurrency |
// ❌ Broken: unbounded context growth
import { Agent, Task, Crew } from "crewai";

const analyst = new Agent({
  role: "Analyst",
  goal: "Analyze customer claims",
  backstory: "You work on insurance claims.",
});

// Every task's prompt embeds the full output of every prior task,
// so prompts grow with each claim processed.
const tasks = bigClaims.map((claim) =>
  new Task({
    description: `
      Analyze this claim:
      ${JSON.stringify(claim)}
      Previous analysis:
      ${allPreviousResults.join("\n")}
    `,
    agent: analyst,
  })
);

const crew = new Crew({
  agents: [analyst],
  tasks,
});

await crew.kickoff();
// ✅ Fixed: bounded context + per-task input trimming
import { Agent, Task, Crew } from "crewai";

const analyst = new Agent({
  role: "Analyst",
  goal: "Analyze customer claims",
  backstory: "You work on insurance claims.",
});

// Cap how much of each claim can reach the prompt.
function trimClaim(claim: unknown): string {
  const text = JSON.stringify(claim);
  return text.slice(0, 4000); // keep prompts bounded
}

const tasks = bigClaims.map((claim) =>
  new Task({
    description: `
      Analyze this claim:
      ${trimClaim(claim)}
      Return only:
      - risk score
      - key reasons
      - next action
    `,
    agent: analyst,
  })
);

const crew = new Crew({
  agents: [analyst],
  tasks,
});

await crew.kickoff();
If you are passing prior outputs forward, keep only a compact summary.
const summary = previousResult.slice(0, 1200);
Do not keep appending raw transcripts.
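If you need some continuity between tasks, a bounded rolling summary is safer than an ever-growing transcript. A minimal sketch, assuming a fixed character budget (carryForward and MAX_SUMMARY_CHARS are illustrative helpers, not CrewAI APIs):

// Keep one bounded summary instead of an ever-growing transcript.
const MAX_SUMMARY_CHARS = 1200;

function carryForward(previousSummary: string, newResult: string): string {
  // Append the newest result, then trim from the front so the
  // oldest context falls off first and the budget stays fixed.
  return `${previousSummary}\n${newResult}`.slice(-MAX_SUMMARY_CHARS);
}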
Other Possible Causes
1) Too much concurrency
If you run many agents/tasks in parallel, each one consumes its own memory budget. On smaller containers this explodes quickly.
// ❌ Too much parallelism: every task hits the model at once
await Promise.all(tasks.map((task) => crew.runTask(task)));

// ✅ Limit concurrency
import pLimit from "p-limit";

const limit = pLimit(2);
await Promise.all(
  tasks.map((task) => limit(() => crew.runTask(task)))
);
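If you would rather not add a dependency, fixed-size batches give a similar cap. A rough sketch under the same assumptions as above (crew.runTask stands in for however you execute a single task):

// Run tasks in batches; at most `batchSize` inference calls
// are in flight at any moment.
async function runInBatches<T>(
  items: T[],
  batchSize: number,
  run: (item: T) => Promise<unknown>
): Promise<void> {
  for (let i = 0; i < items.length; i += batchSize) {
    await Promise.all(items.slice(i, i + batchSize).map(run));
  }
}

await runInBatches(tasks, 2, (task) => crew.runTask(task));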
2) Large tool outputs being injected into prompts
A tool that returns a full PDF, HTML page, or database dump will bloat the next inference request.
// ❌ Raw tool output goes straight into the prompt
const result = await fetchPolicyDocument();
task.description += `\n\nDocument:\n${result}`;
// ✅ Summarize before passing to the agent
const result = await fetchPolicyDocument();
task.description += `\n\nDocument summary:\n${result.slice(0, 2000)}`;
3) Model/context window mismatch
Some models have smaller context windows than people assume. If your prompt plus history exceeds it, you may see inference failures that look like OOM under load.
// ❌ Using a small-context model for long inputs
const agent = new Agent({
  role: "Reviewer",
  goal: "Review long claims files",
  llm: {
    model: "gpt-4o-mini",
    temperature: 0,
  },
});

// ✅ Use a model with enough context for the workload
const agent = new Agent({
  role: "Reviewer",
  goal: "Review long claims files",
  llm: {
    model: "gpt-4.1",
    temperature: 0,
  },
});
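Before swapping models, check whether your prompts actually fit. A common rough heuristic is about 4 characters per token for English text; the estimateTokens helper below is an approximation, not a real tokenizer, and the budget is illustrative, so check your model's documented limit:

// Rough estimate: ~4 characters per token for English text.
// Use your model's actual tokenizer for exact counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

const CONTEXT_BUDGET_TOKENS = 100_000; // illustrative; check your model's limit

for (const task of tasks) {
  const tokens = estimateTokens(task.description);
  if (tokens > CONTEXT_BUDGET_TOKENS) {
    console.warn(`Task prompt is ~${tokens} tokens, over budget`);
  }
}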
4) Node.js process memory ceiling
Sometimes the OOM is not the model at all. Your TypeScript service may be hitting Node’s heap limit while building huge arrays of prompts/results.
# Increase heap for local debugging
node --max-old-space-size=4096 dist/index.js
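You can also watch the heap from inside the process. process.memoryUsage() is a standard Node.js API; the interval below is just a debugging aid:

// Log heap usage every 10 seconds while debugging.
const memLog = setInterval(() => {
  const { heapUsed, heapTotal } = process.memoryUsage();
  console.log(
    `heap: ${(heapUsed / 1e6).toFixed(1)}MB of ${(heapTotal / 1e6).toFixed(1)}MB`
  );
}, 10_000);

// Call clearInterval(memLog) when you are done profiling.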
Also check whether your code stores every intermediate result in memory:
// ❌ Keeps all results in memory forever
const results: unknown[] = [];
for (const task of tasks) {
  results.push(await crew.runTask(task));
}
Use streaming persistence instead:
// ✅ Persist each result and release memory
for (const task of tasks) {
  const result = await crew.runTask(task);
  await saveResult(result); // e.g. write to disk or a database
}
How to Debug It
- Check where the spike happens.
  - If memory grows before calling the LLM, it’s likely your TypeScript process.
  - If memory spikes during completion calls, it’s likely prompt size or concurrency.
- Log prompt size per task:
  console.log("prompt chars:", task.description.length);
  If some tasks are dramatically larger than others, you found the culprit.
- Disable parallelism. Run one task at a time. If the OOM disappears, concurrency is your problem.
- Inspect tool outputs and history. Look for:
  - full JSON blobs being appended repeatedly
  - long chat histories carried across agents
  - documents pasted directly into prompts
A practical check:
console.log({
  descriptionChars: task.description.length,
  historyItems: conversationHistory.length,
});
If those numbers keep climbing across iterations, you need truncation or summarization.
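To pinpoint which step allocates, you can compare heap usage before and after each task. A sketch reusing the crew.runTask call from earlier (deltas are noisy because of garbage collection, so look for a consistent upward trend rather than exact numbers):

// Measure heap growth around each inference call.
for (const task of tasks) {
  const before = process.memoryUsage().heapUsed;
  await crew.runTask(task);
  const after = process.memoryUsage().heapUsed;
  console.log(`heap delta: ${((after - before) / 1e6).toFixed(1)}MB`);
}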
Prevention
- Keep prompts bounded. Truncate raw inputs.
- Summarize tool output before reusing it.
- Cap concurrency. Start with 1-2 concurrent tasks and increase only after measuring memory.
- Treat context as a budget. Don’t pass full transcripts between agents unless there’s a hard reason. A budget-enforcing sketch follows this list.
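One concrete way to enforce a context budget is to give each prompt section a fixed character share and clip when the prompt is built. A sketch with illustrative budgets:

// Illustrative per-section character budgets.
const BUDGET = { claim: 4000, history: 1200, toolOutput: 2000 };

function buildPrompt(claim: string, history: string, toolOutput: string): string {
  return [
    `Claim:\n${claim.slice(0, BUDGET.claim)}`,
    `History:\n${history.slice(0, BUDGET.history)}`,
    `Tool output:\n${toolOutput.slice(0, BUDGET.toolOutput)}`,
  ].join("\n\n");
}

However large the inputs get, the prompt stays the same size.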
If you want a stable CrewAI TypeScript setup in production, design for fixed-size inputs and predictable fan-out. Most “OOM error during inference when scaling” incidents are just uncontrolled prompt growth wearing a different label.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit