How to Fix 'deployment crash when scaling' in CrewAI (TypeScript)

By Cyprian Aarons · Updated 2026-04-21

When CrewAI deployments crash only after you scale replicas or workers, it usually means the agent process is carrying state it should not carry. In practice, the crash shows up when multiple instances start at once and hit shared resources like memory, file handles, Redis keys, model clients, or non-serializable objects.

The pattern is almost always the same: it works locally with one process, then falls over in Kubernetes, ECS, or a Node cluster when concurrency increases.

The Most Common Cause

The #1 cause is sharing mutable singleton state across scaled instances. In CrewAI TypeScript projects, this usually means creating Agent, Task, Crew, or tool instances at module scope and reusing them across requests.

That breaks when the runtime clones workers or restarts pods because the object graph may include open connections, cached callbacks, or request-specific context. The result is often a crash like:

  • Error: Cannot read properties of undefined (reading 'run')
  • TypeError: Converting circular structure to JSON
  • CrewAIError: Failed to serialize task state
  • Worker exited with code 1

Broken vs fixed pattern

Broken                                          | Fixed
Creates CrewAI objects once at import time      | Creates fresh instances per request/job
Reuses mutable tool/client state across workers | Uses factory functions and dependency injection
Assumes single-process behavior                 | Assumes horizontal scaling from day one

// ❌ Broken: shared singleton state
import { Agent, Task, Crew } from "@crewai/typescript";
import { OpenAI } from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const agent = new Agent({
  role: "Support analyst",
  goal: "Summarize tickets",
  llm: client,
});

const task = new Task({
  description: "Summarize the latest support ticket",
  agent,
});

const crew = new Crew({
  agents: [agent],
  tasks: [task],
});

export async function handleRequest() {
  return await crew.run();
}

// ✅ Fixed: create per-request instances
import { Agent, Task, Crew } from "@crewai/typescript";
import { OpenAI } from "openai";

function buildCrew() {
  const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

  const agent = new Agent({
    role: "Support analyst",
    goal: "Summarize tickets",
    llm: client,
  });

  const task = new Task({
    description: "Summarize the latest support ticket",
    agent,
  });

  return new Crew({
    agents: [agent],
    tasks: [task],
  });
}

export async function handleRequest() {
  const crew = buildCrew();
  return await crew.run();
}

If you are using a web server or queue consumer, treat CrewAI objects as request-scoped, not app-singletons.
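
For a concrete example, here is what request-scoped construction looks like in an Express handler. This is a minimal sketch, assuming the buildCrew() factory above; Express and the "./crew" module path are illustrative choices, not part of CrewAI.

// Sketch: request-scoped crew in an Express handler.
import express from "express";
import { buildCrew } from "./crew"; // hypothetical module exporting the factory above

const app = express();

app.post("/summarize", async (_req, res) => {
  // A fresh crew per request: no connections or context leak across requests.
  const crew = buildCrew();
  try {
    const result = await crew.run();
    res.json({ result });
  } catch (err) {
    res.status(500).json({ error: (err as Error).message });
  }
});

app.listen(3000);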

Other Possible Causes

1) Non-serializable tool output

If a tool returns a class instance, stream, buffer, circular object, or database connection, scaling can expose serialization failures.

// ❌ Broken
const toolResult = {
  user: prisma.user.findUnique({ where: { id } }), // unresolved Promise, not JSON-safe
};

// ✅ Fixed
const user = await prisma.user.findUnique({ where: { id } });
return {
  id: user?.id,
  email: user?.email,
};

Common runtime symptoms include:

  • TypeError: Converting circular structure to JSON
  • CrewAIError: Tool output must be serializable
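
A cheap guard is to round-trip tool output through JSON before returning it, which turns a latent scaling crash into an immediate, local error. A minimal sketch; assertJsonSafe is a hypothetical helper, not a CrewAI API:

// Sketch: fail fast on non-serializable tool output.
function assertJsonSafe<T>(value: T): T {
  // JSON.stringify throws "Converting circular structure to JSON" on cycles;
  // the round trip also flattens class instances and drops functions,
  // so the returned copy is guaranteed to be plain JSON data.
  return JSON.parse(JSON.stringify(value)) as T;
}

// Usage inside a tool, with the fields from the example above:
// return assertJsonSafe({ id: user?.id, email: user?.email });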

2) Missing env vars in scaled pods

One pod may have OPENAI_API_KEY, another may not. That gives you intermittent crashes that look like app instability.

# ❌ Broken deployment snippet
env:
  - name: NODE_ENV
    value: production

# ✅ Fixed deployment snippet
env:
  - name: NODE_ENV
    valueFrom:
      configMapKeyRef:
        name: app-config
        key: NODE_ENV
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: llm-secrets
        key: OPENAI_API_KEY

Typical error messages:

  • OpenAI API key not found
  • CrewAIError: Missing required environment variable OPENAI_API_KEY
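
On the application side, the cleanest fix is to validate required variables at startup, before the pod serves any traffic. A minimal sketch; the variable list is an example, extend it with whatever your tools need:

// Sketch: fail at boot, not mid-request, when configuration is missing.
const REQUIRED_ENV = ["OPENAI_API_KEY", "NODE_ENV"];

for (const name of REQUIRED_ENV) {
  if (!process.env[name]) {
    // A pod that dies here fails its health checks immediately,
    // instead of serving traffic and crashing intermittently.
    throw new Error(`Missing required environment variable: ${name}`);
  }
}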

3) Race conditions on shared files or temp paths

If every replica writes to /tmp/output.json or the same local file path, workers overwrite each other.

// ❌ Broken
await fs.writeFile("/tmp/output.json", JSON.stringify(result));

// ✅ Fixed
import fs from "node:fs/promises";

// requestId must be unique per job (e.g. a UUID)
await fs.writeFile(`/tmp/output-${requestId}.json`, JSON.stringify(result));

This gets worse if your container filesystem is ephemeral and you expect persistence across restarts.
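
If you need scratch files at all, a unique temp directory per job is safer than hand-building filenames. A sketch using Node's fs.mkdtemp; writeJobOutput is an illustrative helper name:

// Sketch: one throwaway temp directory per job via fs.mkdtemp.
import fs from "node:fs/promises";
import os from "node:os";
import path from "node:path";

async function writeJobOutput(result: unknown): Promise<string> {
  // mkdtemp appends a random suffix, so concurrent replicas
  // and workers can never collide on a path.
  const dir = await fs.mkdtemp(path.join(os.tmpdir(), "crew-job-"));
  const file = path.join(dir, "output.json");
  await fs.writeFile(file, JSON.stringify(result));
  return file;
}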

4) Unbounded concurrency inside the agent loop

Scaling replicas plus high internal concurrency can overwhelm memory or rate limits.

// ❌ Broken
await Promise.all(jobs.map((job) => crew.run(job)));

// ✅ Fixed
import pLimit from "p-limit";

const limit = pLimit(2);
await Promise.all(jobs.map((job) => limit(() => crew.run(job))));

Symptoms include:

  • Worker terminated unexpectedly
  • OOMKilled
  • Provider throttling errors from your LLM SDK

How to Debug It

  1. Check whether the crash happens only after replica count increases

    • Run with one replica first.
    • If it passes at replicas=1 and fails at replicas=2+, suspect shared state or race conditions.
  2. Inspect the exact stack trace

    • Look for lines mentioning:
      • Agent
      • Task
      • Crew
      • tool callbacks
      • serialization functions like JSON.stringify
    • A stack trace that points into startup code usually means module-scope initialization is failing.
  3. Log object construction boundaries

    • Add logs around crew creation and execution.
    • If you see one object reused across many requests, that’s your bug.
console.log("building crew", requestId);
const crew = buildCrew();
console.log("running crew", requestId);
await crew.run();
  4. Temporarily disable parallelism

    • Run one worker.
    • Remove Promise.all.
    • Replace shared caches with plain local variables.
    • If the crash disappears, you’ve confirmed a concurrency bug (a sequential harness is sketched after this list).
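
The throwaway harness for step 4 can be as small as replacing Promise.all with a sequential loop. A sketch, assuming the jobs array and the buildCrew() factory from the earlier examples:

// Sketch: run jobs one at a time to isolate concurrency bugs.
let index = 0;
for (const job of jobs) {
  console.log(`running job ${index++} sequentially`);
  const crew = buildCrew(); // fresh instances, no shared caches
  await crew.run(job);
}
// If this never crashes but the Promise.all version does, the bug is
// shared state or unbounded concurrency, not the job payload itself.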

Prevention

  • Build CrewAI agents and crews through factory functions; do not keep them as long-lived singletons unless they are fully stateless.
  • Make every tool return plain JSON-safe data only.
  • Store secrets in deployment config and validate them at startup before serving traffic.
  • Test with the same scaling model you use in production (a local node:cluster smoke test is sketched after this list):
    • multiple pods
    • multiple Node workers
    • concurrent requests
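
For the multiple-Node-workers case, you can smoke-test locally with node:cluster before you ever touch Kubernetes. A minimal sketch; "./server.js" is a stand-in for your real entrypoint:

// Sketch: local multi-worker smoke test with node:cluster.
import cluster from "node:cluster";
import { cpus } from "node:os";

if (cluster.isPrimary) {
  // Fork a handful of workers so shared-state bugs surface
  // on your machine instead of in production.
  const workers = Math.min(4, cpus().length);
  for (let i = 0; i < workers; i++) {
    cluster.fork();
  }
} else {
  // Each worker runs the same entrypoint your pods would run.
  await import("./server.js"); // hypothetical entrypoint
}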

If you want a simple rule to follow: anything inside CrewAI that touches network clients, memory caches, files, or request context should be created per job. That removes most “deployment crash when scaling” failures before they reach production.

