How to Fix 'intermittent 500 errors when scaling' in CrewAI (TypeScript)

By Cyprian Aarons · Updated 2026-04-21

What this error usually means

“Intermittent 500 errors when scaling” in CrewAI TypeScript usually means your agent pipeline works at low concurrency, then starts failing once you add parallel requests, more workers, or a bigger batch size. The failures are often non-deterministic because they come from shared state, rate limits, or request fan-out under load.

In practice, you’ll see things like Internal Server Error, Failed to execute crew, or a wrapped provider error coming out of Crew.kickoff() after traffic increases.

The Most Common Cause

The #1 cause is shared mutable state inside agents, tasks, or tools. In TypeScript, people often create one Crew, one Agent, or one tool instance and reuse it across concurrent requests with per-request data stuffed into object properties.

That works until two requests overlap and overwrite each other.

Broken vs fixed pattern

Broken pattern → Fixed pattern

  • Reuse the same mutable agent/tool instance across requests → Create request-scoped instances inside the handler
  • Store request data on class fields → Pass data through task inputs/context
  • Share one client with hidden state → Use stateless clients or clone per request

// BROKEN
import { Crew, Agent, Task } from "crewai";

const agent = new Agent({
  role: "Analyst",
  goal: "Summarize customer cases",
});

const crew = new Crew({
  agents: [agent],
  tasks: [],
});

class CaseTool {
  customerId = "";

  async execute() {
    return `Processing ${this.customerId}`;
  }
}

const tool = new CaseTool();

export async function handler(req: Request) {
  const body = await req.json();
  tool.customerId = body.customerId; // shared mutable state

  const task = new Task({
    description: "Summarize the case",
    agent,
    tools: [tool],
  });

  crew.tasks = [task];
  return await crew.kickoff(); // intermittent 500s under load
}
// FIXED
import { Crew, Agent, Task } from "crewai";

class CaseTool {
  constructor(private readonly customerId: string) {}

  async execute() {
    return `Processing ${this.customerId}`;
  }
}

export async function handler(req: Request) {
  const body = await req.json();

  const agent = new Agent({
    role: "Analyst",
    goal: "Summarize customer cases",
  });

  const tool = new CaseTool(body.customerId);

  const task = new Task({
    description: `Summarize the case for customer ${body.customerId}`,
    agent,
    tools: [tool],
  });

  const crew = new Crew({
    agents: [agent],
    tasks: [task],
  });

  return await crew.kickoff();
}

If you’re using serverless functions, this matters even more. Warm instances keep module-level objects alive between invocations, so “global” state becomes cross-request state.
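
A minimal sketch of that failure mode, using a hypothetical handler rather than CrewAI itself: module-level state written by one invocation and read back after an await can be overwritten by a concurrent invocation on the same warm instance.

// Module scope survives between invocations on a warm serverless instance.
let lastCustomerId = "";

export async function handler(req: Request) {
  const body = await req.json();
  lastCustomerId = body.customerId; // request A writes...

  await new Promise(res => setTimeout(res, 100)); // ...then yields to request B

  // Request B may have overwritten the module-level value by now.
  return new Response(`Processed ${lastCustomerId}`);
}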

Other Possible Causes

1. Rate limiting from your LLM provider

When you scale requests, provider throttling often shows up as random-looking server failures.

// Example symptom in logs
Error: OpenAI API error: Rate limit reached for gpt-4o

Fix it with retries and backoff around the model call path, not around the entire HTTP request.

const delay = (ms: number) => new Promise(res => setTimeout(res, ms));

async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Exponential backoff: 200ms, 400ms, 800ms. No wait after the final attempt.
      if (i < attempts - 1) await delay(200 * 2 ** i);
    }
  }
  throw lastErr;
}
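
Then wrap the narrowest call that actually hits the provider. In the fixed handler from earlier, that is the kickoff itself:

// Retry only the model-call path, not the whole HTTP request.
const result = await withRetry(() => crew.kickoff());

One refinement worth considering: inspect the error and only retry throttling or transient failures (429s and 5xx), so validation errors fail fast instead of burning all attempts.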

2. Non-idempotent side effects inside tasks

If a task writes to a DB, sends a webhook, or mutates a queue before the run completes, retries can create duplicate work and downstream failures.

// BAD
await db.cases.update({ id }, { status: "processed" });
await crew.kickoff();

Move side effects to a post-success step and make them idempotent.

const result = await crew.kickoff();
await db.cases.upsert({
  where: { id },
  update: { status: "processed" },
  create: { id, status: "processed" }, // covers the first write (Prisma-style upsert)
});
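
The same idea covers webhooks: derive a stable idempotency key from the work item so a retried run collapses into a single delivery. A sketch, where webhookUrl and the Idempotency-Key header are assumptions that only help if the receiver dedupes on the key:

// Hypothetical outbound webhook, deduplicated by a key derived from the case.
// A retried run sends the same key, so the receiver can drop the duplicate.
const idempotencyKey = `case-${id}-processed`;

await fetch(webhookUrl, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Idempotency-Key": idempotencyKey,
  },
  body: JSON.stringify({ caseId: id, status: "processed" }),
});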

3. Too much parallelism in your own code

People wrap crew.kickoff() in Promise.all() and accidentally create an internal thundering herd.

// BAD
await Promise.all(
  items.map(item => processItem(item))
);

Throttle concurrency.

import pLimit from "p-limit";

const limit = pLimit(5);
await Promise.all(items.map(item => limit(() => processItem(item))));
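
If you'd rather not pull in a dependency, the same throttle is a few lines of plain TypeScript. A minimal sketch: a fixed pool of workers drains a shared index.

async function mapLimited<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker keeps claiming the next index until the list is drained.
  // Claiming happens synchronously between awaits, so there is no race.
  async function worker() {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}

const summaries = await mapLimited(items, 5, processItem);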

4. Bad timeout configuration

A request may not fail immediately; it may hit your platform timeout first and surface as a generic 500.

// Example config smell: a 10-second platform cap (e.g. Vercel's maxDuration)
export const maxDuration = 10;

Increase timeouts where appropriate and set explicit time budgets around model calls.

const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 30000);

try {
  await fetch(url, { signal: controller.signal });
} finally {
  clearTimeout(timeout);
}
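
If you can't thread an abort signal into crew.kickoff(), a Promise.race time budget still converts a silent platform kill into an explicit, loggable failure. A minimal sketch, assuming a 30-second budget:

function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer!: ReturnType<typeof setTimeout>;
  const timedOut = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timedOut]).finally(() => clearTimeout(timer));
}

const result = await withTimeout(crew.kickoff(), 30_000);

Note that this stops waiting rather than cancelling the underlying run, but it fails your handler with a useful error instead of the platform's opaque 500.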

How to Debug It

  1. Check whether failures correlate with concurrency

    • Run the same input once.
    • Then run it with 10, 50, and 100 parallel requests.
    • If failure rate rises with load, suspect shared state or rate limits first (a quick sweep script is sketched after this list).
  2. Log request-scoped identifiers

    • Add requestId, customerId, and taskId to every log line (a minimal logger is sketched after this list).
    • If logs from one request appear in another request’s execution path, you have object reuse problems.
  3. Inspect wrapped errors from CrewAI

    • Don’t stop at Internal Server Error.
    • Print the full stack and nested cause:
    try {
      await crew.kickoff();
    } catch (err) {
      // Log the full error object, then surface any nested cause (ES2022 Error.cause).
      console.error("CrewAI kickoff failed", err);
      if (err instanceof Error && err.cause) {
        console.error("caused by:", err.cause);
      }
      throw err;
    }
    
    • Look for provider messages like:
      • Rate limit reached
      • context length exceeded
      • Cannot read properties of undefined
      • Failed to execute crew
  4. Disable everything except one task

    • Remove tools.
    • Remove memory.
    • Remove parallel branches.
    • If the error disappears, reintroduce components one by one until it returns.
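
For step 1, a quick concurrency sweep is usually enough to see whether the failure rate tracks load. A sketch, where ENDPOINT and the request payload are placeholders for your own route:

const ENDPOINT = "http://localhost:3000/api/summarize"; // placeholder route

async function sweep(concurrency: number) {
  const results = await Promise.allSettled(
    Array.from({ length: concurrency }, () =>
      fetch(ENDPOINT, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ customerId: "case-123" }),
      })
    )
  );
  // Count network failures and non-2xx responses together.
  const failed = results.filter(
    r => r.status === "rejected" || !r.value.ok
  ).length;
  console.log(`concurrency=${concurrency} failures=${failed}/${concurrency}`);
}

for (const n of [1, 10, 50, 100]) {
  await sweep(n);
}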
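
For step 2, a tiny request-scoped logger makes bleed-through easy to spot: if one request's requestId shows up in another request's execution path, two requests are sharing an object. A sketch (crypto.randomUUID() needs Node 19+ or a modern browser runtime):

function makeLogger(requestId: string, customerId: string) {
  // Every line carries the request-scoped IDs, so interleaved logs stay attributable.
  return (msg: string, extra: Record<string, unknown> = {}) =>
    console.log(JSON.stringify({ requestId, customerId, msg, ...extra }));
}

export async function handler(req: Request) {
  const body = await req.json();
  const log = makeLogger(crypto.randomUUID(), body.customerId);

  log("kickoff:start");
  // ...build the request-scoped crew here and pass `log` into its tools...
  log("kickoff:done");
}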

Prevention

  • Keep agents, crews, and tools request-scoped unless they are truly stateless.
  • Add concurrency limits at the API edge and around internal batch jobs.
  • Treat LLM calls as unreliable I/O:
    • retries with backoff
    • timeouts
    • idempotent downstream writes

If you’re seeing intermittent 500 errors only after scaling CrewAI TypeScript workloads, start by hunting shared mutable state. In most real systems that’s the bug hiding behind everything else.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
