How to Fix 'intermittent 500 errors in production' in CrewAI (TypeScript)

By Cyprian Aarons · Updated 2026-04-21

Intermittent 500 errors in production usually mean your CrewAI workflow is failing under real load, not in local dev. In TypeScript, this often shows up when an agent call works sometimes, then blows up on retries, parallel requests, or slightly different inputs.

The key thing: a 500 from CrewAI is usually the symptom, not the root cause. The root cause is often bad tool wiring, non-deterministic state, rate limits, or unhandled exceptions bubbling out of an agent/task execution.

The Most Common Cause

The #1 cause I see is throwing raw exceptions inside tools or task handlers, then letting them escape the CrewAI runtime. In production, one bad request payload or one flaky downstream API call turns into an intermittent server error.

Here’s the broken pattern:

import { Agent, Task, Crew } from "crewai";

const fetchPolicyTool = {
  name: "fetch_policy",
  description: "Fetch policy details",
  execute: async (policyId: string) => {
    const res = await fetch(`https://api.internal/policies/${policyId}`);
    if (!res.ok) {
      throw new Error(`Policy API failed with ${res.status}`);
    }
    return await res.json();
  },
};

const agent = new Agent({
  name: "SupportAgent",
  role: "Customer support assistant",
  goal: "Resolve policy questions",
  tools: [fetchPolicyTool],
});

const task = new Task({
  description: "Look up policy {policyId} and summarize coverage",
  agent,
});

const crew = new Crew({
  agents: [agent],
  tasks: [task],
});

await crew.kickoff({ policyId: "" });

And the fixed pattern:

import { Agent, Task, Crew } from "crewai";

const fetchPolicyTool = {
  name: "fetch_policy",
  description: "Fetch policy details",
  execute: async (policyId: string) => {
    if (!policyId?.trim()) {
      return { ok: false, error: "policyId is required" };
    }

    try {
      const res = await fetch(`https://api.internal/policies/${encodeURIComponent(policyId)}`);

      if (!res.ok) {
        return {
          ok: false,
          error: `Policy API returned ${res.status}`,
        };
      }

      return { ok: true, data: await res.json() };
    } catch (err) {
      return {
        ok: false,
        error:
          err instanceof Error ? err.message : "Unknown network failure",
      };
    }
  },
};

const agent = new Agent({
  name: "SupportAgent",
  role: "Customer support assistant",
  goal: "Resolve policy questions",
  tools: [fetchPolicyTool],
});

const task = new Task({
  description:
    "Look up policy {policyId}. If lookup fails, explain the failure and stop.",
  agent,
});

const crew = new Crew({
  agents: [agent],
  tasks: [task],
});

await crew.kickoff({ policyId: "POL-12345" });

Why this matters:

  • Raw throws escape the tool boundary and surface as opaque, intermittent 500s.
  • Returning structured errors lets the agent handle failures deterministically.
  • Empty or malformed inputs should fail fast before hitting external APIs.

Other Possible Causes

| Cause | What it looks like | Fix |
| --- | --- | --- |
| Missing env vars | `TypeError: Cannot read properties of undefined` or a generic 500 | Validate config at startup |
| Rate limits / upstream timeouts | Works locally, fails under traffic | Add retries with backoff |
| Shared mutable state | Fails only when multiple requests run together | Make tool state request-scoped |
| Bad model/tool output parsing | `JSON.parse` errors or malformed task results | Validate and sanitize model output |

Missing environment variables

A common production-only issue is deploying without required keys.

// broken
const apiKey = process.env.POLICY_API_KEY!;
// fixed
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required env var ${name}`);
  return value;
}

const apiKey = requireEnv("POLICY_API_KEY");
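Extending `requireEnv`, you can check every required key at boot so a bad deploy fails immediately rather than intermittently under traffic. A sketch — the variable names are hypothetical:

```typescript
// Validate all required variables at startup; report every missing
// one at once instead of failing on the first request that needs it.
const REQUIRED_ENV = ["POLICY_API_KEY", "POLICY_API_URL"] as const;

function loadConfig(): Record<(typeof REQUIRED_ENV)[number], string> {
  const missing = REQUIRED_ENV.filter((name) => !process.env[name]?.trim());
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(", ")}`);
  }
  return Object.fromEntries(
    REQUIRED_ENV.map((name) => [name, process.env[name] as string])
  ) as Record<(typeof REQUIRED_ENV)[number], string>;
}
```

Call `loadConfig()` once at process start, before any CrewAI wiring, and pass the result down explicitly.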

Rate limits and upstream timeouts

If your tool calls a downstream service, transient failures will surface as intermittent 500s.

// fixed snippet
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastErr: unknown;

  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      await new Promise((r) => setTimeout(r, Math.pow(2, i) * 200));
    }
  }

  throw lastErr;
}
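Retries handle transient failures, but each attempt also needs a deadline so one hung upstream call can't stall a request forever. A minimal per-attempt timeout sketch using `Promise.race` — `withTimeout` is an illustrative helper, not part of CrewAI:

```typescript
// Race the real work against a rejection timer so a hung upstream
// call fails fast and becomes retryable instead of blocking forever.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, deadline]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Combined with the retry helper above, `withRetry(() => withTimeout(fetch(url), 5000))` turns slow calls into retryable failures instead of intermittent 500s.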

Shared mutable state

This breaks under concurrent requests.

// broken
let lastPolicyId = "";

const tool = {
  name: "track_policy",
  execute: async (policyId: string) => {
    lastPolicyId = policyId;
    return { policyIdUsedByToolLaterMaybeWrongly: lastPolicyId };
  },
};

Use request-local values only. Do not store per-request data in module globals.
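The fix is to pass per-request values through the call chain rather than module scope. A sketch with a hypothetical `RequestContext` type:

```typescript
// Per-request state travels as an argument, so concurrent requests
// can never observe each other's values.
interface RequestContext {
  requestId: string;
  policyId: string;
}

const trackPolicyTool = {
  name: "track_policy",
  execute: async (ctx: RequestContext) => {
    // No module-level variable: everything comes from this request's context
    return { requestId: ctx.requestId, policyId: ctx.policyId };
  },
};
```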

Bad output parsing

CrewAI tasks often fail when you assume perfect JSON from an LLM.

// broken
const parsed = JSON.parse(result);
// fixed
function safeParseJson(input: string) {
  try {
    return { ok: true as const, data: JSON.parse(input) };
  } catch {
    return { ok: false as const, error: "Invalid JSON from model" };
  }
}

How to Debug It

  1. Check the exact stack trace

    • Look for the first non-CrewAI frame.
    • If you see Error, TypeError, or SyntaxError from your tool code, that’s usually the real source.
    • Common messages include:
      • CrewAI task execution failed
      • Unhandled error in tool execution
      • 500 Internal Server Error
  2. Log inputs at the edge

    • Print task inputs before calling crew.kickoff().
    • Validate required fields like IDs, emails, dates, and tenant IDs.
    • Most “intermittent” issues are actually bad payloads making it through sometimes.
  3. Wrap every external dependency

    • Add logging around HTTP calls, DB reads, and queue calls.
    • Capture status codes and response bodies for non-2xx responses.
    • If the failure disappears when you mock a dependency locally, that dependency is your culprit.
  4. Remove concurrency temporarily

    • Run one request at a time.
    • Disable parallel task execution if you have it.
    • If failures stop, you likely have shared state or a race condition in a tool wrapper.
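Steps 2 and 3 boil down to one habit: wrap every dependency call so its outcome and latency are logged. A generic sketch — `traced` is an illustrative helper, not a CrewAI API:

```typescript
// Wrap any async dependency call with structured logging so a single
// production failure can be traced by label, outcome, and duration.
async function traced<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    console.log(JSON.stringify({ label, ok: true, ms: Date.now() - start }));
    return result;
  } catch (err) {
    console.error(
      JSON.stringify({
        label,
        ok: false,
        ms: Date.now() - start,
        error: err instanceof Error ? err.message : String(err),
      })
    );
    throw err; // rethrow so callers still see the failure
  }
}
```

Usage looks like `traced("policy-api", () => fetch(url))`: every HTTP call, DB read, and queue call gets one wrapper, and the logs tell you exactly which dependency failed and how long it took.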

Prevention

  • Validate all task inputs before calling CrewAI.
  • Never throw raw errors from tools unless you intentionally want the whole run to fail.
  • Keep tools stateless and request-scoped.
  • Add retries only around transient dependencies like HTTP APIs and queues.
  • Log structured errors with request IDs so you can trace one production failure end to end.

By Cyprian Aarons, AI Consultant at Topiax.