How to Fix 'intermittent 500 errors in production' in CrewAI (TypeScript)
Intermittent 500 errors in production usually mean your CrewAI workflow is failing under real load, not in local dev. In TypeScript, this often shows up when an agent call works sometimes, then blows up on retries, parallel requests, or slightly different inputs.
The key thing: a 500 from CrewAI is usually the symptom, not the root cause. The root cause is often bad tool wiring, non-deterministic state, rate limits, or unhandled exceptions bubbling out of an agent/task execution.
The Most Common Cause
The #1 cause I see is throwing raw exceptions inside tools or task handlers, then letting them escape the CrewAI runtime. In production, one bad request payload or one flaky downstream API call turns into an intermittent server error.
Here’s the broken pattern:
import { Agent, Task, Crew } from "crewai";

const fetchPolicyTool = {
  name: "fetch_policy",
  description: "Fetch policy details",
  // No input validation: an empty policyId is sent straight to the API.
  execute: async (policyId: string) => {
    const res = await fetch(`https://api.internal/policies/${policyId}`);
    if (!res.ok) {
      // This throw escapes the tool and fails the whole run.
      throw new Error(`Policy API failed with ${res.status}`);
    }
    return await res.json();
  },
};

const agent = new Agent({
  name: "SupportAgent",
  role: "Customer support assistant",
  goal: "Resolve policy questions",
  tools: [fetchPolicyTool],
});

const task = new Task({
  description: "Look up policy {policyId} and summarize coverage",
  agent,
});

const crew = new Crew({
  agents: [agent],
  tasks: [task],
});

// An empty policyId triggers the failure path above.
await crew.kickoff({ policyId: "" });
And the fixed pattern:
import { Agent, Task, Crew } from "crewai";

// Named error type for failures that should abort the run on purpose
// (not thrown anywhere in this snippet).
class PolicyLookupError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "PolicyLookupError";
  }
}

const fetchPolicyTool = {
  name: "fetch_policy",
  description: "Fetch policy details",
  execute: async (policyId: string) => {
    // Fail fast on empty input before touching the network.
    if (!policyId?.trim()) {
      return { ok: false, error: "policyId is required" };
    }

    try {
      const res = await fetch(`https://api.internal/policies/${encodeURIComponent(policyId)}`);
      if (!res.ok) {
        // Non-2xx responses become a structured error, not a throw.
        return {
          ok: false,
          error: `Policy API returned ${res.status}`,
        };
      }
      return { ok: true, data: await res.json() };
    } catch (err) {
      // Network failures and invalid JSON bodies end up here.
      return {
        ok: false,
        error:
          err instanceof Error ? err.message : "Unknown network failure",
      };
    }
  },
};

const agent = new Agent({
  name: "SupportAgent",
  role: "Customer support assistant",
  goal: "Resolve policy questions",
  tools: [fetchPolicyTool],
});

const task = new Task({
  description:
    "Look up policy {policyId}. If lookup fails, explain the failure and stop.",
  agent,
});

const crew = new Crew({
  agents: [agent],
  tasks: [task],
});

await crew.kickoff({ policyId: "POL-12345" });
Why this matters:
- A raw throw that escapes a tool fails the whole run, which the caller sees as a 500.
- Returning structured errors lets the agent handle failures deterministically.
- Empty or malformed inputs should fail fast before hitting external APIs.
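If you have many tools, repeating the try/catch in each one gets tedious. The sketch below shows one way to apply the same structured-error pattern with a small wrapper. ToolResult and safeTool are hypothetical names introduced here for illustration; they are not part of any CrewAI package, and the wrapped lookup is a stand-in for a real tool function.

```typescript
// Hypothetical result shape; not part of any CrewAI API.
type ToolResult<T> =
  | { ok: true; data: T }
  | { ok: false; error: string };

// Wraps an async tool function so that a thrown error is converted
// into a structured failure instead of escaping to the caller.
function safeTool<A, T>(
  fn: (arg: A) => Promise<T>,
): (arg: A) => Promise<ToolResult<T>> {
  return async (arg: A): Promise<ToolResult<T>> => {
    try {
      return { ok: true, data: await fn(arg) };
    } catch (err) {
      return {
        ok: false,
        error: err instanceof Error ? err.message : "Unknown failure",
      };
    }
  };
}

// Usage: a stand-in lookup that throws on empty input.
const lookup = safeTool(async (policyId: string) => {
  if (!policyId.trim()) throw new Error("policyId is required");
  return { policyId, status: "active" };
});
```

Because the result is a discriminated union on ok, TypeScript forces callers to check ok before reading data, which makes it harder to ignore the failure path by accident.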
Other Possible Causes
| Cause | What it looks like | Fix |
|---|---|---|
| Missing env vars | TypeError: Cannot read properties of undefined or a generic 500 | Validate config at startup |
| Rate limits / upstream timeouts | Works locally, fails under traffic | Add retries with backoff |
| Shared mutable state | Fails only when multiple requests run together | Make tool state request-scoped |
| Bad model/tool output parsing | JSON.parse errors or malformed task results | Validate and sanitize model output |
Missing environment variables
A common production-only issue is deploying without required keys.
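// broken: the non-null assertion only silences the compiler; a missing
// variable still arrives as undefined at runtime
const apiKey = process.env.POLICY_API_KEY!;

// fixed: fail at startup with a clear message
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required env var ${name}`);
  return value;
}

const apiKey = requireEnv("POLICY_API_KEY");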
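When a service needs several variables, checking them one at a time means a bad deploy fails once per missing name. A minimal sketch of validating everything in one pass follows. The loadConfig helper is hypothetical, and it takes the environment as a parameter so it can be exercised without touching the real process environment.

```typescript
// Reads every required variable in one pass and reports all missing names
// together, so a bad deploy fails once at boot with a complete message.
function loadConfig<K extends string>(
  env: Record<string, string | undefined>,
  required: readonly K[],
): Record<K, string> {
  const missing: string[] = [];
  const config = {} as Record<K, string>;

  for (const name of required) {
    const value = env[name];
    if (value === undefined || value.trim() === "") {
      missing.push(name);
    } else {
      config[name] = value;
    }
  }

  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(", ")}`);
  }
  return config;
}
```

In a Node.js service you would call this once at startup, for example loadConfig(process.env, ["POLICY_API_KEY"] as const), before creating any agents or accepting traffic.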
Rate limits and upstream timeouts
If your tool calls a downstream service, transient failures will surface as intermittent 500s.
// fixed snippet: retry with exponential backoff (200 ms, 400 ms, ...)
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Skip the wait after the final attempt; there is nothing left to retry.
      if (i < attempts - 1) {
        await new Promise((r) => setTimeout(r, 2 ** i * 200));
      }
    }
  }
  throw lastErr;
}
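The withRetry helper above retries every failure, including ones that will never succeed, such as a 404. One possible refinement, sketched below, is to retry only on transient statuses and to add a cap and jitter to the delay so concurrent requests do not retry in lockstep. The status list, default values, and both function names are assumptions to tune for the services you call.

```typescript
// Statuses usually worth retrying: throttling and temporary upstream faults.
// This list is an assumption; adjust it to the services you actually call.
const RETRYABLE_STATUSES = new Set([408, 429, 502, 503, 504]);

function isRetryableStatus(status: number): boolean {
  return RETRYABLE_STATUSES.has(status);
}

// Exponential backoff with a cap and "full jitter": the delay is a random
// value between 0 and min(capMs, baseMs * 2^attempt).
// The random source is injectable so the function can be tested.
function backoffDelayMs(
  attempt: number,
  baseMs = 200,
  capMs = 5000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Inside a retry loop, you would check isRetryableStatus(res.status) before deciding to try again, and wait backoffDelayMs(i) milliseconds between attempts.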
Shared mutable state
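Module-level variables are shared by every request in the process, so this pattern breaks under concurrent requests.
// broken: lastPolicyId is overwritten by whichever request ran last
let lastPolicyId = "";

const tool = {
  name: "track_policy",
  execute: async (policyId: string) => {
    lastPolicyId = policyId;
    return { policyIdUsedByToolLaterMaybeWrongly: lastPolicyId };
  },
};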
Use request-local values only. Do not store per-request data in module globals.
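One way to keep state request-local is to build the tool inside a factory that is called once per request. The sketch below assumes the same plain object shape used for tools in this article. The createPolicyTracker name, the PolicyTrackerTool interface, and the requestId parameter are hypothetical.

```typescript
// Hypothetical tool shape, mirroring the object literals used in this article.
interface PolicyTrackerTool {
  name: string;
  execute: (policyId: string) => Promise<{ requestId: string; seen: string[] }>;
}

// Each call returns a tool with its own private state, so two concurrent
// requests can never read or overwrite each other's policy IDs.
function createPolicyTracker(requestId: string): PolicyTrackerTool {
  const seen: string[] = []; // captured by this tool instance only
  return {
    name: "track_policy",
    execute: async (policyId: string) => {
      seen.push(policyId);
      // Return a copy so callers cannot mutate the internal list.
      return { requestId, seen: [...seen] };
    },
  };
}
```

The trade-off is that agents and crews that reference the tool also need to be created per request, rather than once at module load.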
Bad output parsing
CrewAI tasks often fail when you assume perfect JSON from an LLM.
// broken
const parsed = JSON.parse(result);

// fixed
function safeParseJson(input: string) {
  try {
    return { ok: true as const, data: JSON.parse(input) };
  } catch {
    return { ok: false as const, error: "Invalid JSON from model" };
  }
}
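Parsing without throwing is only half of the fix: syntactically valid JSON can still have the wrong shape. A minimal sketch of shape validation with a type guard follows. CoverageSummary is a hypothetical output shape chosen for illustration, and the brace-slicing step is a simple heuristic for models that wrap JSON in extra text, not a full parser.

```typescript
// Hypothetical shape the task is expected to return.
interface CoverageSummary {
  policyId: string;
  covered: boolean;
}

// Type guard: checks the parsed value field by field instead of trusting it.
function isCoverageSummary(value: unknown): value is CoverageSummary {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v["policyId"] === "string" && typeof v["covered"] === "boolean";
}

function parseCoverageSummary(
  raw: string,
): { ok: true; data: CoverageSummary } | { ok: false; error: string } {
  // Keep only the text between the first "{" and the last "}" so that
  // surrounding prose or code fences from the model are ignored.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) {
    return { ok: false, error: "Invalid JSON from model" };
  }

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return { ok: false, error: "Invalid JSON from model" };
  }

  if (!isCoverageSummary(parsed)) {
    return { ok: false, error: "JSON did not match the expected shape" };
  }
  return { ok: true, data: parsed };
}
```

For larger schemas, a validation library would likely be easier to maintain than hand-written guards; the point here is only that the check happens before the value is used.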
How to Debug It
- Check the exact stack trace
  - Look for the first non-CrewAI frame.
  - If you see Error, TypeError, or SyntaxError from your tool code, that’s usually the real source.
  - Common messages include:
    - CrewAI task execution failed
    - Unhandled error in tool execution
    - 500 Internal Server Error
- Log inputs at the edge
  - Print task inputs before calling crew.kickoff().
  - Validate required fields like IDs, emails, dates, and tenant IDs.
  - Most “intermittent” issues are actually bad payloads making it through sometimes.
- Wrap every external dependency
  - Add logging around HTTP calls, DB reads, and queue calls.
  - Capture status codes and response bodies for non-2xx responses.
  - If the failure disappears when you mock a dependency locally, that dependency is your culprit.
- Remove concurrency temporarily
  - Run one request at a time.
  - Disable parallel task execution if you have it.
  - If failures stop, you likely have shared state or a race condition in a tool wrapper.
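The input-validation step can be sketched as a small function that runs before crew.kickoff(). The KickoffInputs shape, the tenantId field, and the POL-12345 style ID format are assumptions based on the examples in this article; replace them with your real input contract.

```typescript
// Hypothetical input shape for the policy lookup task in this article.
interface KickoffInputs {
  policyId?: string;
  tenantId?: string;
}

// Returns a list of problems instead of throwing, so the HTTP layer can
// log them and answer with a 400 rather than letting a 500 surface later.
function validateKickoffInputs(inputs: KickoffInputs): string[] {
  const problems: string[] = [];

  const policyId = inputs.policyId?.trim() ?? "";
  if (policyId === "") {
    problems.push("policyId is required");
  } else if (!/^POL-\d+$/.test(policyId)) {
    // Assumed ID format, taken from the "POL-12345" example.
    problems.push("policyId must look like POL-12345");
  }

  const tenantId = inputs.tenantId?.trim() ?? "";
  if (tenantId === "") {
    problems.push("tenantId is required");
  }

  return problems;
}
```

If the returned list is not empty, log it together with the raw payload and reject the request without starting the crew. Over time, those logs show which callers are sending the bad payloads.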
Prevention
- Validate all task inputs before calling CrewAI.
- Never throw raw errors from tools unless you intentionally want the whole run to fail.
- Keep tools stateless and request-scoped.
- Add retries only around transient dependencies like HTTP APIs and queues.
- Log structured errors with request IDs so you can trace one production failure end to end.
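As a rough illustration of structured error logging, the sketch below turns any caught value into a flat record that carries the request ID. The field names and the toErrorLog helper are hypothetical; adapt them to whatever your logging pipeline expects.

```typescript
// Hypothetical log record; field names are illustrative only.
interface ErrorLogEntry {
  level: "error";
  requestId: string;
  step: string;
  errorName: string;
  message: string;
  timestamp: string;
}

// Normalizes any caught value (Error or not) into one flat record.
// The clock is injectable so the output is reproducible in tests.
function toErrorLog(
  requestId: string,
  step: string,
  err: unknown,
  now: Date = new Date(),
): ErrorLogEntry {
  return {
    level: "error",
    requestId,
    step,
    errorName: err instanceof Error ? err.name : "NonError",
    message: err instanceof Error ? err.message : String(err),
    timestamp: now.toISOString(),
  };
}
```

Writing JSON.stringify(toErrorLog(requestId, "fetch_policy", err)) as a single line per failure lets you filter production logs by requestId and follow one request across tools.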
By Cyprian Aarons, AI Consultant at Topiax.