How to Fix 'connection timeout when scaling' in CrewAI (TypeScript)
When CrewAI throws a “connection timeout when scaling” error, it usually means your agent runtime tried to spin up more workers, but the underlying network call never completed in time. In TypeScript projects, this often shows up during bursts of parallel task execution, remote tool calls, or when the process is waiting on an external API that is already near its timeout limit.
In practice, this is rarely a “CrewAI bug” in isolation. It’s usually a configuration problem: too much concurrency, a slow upstream service, or a client timeout that is shorter than the work being done.
The Most Common Cause
The #1 cause is unbounded concurrency during task scaling.
A lot of TypeScript code looks fine in local testing, then fails under load because every task is fired at once. CrewAI tries to scale workers, but your API client, proxy, or upstream service can’t keep up, and you get errors like:
- `Error: connection timeout when scaling`
- `TimeoutError: Request timed out while scaling workers`
- `CrewAIError: Failed to initialize worker pool`
| Broken | Fixed |
|---|---|
| Fires all tasks at once | Limits concurrency |
| No timeout control | Explicit timeout + retry |
| No backpressure | Queue-based execution |

Here’s the broken pattern:
```typescript
// Broken: unbounded parallelism
import { Crew } from "crewai";

const crew = new Crew({
  agents,
  tasks,
});

// Every task fires at once; nothing bounds worker scaling
const results = await Promise.all(
  tasks.map((task) => crew.execute(task))
);
```
```typescript
// Fixed: bounded concurrency with an explicit timeout
import pLimit from "p-limit";
import { Crew } from "crewai";

const crew = new Crew({
  agents,
  tasks,
});

const limit = pLimit(3); // keep scaling under control

// Reject tasks that exceed the deadline, and clear the timer so it
// cannot fire (or keep the process alive) after the task completes
const withTimeout = <T>(promise: Promise<T>, ms: number): Promise<T> =>
  new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("Task timeout")), ms);
    promise.then(resolve, reject).finally(() => clearTimeout(timer));
  });

const results = await Promise.all(
  tasks.map((task) => limit(() => withTimeout(crew.execute(task), 30_000)))
);
```
If you’re using a CrewAI wrapper that exposes worker scaling options, set them explicitly instead of letting defaults do the work:
```typescript
const crew = new Crew({
  agents,
  tasks,
  maxWorkers: 3,
  scaleTimeoutMs: 30000,
});
```
Other Possible Causes
1) Upstream API timeout is shorter than CrewAI’s scale window
If your LLM provider or tool API times out at 10 seconds but your crew waits 30 seconds for scaling, you’ll see connection failures that look like infrastructure issues.
```typescript
import OpenAI from "openai";

// Bad: upstream client times out too early
const shortClient = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  timeout: 10_000, // 10s -- shorter than the crew's scale window
});

// Better: align client timeout with expected task duration
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  timeout: 60_000,
});
```
2) Too many retries causing socket exhaustion
Retries are good until they stampede. If each failed worker spawn retries instantly, you can exhaust sockets and trigger “connection timeout when scaling” errors.
```typescript
// Bad: retry immediately in a tight loop -- failed attempts stack up
// across many tasks with no delay between them
for (let i = 0; i < 5; i++) {
  try {
    await crew.execute(task);
    break;
  } catch {
    // retry immediately, no backoff
  }
}

// Better: exponential backoff
import pRetry from "p-retry";

await pRetry(() => crew.execute(task), {
  retries: 3,
  minTimeout: 1000,
  factor: 2,
});
```
3) DNS / proxy / VPC networking issue
This happens a lot in enterprise environments. The code works on localhost but fails inside a container, VPN, or private subnet.
```bash
# Check whether your runtime can reach the provider endpoint
curl -v https://api.openai.com/v1/models
```

If you’re behind a proxy:

```bash
export HTTPS_PROXY=http://proxy.internal:8080
export HTTP_PROXY=http://proxy.internal:8080
```
Also verify your container has outbound access to:
- LLM provider endpoints
- tool APIs
- any internal service used by tools
4) Event loop blocked by CPU-heavy work
If your task handler does heavy synchronous work before or during scaling, the Node.js event loop gets stuck and network calls time out.
```typescript
// Bad: synchronous CPU work blocks the event loop, so pending
// network calls (including worker scaling requests) time out
function expensiveParse(data: string): number {
  // Tight synchronous loop; nothing yields back to the event loop
  let checksum = 0;
  for (let i = 0; i < data.length * 10_000; i++) {
    checksum = (checksum + data.charCodeAt(i % data.length)) | 0;
  }
  return checksum;
}
```
Move heavy computation off the main thread or into a separate job queue.
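Offloading can be sketched with Node’s built-in `worker_threads` module. This is a generic example, not a CrewAI API: the summation is a stand-in for whatever your real parsing does, and the worker body is inlined with `eval: true` only to keep the sketch self-contained (in a real project you would point `Worker` at a compiled file).

```typescript
import { Worker } from "node:worker_threads";

// Run CPU-heavy work in a worker thread so the main event loop
// keeps servicing network I/O (timeouts, keep-alives, scaling calls).
function sumInWorker(n: number): Promise<number> {
  const src = `
    const { parentPort, workerData } = require("node:worker_threads");
    let total = 0;
    for (let i = 1; i <= workerData; i++) total += i; // stand-in for heavy work
    parentPort.postMessage(total);
  `;
  return new Promise<number>((resolve, reject) => {
    const worker = new Worker(src, { eval: true, workerData: n });
    worker.once("message", resolve);
    worker.once("error", reject);
  });
}
```

While the worker grinds through the loop, the main thread stays free to answer network traffic, which is exactly what a blocked event loop prevents.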
How to Debug It

- Check whether the failure happens at high concurrency:
  - Run the same workflow with one task.
  - Then try 2, 3, and 5 concurrent tasks.
  - If it only fails above a threshold, you have a scaling/concurrency problem.
- Log exact timings around worker creation:
  - Measure how long `crew.execute()` takes before failure.
  - Separate “task start”, “worker init”, and “external API call” timings.

  ```typescript
  const start = Date.now();
  try {
    await crew.execute(task);
  } catch (err) {
    console.error("Failed after ms:", Date.now() - start);
    console.error(err);
  }
  ```

- Inspect upstream client config:
  - Check the request timeout.
  - Check the retry count.
  - Check whether keep-alive is enabled.
  - Make sure your LLM/tool client isn’t silently timing out earlier than CrewAI.
- Test the network path outside CrewAI:
  - Call the same endpoint with `curl` or a minimal Node script.
  - If that fails too, stop debugging CrewAI and fix networking first.
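A minimal Node probe could look like the sketch below. It assumes Node 18+ for the global `fetch` and `AbortSignal.timeout`; the endpoint in the usage comment is just an example.

```typescript
// Probe an endpoint directly, with the same timeout budget your crew
// uses, so you can compare latency entirely outside CrewAI.
async function probe(url: string, timeoutMs = 10_000): Promise<boolean> {
  const start = Date.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    console.log(`${url} -> HTTP ${res.status} in ${Date.now() - start}ms`);
    return true;
  } catch (err) {
    console.error(`${url} failed after ${Date.now() - start}ms:`, err);
    return false;
  }
}

// usage: probe("https://api.openai.com/v1/models");
```

If this script fails from inside your container but succeeds from your laptop, the problem is the network path, not CrewAI.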
Prevention

- Set explicit limits:
  - `maxWorkers`
  - request timeouts
  - a retry policy with backoff
- Avoid firing every task through `Promise.all()` unless you’ve bounded concurrency.
- Add startup checks for:
  - outbound network access
  - proxy settings
  - provider latency under load
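Those startup checks can be rolled into a preflight that runs before the crew starts. The `preflight` name, endpoint list, and latency threshold are illustrative assumptions, and Node 18+ is assumed for the global `fetch`.

```typescript
// Fail fast at boot if the runtime cannot reach its dependencies,
// instead of timing out later in the middle of a scaling burst.
async function preflight(
  endpoints: string[],
  maxLatencyMs = 5_000
): Promise<void> {
  // Surface proxy settings so misconfiguration shows up in logs
  console.log("HTTPS_PROXY:", process.env.HTTPS_PROXY ?? "(unset)");

  for (const url of endpoints) {
    const start = Date.now();
    try {
      await fetch(url, { signal: AbortSignal.timeout(maxLatencyMs) });
    } catch (err) {
      throw new Error(`Preflight failed for ${url}: ${err}`);
    }
    const elapsed = Date.now() - start;
    if (elapsed > maxLatencyMs / 2) {
      console.warn(`High latency to ${url}: ${elapsed}ms`);
    }
  }
}

// usage: await preflight(["https://api.openai.com/v1/models"]);
```

Running this in CI and at container start catches most networking and proxy problems before they masquerade as scaling failures.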
If you’re building production agent systems in TypeScript, treat scaling like any other distributed systems problem. Most “connection timeout when scaling” errors are not mysterious; they’re just uncontrolled parallelism meeting real network limits.
Keep learning

- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist + starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.