How to Fix 'connection timeout in production' in AutoGen (TypeScript)
A connection timeout in production error in AutoGen TypeScript usually means your agent tried to call an LLM, tool, or remote service and never got a response back before the network stack gave up. In practice, this shows up when your production runtime has stricter egress rules, shorter timeouts, bad DNS, or an API endpoint that is reachable locally but not from the deployed environment.
The key thing: this is usually not an AutoGen logic bug. It’s almost always a networking or configuration problem around OpenAIChatCompletionClient, your tool calls, or the infrastructure sitting between your agent and the model provider.
The Most Common Cause
The #1 cause I see is using default HTTP timeouts that are fine on localhost but too short for production traffic, cold starts, or slower model responses.
In AutoGen TypeScript, the failure often surfaces as something like:
- `Error: request timed out`
- `Connection timeout`
- `FetchError: network timeout at: https://api.openai.com/...`
- an `OpenAIChatCompletionClient` request hanging until your server kills it
Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Uses default client settings with no explicit timeout | Sets a production-safe timeout and retries |
| Lets serverless platform kill the request first | Aligns app timeout with infra timeout |
| No observability around outbound calls | Logs request duration and failure reason |
```typescript
// ❌ Broken
import { OpenAIChatCompletionClient } from "@autogen/openai";

const modelClient = new OpenAIChatCompletionClient({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_API_KEY,
});
```

```typescript
// ✅ Fixed
import { OpenAIChatCompletionClient } from "@autogen/openai";

const modelClient = new OpenAIChatCompletionClient({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_API_KEY,
  // Use explicit production-friendly limits
  timeout: 60_000,
  maxRetries: 3,
});
```
If you’re running behind a platform like Vercel, Cloud Run, ECS, or a corporate proxy, also make sure the platform timeout is longer than the model call. If your app times out at 30 seconds and your LLM client waits 60 seconds, you’ll still fail.
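One way to keep those numbers aligned is to derive the client timeout from the platform's request budget, so your code always gives up (and logs a useful error) before the platform kills the request. A minimal sketch, assuming a 60-second platform limit; `PLATFORM_TIMEOUT_MS`, `HEADROOM_MS`, and `clientTimeoutMs` are hypothetical names, not an AutoGen API:

```typescript
// Hypothetical helper: derive the LLM client timeout from the platform's
// request budget so the app always fails before the platform does.
const PLATFORM_TIMEOUT_MS = 60_000; // e.g. your Cloud Run / ALB limit
const HEADROOM_MS = 5_000;          // time left to log and return an error

export function clientTimeoutMs(
  platformTimeoutMs: number = PLATFORM_TIMEOUT_MS,
  headroomMs: number = HEADROOM_MS,
): number {
  const budget = platformTimeoutMs - headroomMs;
  if (budget <= 0) {
    throw new Error("Platform timeout too small to leave any headroom");
  }
  return budget;
}
```

You can then pass the result as the client's `timeout` instead of hard-coding a number that silently drifts out of sync with your infrastructure.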
Other Possible Causes
1) Wrong outbound network policy in production
Your local machine has open internet access. Your production VPC may not.
```yaml
# Example Kubernetes NetworkPolicy issue
egress:
  - to:
      - namespaceSelector:
          matchLabels:
            name: internal-only
```
If `api.openai.com` isn't allowed through NAT or egress rules, you'll get timeouts instead of clean HTTP errors.
2) Missing or bad DNS resolution
This looks like a timeout, but the real issue is name resolution failing inside the container.
```shell
nslookup api.openai.com
curl -v https://api.openai.com/v1/models
```
If DNS is broken in your pod or container runtime, AutoGen never reaches the provider.
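A quick way to tell a broken resolver apart from a slow network path, from inside the same container the agent runs in, is a small preflight probe. This is a generic Node.js sketch (Node 18+ for built-in `fetch` and `AbortSignal.timeout`), not an AutoGen API; `probe` and `classifyError` are hypothetical helper names:

```typescript
import { lookup } from "node:dns/promises";

// Hypothetical preflight helper: separates "DNS is broken" from
// "DNS works but the request times out".
export type ProbeResult = "ok" | "dns-failure" | "timeout" | "other-error";

export function classifyError(err: unknown): ProbeResult {
  const e = err as { code?: string; name?: string };
  if (e?.code === "ENOTFOUND" || e?.code === "EAI_AGAIN") return "dns-failure";
  if (e?.name === "AbortError" || e?.name === "TimeoutError" || e?.code === "ETIMEDOUT") {
    return "timeout";
  }
  return "other-error";
}

export async function probe(host: string, timeoutMs = 5_000): Promise<ProbeResult> {
  try {
    await lookup(host); // fails fast when the resolver is broken
    // Any HTTP response (even a 4xx) proves the network path works.
    await fetch(`https://${host}/`, {
      method: "HEAD",
      signal: AbortSignal.timeout(timeoutMs),
    });
    return "ok";
  } catch (err) {
    return classifyError(err);
  }
}
```

Running `probe("api.openai.com")` on startup gives you a single log line that tells you which layer to dig into.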
3) Proxy misconfiguration
A proxy can work for browser traffic and fail for Node.js fetch requests.
```shell
HTTP_PROXY=http://proxy.internal:8080
HTTPS_PROXY=http://proxy.internal:8080
NO_PROXY=localhost,127.0.0.1,.internal
```
If your proxy requires auth or TLS inspection, Node may hang until timeout unless the proxy is configured correctly for server-side requests.
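Note that Node's built-in `fetch` does not read `HTTP_PROXY`/`HTTPS_PROXY` on its own, which is one reason proxied traffic can work in a browser but hang in a server process. Proxy-aware HTTP clients implement the `NO_PROXY` matching themselves; here is a sketch of the typical rule, where entries starting with a dot (like `.internal`) match any subdomain. `shouldBypassProxy` is a hypothetical helper, not an AutoGen API:

```typescript
// Typical NO_PROXY matching: exact host entries match that host only;
// leading-dot entries (".internal") match any host with that suffix.
export function shouldBypassProxy(hostname: string, noProxy: string): boolean {
  const host = hostname.toLowerCase();
  return noProxy
    .split(",")
    .map((entry) => entry.trim().toLowerCase())
    .filter((entry) => entry.length > 0)
    .some((entry) =>
      entry.startsWith(".")
        ? host.endsWith(entry) // suffix rule: ".internal"
        : host === entry,      // exact host match
    );
}
```

If a request that should go direct is being sent through the proxy (or vice versa), this matching logic is usually where the mismatch lives.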
4) Tool call hangs inside your agent loop
Sometimes AutoGen is fine; one of your tools is not.
```typescript
const slowTool = async () => {
  // ❌ Hangs forever on external API call
  return await fetch("https://internal-api.example.com/report");
};
```
If a tool never returns, the whole agent run can look like an LLM connection problem. Put timeouts around every external call.
```typescript
const safeTool = async () => {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 10_000);
  try {
    const res = await fetch("https://internal-api.example.com/report", {
      signal: controller.signal,
    });
    return await res.json();
  } finally {
    clearTimeout(timer);
  }
};
```
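Rather than repeating that `AbortController` boilerplate in every tool, you can wrap any promise-returning call in one generic deadline helper. A sketch under the same idea; `withTimeout` is a hypothetical name, not an AutoGen API:

```typescript
// Generic deadline wrapper: rejects if the wrapped promise takes too long,
// so one hung tool can't stall the whole agent run.
export async function withTimeout<T>(
  work: Promise<T>,
  ms: number,
  label = "operation",
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
  });
  try {
    return await Promise.race([work, deadline]);
  } finally {
    clearTimeout(timer); // don't keep the event loop alive after success
  }
}
```

Usage: `await withTimeout(fetch(url), 10_000, "report fetch")`. Note this rejects your side of the call but does not cancel the underlying request; combine it with `AbortController` when you also need to free the connection.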
How to Debug It
- Confirm whether it's the model call or a tool call
  - Temporarily disable tools and run a single chat completion.
  - If `OpenAIChatCompletionClient` still times out, it's networking/model access.
  - If only agent runs with tools fail, inspect each tool individually.
- Test connectivity from the same runtime
  - SSH into the box, exec into the pod, or open a shell in the container.
  - Run: `curl -v https://api.openai.com/v1/models`
  - If this hangs or fails, AutoGen is not the root cause.
- Add timing logs around every outbound request
  - Measure start/end time for model calls and tool calls.
  - Log whether you hit:
    - DNS lookup delay
    - TLS handshake delay
    - first byte delay
    - full response delay
- Check platform timeouts and retry behavior
  - Compare:
    - app/serverless timeout
    - reverse proxy timeout
    - load balancer idle timeout
    - AutoGen client timeout
  - The smallest one wins.
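The timing-log step above can be sketched as a wrapper that records duration and failure class for every outbound call. `timed` and `CallLog` are hypothetical names, not an AutoGen API:

```typescript
// Wrap any outbound call so its duration and failure class get logged,
// whether it's a model request or a tool call.
export interface CallLog {
  name: string;
  durationMs: number;
  ok: boolean;
  errorClass?: string;
}

export async function timed<T>(
  name: string,
  call: () => Promise<T>,
  log: (entry: CallLog) => void = (e) => console.log(JSON.stringify(e)),
): Promise<T> {
  const start = Date.now();
  try {
    const result = await call();
    log({ name, durationMs: Date.now() - start, ok: true });
    return result;
  } catch (err) {
    log({
      name,
      durationMs: Date.now() - start,
      ok: false,
      errorClass: (err as Error)?.constructor?.name ?? "Unknown",
    });
    throw err; // preserve the original failure for the caller
  }
}
```

Usage: `await timed("openai.chat", () => modelClient.create(messages))`. With one log line per call, "which dependency ate the budget" stops being guesswork.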
Prevention
- Set explicit timeouts on every external dependency:
  - LLM client
  - HTTP tools
  - database queries that feed agent context
- Treat outbound network access as part of deployment validation:
  - test DNS
  - test egress/NAT
  - test proxy auth if present
- Add structured logs for agent runs:
  - request ID
  - model name
  - tool name
  - duration
  - error class/message
If you’re seeing connection timeout in production with AutoGen, don’t start by rewriting your agent logic. Start by checking egress, DNS, proxies, and timeouts around OpenAIChatCompletionClient and any custom tools. In most cases, one of those four is the real failure point.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.