# How to Fix 'connection timeout in production' in LangGraph (TypeScript)
## What this error means
A `connection timeout` in a production LangGraph deployment usually means your graph started fine locally, but the production runtime could not reach a dependency within the default timeout window. In practice, that’s often OpenAI, Anthropic, a vector DB, Postgres, Redis, or one of your own HTTP tool calls hanging long enough to fail the request.
If you’re seeing this only after deployment, assume it’s not a LangGraph bug first. It’s usually network latency, bad runtime config, or an agent node doing blocking work that never returns fast enough.
## The Most Common Cause
The #1 cause is a node that makes an external call without a timeout, retry policy, or proper async handling. In TypeScript LangGraph apps, this often happens inside a RunnableLambda, a tool function, or a custom node that calls fetch()/SDK clients directly.
Here’s the broken pattern, followed by the fixed version.

Broken: no timeout on the outbound call, so a hung request stalls the whole run.

```ts
import { StateGraph } from "@langchain/langgraph";

const graph = new StateGraph({
  channels: {
    messages: {
      value: (x: any[], y: any[]) => x.concat(y),
      default: () => [],
    },
  },
});

graph.addNode("lookupCustomer", async (state) => {
  // No timeout. If this hangs in prod, the whole run stalls.
  const res = await fetch("https://api.internal.company.com/customer", {
    method: "POST",
    body: JSON.stringify({ id: state.customerId }),
  });
  return { customer: await res.json() };
});
```

Fixed: every call goes through an `AbortController`-based timeout wrapper.

```ts
import { StateGraph } from "@langchain/langgraph";

const withTimeout = async (url: string, init: RequestInit, ms = 5000) => {
  const controller = new AbortController();
  const id = setTimeout(() => controller.abort(), ms);
  try {
    const res = await fetch(url, { ...init, signal: controller.signal });
    if (!res.ok) {
      throw new Error(`HTTP ${res.status} from ${url}`);
    }
    return await res.json();
  } finally {
    clearTimeout(id);
  }
};

const graph = new StateGraph({
  channels: {
    messages: {
      value: (x: any[], y: any[]) => x.concat(y),
      default: () => [],
    },
  },
});

graph.addNode("lookupCustomer", async (state) => {
  const customer = await withTimeout(
    "https://api.internal.company.com/customer",
    {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ id: state.customerId }),
    },
    5000
  );
  return { customer };
});
```
In production, the failure usually surfaces as one of these:
- `Error: connection timeout in production`
- `AbortError: The operation was aborted`
- `LangGraphError: Failed to execute node "lookupCustomer"`
- A wrapped provider error like `OpenAI API request timed out`
If you’re calling an LLM inside the node, make sure the model client also has explicit timeout settings. Don’t rely on defaults.
```ts
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  timeout: 5000, // ms, applied per request
  maxRetries: 2,
});
```
## Other Possible Causes
### 1) Cold starts or serverless timeouts
If your LangGraph app runs on Vercel, Lambda, Cloud Run, or similar platforms, the process may be killed before the graph finishes.
```ts
export const maxDuration = 30; // platform-specific
```
If your graph does multi-step retrieval plus tool calls plus LLM generation, that can exceed the platform limit fast.
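If you cannot raise the platform limit, you can at least fail loudly before it is hit. Below is a minimal sketch (the `withDeadline` helper is illustrative, not a LangGraph or platform API) that races the graph run against a deadline just below the platform cutoff:

```ts
// Illustrative helper: reject with a clear error before the platform
// kills the process, instead of dying silently mid-run.
async function withDeadline<T>(
  promise: Promise<T>,
  ms: number,
  label = "operation"
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} exceeded ${ms}ms deadline`)),
      ms
    );
  });
  try {
    // Whichever settles first wins: the real work or the deadline rejection.
    return await Promise.race([promise, deadline]);
  } finally {
    clearTimeout(timer);
  }
}

// Usage, assuming `graph` is your compiled LangGraph app:
// const result = await withDeadline(graph.invoke(input), 25_000, "graph run");
```

Keeping the deadline a few seconds under `maxDuration` leaves room to log the failure before the platform terminates the process.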
### 2) Bad environment variables in production
A missing endpoint or wrong base URL can look like a timeout because the client keeps retrying against nowhere.
```ts
const baseUrl = process.env.INTERNAL_API_URL;
if (!baseUrl) throw new Error("INTERNAL_API_URL is required");
```
Common mistakes:
- Using `localhost` in production
- Pointing to a private VPC address from public serverless code
- Forgetting region-specific endpoints
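Those mistakes can be caught at startup instead of surfacing later as timeouts. A minimal sketch (the helper name and checks are illustrative):

```ts
// Illustrative fail-fast check for production config.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`${name} is required`);
  }
  // A localhost URL in production almost always means a copied dev config.
  if (process.env.NODE_ENV === "production" && value.includes("localhost")) {
    throw new Error(`${name} points at localhost in production`);
  }
  return value;
}

// const baseUrl = requireEnv("INTERNAL_API_URL");
```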
### 3) Database pool exhaustion
LangGraph nodes often fan out requests. If each run opens a fresh DB connection, production traffic will exhaust the pool and requests will stall.
```ts
// Bad: opens a fresh connection inside every request/node
await prisma.$connect();
```

```ts
// Good: one shared client at module scope, reused across runs
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();
```
For a Postgres client like `pg` (node-postgres), cap the pool size and set a connection timeout:
```ts
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10, // cap concurrent connections
  connectionTimeoutMillis: 5000, // fail fast instead of queueing forever
});
```
### 4) Tool functions doing blocking work
A tool that hits another service without guardrails can block the whole graph execution.
```ts
const tools = [
  async function searchPolicyDocs(query: string) {
    // No timeout: a slow search service stalls the whole turn.
    const res = await fetch(
      `${process.env.SEARCH_URL}/query?q=${encodeURIComponent(query)}`
    );
    return await res.text();
  },
];
```
Fix it with timeouts and bounded retries:
```ts
// withTimeout is the AbortController helper shown earlier.
async function searchPolicyDocs(query: string) {
  return withTimeout(
    `${process.env.SEARCH_URL}/query?q=${encodeURIComponent(query)}`,
    {},
    3000
  );
}
```
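For the "bounded retries" half, a small helper like the sketch below works (the name `retryWithBackoff` is illustrative, not a LangGraph API):

```ts
// Illustrative bounded retry with exponential backoff.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Delays grow as baseDelayMs, 2x, 4x, ...
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

Wrapping each timed-out call in `retryWithBackoff` keeps every attempt individually bounded while capping the total number of attempts, so a flaky dependency degrades gracefully instead of hanging the graph.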
## How to Debug It
1. Find the exact failing node
   - Log node entry and exit.
   - Identify whether the failure happens in a `StateGraph.addNode(...)` handler, a tool call, or during model invocation.
   - If you see `Failed to execute node "<name>"`, start there.
2. Measure each external call
   - Wrap every `fetch`, SDK call, and DB query with timing logs. For example:

     ```ts
     const start = Date.now();
     await someCall();
     console.log("someCall took", Date.now() - start);
     ```
3. Check prod-only networking
   - Verify DNS resolution, firewall rules, private subnets, and outbound egress.
   - Test the same URL from inside the deployed runtime if possible.
4. Reduce concurrency
   - If you use parallel branches or multiple tools per turn, cut them down temporarily.
   - A timeout that disappears when concurrency drops usually points to pool exhaustion or downstream rate limiting.
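The entry/exit logging above can be made systematic by wrapping node functions before registering them. A sketch, assuming a simplified node signature (`instrument` is an illustrative name, not part of LangGraph):

```ts
// Illustrative wrapper that logs entry, exit, and latency for a node function.
type NodeFn<S> = (state: S) => Promise<Partial<S>>;

function instrument<S>(name: string, fn: NodeFn<S>): NodeFn<S> {
  return async (state: S) => {
    const start = Date.now();
    console.log(`[node:${name}] start`);
    try {
      const result = await fn(state);
      console.log(`[node:${name}] ok in ${Date.now() - start}ms`);
      return result;
    } catch (err) {
      console.log(`[node:${name}] failed in ${Date.now() - start}ms`);
      throw err;
    }
  };
}

// Usage:
// graph.addNode("lookupCustomer", instrument("lookupCustomer", lookupCustomerFn));
```

With every node wrapped this way, the last `start` line without a matching `ok`/`failed` line points straight at the node that hung.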
## Prevention
- Put explicit timeouts on every outbound dependency:
  - LLM clients
  - HTTP calls
  - DB queries
- Keep graph nodes small and deterministic.
- Add structured logs for:
  - node name
  - latency
  - upstream status code
- In serverless deployments:
  - set platform execution limits consciously
  - avoid cold-start-heavy initialization inside request paths
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.