# How to Fix 'connection timeout in production' in LangGraph (Python)
A "connection timeout in production" error in LangGraph usually means your graph is waiting on an external call that never returns before the network, load balancer, or client timeout kicks in. In practice, this shows up when a node calls an LLM, tool, database, or HTTP API and the request hangs long enough for the runtime to kill it.
In LangGraph Python apps, this is rarely a LangGraph bug. It’s usually a slow dependency, bad async handling, or a missing timeout on the underlying client.
## The Most Common Cause
The #1 cause is blocking I/O inside an async LangGraph node.
If you define an async node but call a synchronous client inside it, you can stall the event loop. In production, that often surfaces as:
- `httpx.ReadTimeout`
- `openai.APITimeoutError`
- `asyncio.TimeoutError`
- upstream 504s from your reverse proxy
### Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Uses sync client inside `async def` | Uses async client or offloads sync work properly |
| No explicit timeout | Sets timeouts at the client level |
| Can block the event loop | Keeps event loop responsive |
```python
# BROKEN
from typing import TypedDict

import openai
from langgraph.graph import END, StateGraph


class State(TypedDict):
    prompt: str
    answer: str


client = openai.OpenAI()  # sync client


async def call_model(state: State):
    # This blocks the event loop in an async node
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": state["prompt"]}],
    )
    return {"answer": resp.choices[0].message.content}


graph = StateGraph(State)
graph.add_node("call_model", call_model)
graph.set_entry_point("call_model")
graph.add_edge("call_model", END)
app = graph.compile()
```
```python
# FIXED
from typing import TypedDict

import httpx
from langgraph.graph import END, StateGraph
from openai import AsyncOpenAI


class State(TypedDict):
    prompt: str
    answer: str


# Explicit timeouts: 5s to connect, 20s overall
client = AsyncOpenAI(
    timeout=httpx.Timeout(20.0, connect=5.0),
)


async def call_model(state: State):
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": state["prompt"]}],
    )
    return {"answer": resp.choices[0].message.content}


graph = StateGraph(State)
graph.add_node("call_model", call_model)
graph.set_entry_point("call_model")
graph.add_edge("call_model", END)
app = graph.compile()
```
If you must use a sync library, run it in a thread pool instead of calling it directly from `async def`.
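A minimal sketch of that offload using `asyncio.to_thread` (Python 3.9+). The node and state shapes here are illustrative stand-ins, not LangGraph's API:

```python
import asyncio
import time


def slow_sync_lookup(query: str) -> str:
    # Stand-in for a blocking SDK call (requests, a sync ORM, etc.)
    time.sleep(0.1)
    return f"result for {query}"


async def lookup_node(state: dict) -> dict:
    # asyncio.to_thread runs the blocking call in a worker thread,
    # so the event loop stays free to serve other requests
    answer = await asyncio.to_thread(slow_sync_lookup, state["prompt"])
    return {"answer": answer}


print(asyncio.run(lookup_node({"prompt": "hello"})))
# prints {'answer': 'result for hello'}
```

The same pattern works inside a real LangGraph node: keep the node `async def`, and push every blocking call through `asyncio.to_thread` (or an executor) rather than calling it inline.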
## Other Possible Causes

### 1) Your tool call has no timeout
A common failure mode is a tool that waits forever on a database or HTTP request.
```python
# BAD: no timeout
import requests


def fetch_customer(customer_id: str):
    return requests.get(f"https://api.example.com/customers/{customer_id}").json()
```

```python
# GOOD: explicit timeout
import requests


def fetch_customer(customer_id: str):
    resp = requests.get(
        f"https://api.example.com/customers/{customer_id}",
        timeout=(5, 20),  # (connect, read) seconds
    )
    resp.raise_for_status()
    return resp.json()
```
### 2) Your deployment platform times out before LangGraph finishes
If you’re behind Nginx, ALB, Cloud Run, Vercel, or an API gateway, the platform may cut the request off first.
```nginx
# Example Nginx config
proxy_connect_timeout 5s;
proxy_send_timeout 30s;
proxy_read_timeout 30s;
send_timeout 30s;
```
If your graph regularly takes 45 seconds and your proxy kills requests at 30 seconds, LangGraph never gets to finish.
### 3) Streaming is enabled but your consumer stops reading

With streaming runs like `app.astream(...)`, if the client disconnects or stops consuming chunks, the server side can appear stuck until socket cleanup happens.
```python
# Make sure you fully consume the stream
async for chunk in app.astream(input_state):
    print(chunk)
```
Also check browser clients and gateways that buffer responses instead of passing chunks through.
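To make a stalled stream fail fast instead of hanging, you can bound the wait for each chunk. A sketch with a stand-in async generator in place of `app.astream(...)`:

```python
import asyncio


async def fake_stream():
    # Stand-in for app.astream(...): yields chunks with a short delay
    for chunk in ("a", "b", "c"):
        await asyncio.sleep(0.01)
        yield chunk


async def consume() -> list:
    agen = fake_stream()
    chunks = []
    while True:
        try:
            # Bound the wait for each chunk so a stalled producer
            # raises TimeoutError instead of hanging forever
            chunk = await asyncio.wait_for(agen.__anext__(), timeout=5.0)
        except StopAsyncIteration:
            break
        chunks.append(chunk)
    return chunks


print(asyncio.run(consume()))  # prints ['a', 'b', 'c']
```

With a real graph you would swap `fake_stream()` for `app.astream(input_state)`; the per-chunk `asyncio.wait_for` turns a silent stall into an explicit `TimeoutError` you can log and alert on.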
### 4) You have recursive graph logic or runaway retries
A cycle with no proper stop condition can look like a timeout in production.
```python
# BAD: retry loop without a hard cap
def should_retry(state):
    return True
```

Use explicit counters in state and bail out after a small number of attempts.

```python
# GOOD: cap attempts at three
def should_retry(state):
    return state["attempts"] < 3
```
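Wired together, the counter looks like this. This is a dependency-free sketch; in a real graph, `should_retry` would drive a conditional edge and `attempts` would live in your `State`:

```python
# The node bumps an attempts counter in state, and should_retry
# reads it to decide whether to loop again
def flaky_node(state: dict) -> dict:
    return {**state, "attempts": state.get("attempts", 0) + 1}


def should_retry(state: dict) -> bool:
    return state["attempts"] < 3


state = {"attempts": 0}
while True:
    state = flaky_node(state)
    if not should_retry(state):
        break
print(state["attempts"])  # prints 3: the loop stops after three attempts
```

LangGraph also stops runaway cycles on its own via the `recursion_limit` entry in the `config` dict passed to `invoke`/`astream`, raising an error once the limit is exceeded rather than hanging.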
## How to Debug It
- **Find the exact failing layer.**
  - Check whether the error is `httpx.ReadTimeout`, `asyncio.TimeoutError`, `openai.APITimeoutError`, or an upstream `504 Gateway Timeout`.
  - If it's a gateway timeout, LangGraph may be innocent; your infra is cutting the request off.
- **Time each node.**
  - Wrap nodes with timing logs.
  - Log start/end timestamps around every tool and model call.
  - The slowest node is usually obvious within one run.
- **Remove one external dependency at a time.**
  - Replace LLM calls with static returns.
  - Replace DB/tool calls with mocks.
  - If the timeout disappears after removing one dependency, that's your culprit.
- **Check sync code inside async nodes.**
  - Look for `requests`, sync SDKs, file I/O, and blocking ORM calls inside `async def`.
  - In LangGraph Python apps using `StateGraph`, this is one of the most common production mistakes.
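For the "time each node" step, a simple decorator is enough. `timed` here is a hypothetical helper, and for `async def` nodes you would write the async equivalent:

```python
import functools
import time


def timed(node_fn):
    # Logs each node's wall-clock duration so the slow dependency
    # stands out within a single production run
    @functools.wraps(node_fn)
    def wrapper(state):
        start = time.perf_counter()
        try:
            return node_fn(state)
        finally:
            print(f"{node_fn.__name__} took {time.perf_counter() - start:.2f}s")
    return wrapper


@timed
def lookup(state):
    time.sleep(0.05)  # stand-in for a tool or model call
    return {"answer": "ok"}


lookup({"prompt": "hi"})
```

The `try/finally` matters: the duration is logged even when the node raises, which is exactly the case you are debugging.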
## Prevention
- **Set timeouts everywhere:**
  - HTTP clients
  - LLM SDKs
  - DB queries
  - reverse proxies and ingress controllers
- **Keep async nodes truly async:**
  - use async SDKs where available
  - don't block the event loop with sync network calls
- **Add hard limits to graph execution:**
  - max retries per node
  - max recursion depth / loop count
  - per-request deadlines propagated through state or context
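The last point can be sketched as an absolute deadline carried in state and checked at the top of every node. The field names here (`deadline`, `budget_s`) are assumptions, not a LangGraph convention:

```python
import time


def make_state(prompt: str, budget_s: float = 30.0) -> dict:
    # Store an absolute monotonic deadline so the budget survives
    # being passed from node to node
    return {"prompt": prompt, "deadline": time.monotonic() + budget_s}


def check_deadline(state: dict) -> None:
    # Call this at the top of every node so a slow run fails fast
    # instead of waiting for the proxy to kill it
    if time.monotonic() > state["deadline"]:
        raise TimeoutError("per-request deadline exceeded")


state = make_state("hello", budget_s=0.0)
time.sleep(0.01)
try:
    check_deadline(state)
except TimeoutError as exc:
    print(exc)  # prints: per-request deadline exceeded
```

Using `time.monotonic()` rather than `time.time()` keeps the deadline immune to wall-clock adjustments, and raising your own `TimeoutError` gives you a clean error to catch and log before the infrastructure timeout fires.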
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.