How to Fix 'cold start latency when scaling' in LangGraph (Python)
What this error usually means
Cold start latency when scaling in LangGraph usually shows up when a worker has to do too much initialization on the first request after a scale-up. In practice, that means your pod, process, or serverless instance spends time loading models, building graph objects, opening DB connections, or compiling schemas before it can answer.
You’ll see this most often in Kubernetes, autoscaled API workers, Lambda-style deployments, or any setup where LangGraph runs behind a process manager that spins up new replicas on demand.
The Most Common Cause
The #1 cause is building heavy objects inside the request path instead of at module load time or in a long-lived startup hook. In LangGraph apps, this usually means creating the StateGraph, LLM client, vector store, retriever, or database connection inside the handler that executes per request.
That pattern works locally. Under scale-out, it creates repeated cold starts because every new worker repeats the same expensive initialization.
Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Builds graph and clients on every request | Builds once and reuses across requests |
| High first-request latency after scaling | Lower warm-path latency |
| Harder to observe and cache | Easier to instrument and stabilize |
```python
# BROKEN: expensive setup happens inside the request handler
from typing import TypedDict

from langchain_openai import ChatOpenAI
from langgraph.graph import END, StateGraph

class State(TypedDict):
    input: str
    answer: str

def handle_request(user_input: str):
    llm = ChatOpenAI(model="gpt-4o-mini")  # created every time
    graph = StateGraph(State)              # created every time

    def call_model(state: State):
        return {"answer": llm.invoke(state["input"]).content}

    graph.add_node("call_model", call_model)
    graph.set_entry_point("call_model")
    graph.add_edge("call_model", END)
    app = graph.compile()                  # compiled every time
    return app.invoke({"input": user_input})
```
```python
# FIXED: build once at import/startup time
from typing import TypedDict

from langchain_openai import ChatOpenAI
from langgraph.graph import END, StateGraph

class State(TypedDict):
    input: str
    answer: str

llm = ChatOpenAI(model="gpt-4o-mini")  # created once per process

def call_model(state: State):
    return {"answer": llm.invoke(state["input"]).content}

graph = StateGraph(State)
graph.add_node("call_model", call_model)
graph.set_entry_point("call_model")
graph.add_edge("call_model", END)
app = graph.compile()  # compiled once per process

def handle_request(user_input: str):
    return app.invoke({"input": user_input})
```
If you’re using FastAPI, keep the compiled app in application state (or at module scope) and build it during startup. If you’re running workers under Gunicorn/Uvicorn, make sure each worker initializes once, not once per endpoint call.
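Here’s a minimal FastAPI sketch of that idea, assuming a `build_graph()` helper that returns an uncompiled StateGraph like the one built above. The lifespan hook compiles the graph once per worker, and request handlers only reuse it:

```python
# Sketch: one compiled graph per worker process, shared across requests.
# build_graph() is assumed to return an uncompiled StateGraph.
from contextlib import asynccontextmanager

from fastapi import FastAPI, Request

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Runs once when the worker starts, not on every request
    app.state.graph_app = build_graph().compile()
    yield

app = FastAPI(lifespan=lifespan)

@app.post("/chat")
async def chat(request: Request, payload: dict):
    graph_app = request.app.state.graph_app  # reused across requests
    return await graph_app.ainvoke({"input": payload["input"]})
```

Under Gunicorn/Uvicorn, each worker process runs this lifespan once, so every replica pays the compile cost exactly once instead of on every call.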
Other Possible Causes
1) Lazy loading a model or embedding index on first node execution
If your first LangGraph node loads a model from disk or initializes an embedding index, scale-out will expose that latency immediately.
```python
# expensive lazy init inside node
def retrieve(state):
    index = load_faiss_index("/mnt/index.faiss")
    return {"docs": index.search(state["query"])}
```
Fix it by loading once:
```python
# loaded once at module/startup time
index = load_faiss_index("/mnt/index.faiss")

def retrieve(state):
    return {"docs": index.search(state["query"])}
```
2) Recompiling the graph repeatedly
StateGraph.compile() is not something you want in the hot path. If you compile per request or per tenant without caching, every new replica pays that cost again.
```python
# bad
def get_app():
    graph = build_graph()
    return graph.compile()
```
Use one compiled instance per process:
```python
# good
app = build_graph().compile()
```
If you need tenant-specific behavior, cache compiled graphs by tenant key instead of rebuilding them blindly.
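For example, a small per-process cache keyed by tenant ID keeps the cost to one compile per tenant per worker. A minimal sketch, assuming a hypothetical `build_graph_for_tenant()` helper that returns an uncompiled, tenant-configured StateGraph:

```python
# Sketch: compile each tenant's graph at most once per process.
# build_graph_for_tenant() is a hypothetical helper for this example.
from functools import lru_cache

@lru_cache(maxsize=128)
def get_compiled_app(tenant_id: str):
    return build_graph_for_tenant(tenant_id).compile()

def handle_request(tenant_id: str, user_input: str):
    app = get_compiled_app(tenant_id)  # cache hit after the first request per tenant
    return app.invoke({"input": user_input})
```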
3) Slow startup dependencies: DB pools, secrets fetches, remote config
A worker may look like it has “cold start latency” when the real issue is initialization blocked on Postgres, Redis, Vault, S3 config files, or secret managers.
```python
# startup path blocked by remote calls
settings = fetch_remote_settings()
db = create_engine(settings.db_url)
redis = Redis.from_url(settings.redis_url)
```
Move those calls into startup hooks and add timeouts. If your platform supports prewarming, use it.
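As a sketch of the timeout part: `fetch_remote_settings()` is the hypothetical helper from the snippet above (the `timeout` argument is an assumption about its signature), while the SQLAlchemy and redis-py options shown are standard ways to bound connection setup:

```python
# Sketch: keep remote setup in the startup path, but make it fail fast
# instead of hanging a new worker.
from redis import Redis
from sqlalchemy import create_engine

settings = fetch_remote_settings(timeout=5)  # hypothetical helper, assumed timeout arg

# connect_timeout is passed through to the Postgres driver (e.g. psycopg2)
db = create_engine(settings.db_url, connect_args={"connect_timeout": 5})

# redis-py accepts socket timeouts directly
redis = Redis.from_url(settings.redis_url, socket_connect_timeout=2, socket_timeout=2)
```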
4) Too much work in `__init__` for custom nodes/tools
Custom node classes sometimes hide expensive setup in constructors. That makes scaling painful because every worker recreates them.
```python
class MyTool:
    def __init__(self):
        self.client = build_huge_client()
        self.cache = load_cache_from_disk()
```
Prefer dependency injection:
```python
class MyTool:
    def __init__(self, client, cache):
        self.client = client
        self.cache = cache
```
Then create those dependencies once during app startup.
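A minimal sketch of that wiring, reusing the hypothetical `build_huge_client()` and `load_cache_from_disk()` helpers from the snippet above (the `.search()` call stands in for whatever the client actually does):

```python
# Created once per process, at import/startup time
client = build_huge_client()
cache = load_cache_from_disk()
tool = MyTool(client=client, cache=cache)

def lookup_node(state):
    # Node functions close over the prebuilt tool; nothing heavy runs here
    return {"result": tool.client.search(state["input"])}
```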
How to Debug It
- Measure startup vs request latency separately
  - Add timing around app creation and around `app.invoke()` (see the timing sketch after this list).
  - If `compile()` or dependency setup is slow, you’ve found the bottleneck.
- Check whether latency spikes only on new replicas
  - If only the first request after autoscaling is slow, this is a cold-start problem.
  - If every request is slow, look at node logic or external services instead.
- Log initialization points
  - Add logs around model creation, DB connection setup, retriever loading, and `StateGraph.compile()`.
  - You want to know exactly which line runs during scale-up.
- Profile one worker from boot to first response
  - Use `py-spy`, `cProfile`, or simple wall-clock logging.
  - Focus on imports and constructors before chasing LangGraph internals.
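A minimal timing sketch for the first step, assuming a `build_graph()` helper; it separates the one-time startup cost from the per-request cost so you can see which one autoscaling is exposing:

```python
# Sketch: log startup cost (paid once per worker) separately from
# warm-path cost (paid per request).
import logging
import time

logger = logging.getLogger(__name__)

t0 = time.perf_counter()
app = build_graph().compile()
logger.info("graph startup took %.2fs", time.perf_counter() - t0)

def handle_request(user_input: str):
    t0 = time.perf_counter()
    result = app.invoke({"input": user_input})
    logger.info("invoke took %.2fs", time.perf_counter() - t0)
    return result
```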
Prevention
- Build graphs and clients at module scope or in explicit startup hooks.
- Cache compiled graphs and heavy resources per process.
- Keep node functions thin: no model loading, no file I/O, no network setup inside the hot path.
- Add startup timing metrics so you catch regressions before autoscaling exposes them.
If you want one rule to remember: in LangGraph Python apps, treat compile() and client initialization as deployment-time work, not request-time work. That’s usually enough to eliminate cold start latency when scaling.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.