# How to Fix 'cold start latency when scaling' in LlamaIndex (Python)
When people say they’re seeing “cold start latency when scaling” in LlamaIndex, they usually mean the first request after a worker spins up is slow enough to miss an SLO. In practice, this shows up when your app creates indexes, loads models, or rebuilds vector stores inside the request path.
The pattern is simple: it works fine on a warm process, then falls apart when autoscaling adds new pods, new Uvicorn workers, or fresh Lambda invocations.
## The Most Common Cause
The #1 cause is initializing heavy LlamaIndex objects inside the request handler instead of at startup or behind a shared cache.
That means calls like `VectorStoreIndex.from_documents()`, `StorageContext.from_defaults()`, `Settings.embed_model = ...`, or even `load_index_from_storage()` are happening per request. On a cold worker, that turns every first hit into a mini bootstrap job.
### Broken vs fixed pattern
| Broken | Fixed |
|---|---|
| Build index on every request | Build once at startup |
| Recreate embeddings/model clients repeatedly | Reuse global/shared instances |
| Cold workers pay full load cost | Warm workers serve queries immediately |
```python
# broken.py
from fastapi import FastAPI
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

app = FastAPI()

@app.get("/search")
def search(q: str):
    docs = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(docs)  # expensive on every request
    query_engine = index.as_query_engine()
    return {"answer": str(query_engine.query(q))}
```
```python
# fixed.py
from fastapi import FastAPI
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

app = FastAPI()
index = None

@app.on_event("startup")
def startup():
    global index
    docs = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(docs)  # build once per process

@app.get("/search")
def search(q: str):
    query_engine = index.as_query_engine()
    return {"answer": str(query_engine.query(q))}
```
If you’re using persisted storage, the better pattern is to load from disk or object storage during startup:
```python
from llama_index.core import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
```
That avoids recomputing embeddings and rebuilding the index on every scale event.
## Other Possible Causes
### 1) Embedding model initialization is happening too late
If your embedding model client is created lazily during the first query, you’ll see a spike on the first request to each new worker.
```python
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

# do this at startup, not inside the handler
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
```
If you set `Settings.embed_model` inside a route handler, each cold worker pays client setup cost, and sometimes network auth overhead too.
### 2) You’re using SimpleDirectoryReader in production hot paths

`SimpleDirectoryReader` is fine for local development. It is not fine if you call it per request against a large directory tree.
```python
docs = SimpleDirectoryReader("./data").load_data()
```
Fix it by preloading documents at startup or by persisting the parsed index:
```python
# one-time ingestion job
docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)
index.storage_context.persist("./storage")
```
### 3) Your vector store is remote and connection setup is expensive
Pinecone, Qdrant, Weaviate, Postgres pgvector — all of them can add cold-start latency if you create clients repeatedly.
```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://qdrant:6333")  # create once
```
If that client gets instantiated inside your route handler or dependency factory per request, scaling will amplify connection setup time and auth handshakes.
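One portable way to enforce "create once" is to memoize the client factory so every request in a process shares a single instance. Here is a minimal sketch using `functools.lru_cache`; `ExpensiveClient` is a hypothetical stand-in for a real client such as `QdrantClient`, whose constructor would actually open connections and authenticate:

```python
from functools import lru_cache

class ExpensiveClient:
    """Hypothetical stand-in for a real vector store client; the real
    constructor would open connections and perform auth handshakes."""
    instances = 0

    def __init__(self, url: str):
        ExpensiveClient.instances += 1
        self.url = url

@lru_cache(maxsize=1)
def get_client() -> ExpensiveClient:
    # Constructed once per process; every caller reuses the same object.
    return ExpensiveClient("http://qdrant:6333")

print(get_client() is get_client())  # True: one shared instance
```

In FastAPI, wiring this up as `Depends(get_client)` hands every request the same cached client instead of rebuilding it per call.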
### 4) You’re running multiple Uvicorn/Gunicorn workers without warming them
Each worker has its own memory space. If you scale from 1 to 8 workers, each one may need to load the index independently.
```shell
gunicorn app:app -k uvicorn.workers.UvicornWorker --workers 8
```
That’s not wrong, but it means your startup path must be cheap. If you see log lines like:

- `Loading index from storage...`
- `Embedding documents...`
- `Building index...`

during first traffic on each worker, that’s your cold start tax.
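One way to soften that tax is Gunicorn's `preload_app` setting, which imports the app once in the master process before forking, so anything built at module import time is shared with workers via copy-on-write memory. A sketch of a `gunicorn.conf.py`; the values are illustrative:

```python
# gunicorn.conf.py -- illustrative values
workers = 8
worker_class = "uvicorn.workers.UvicornWorker"

# Import the app module once in the master before forking, so objects
# built at module scope (e.g. an index loaded at top level) are shared
# with workers through copy-on-write memory.
preload_app = True

def post_worker_init(worker):
    # Runs in each worker after fork: a reasonable place to log
    # readiness or trigger a warmup query before taking traffic.
    worker.log.info("worker %s warmed", worker.pid)
```

One caveat: `preload_app` only covers work done at import time. FastAPI startup hooks still run per worker after the fork, so to benefit from preloading you need to move the index load to module scope.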
## How to Debug It

- **Time your startup separately from your request path.** Add timestamps around index loading and query execution:

  ```python
  import time

  t0 = time.time()
  # load index here
  print("startup seconds:", time.time() - t0)
  ```

- **Check whether initialization happens per request.** Search for `VectorStoreIndex.from_documents`, `load_index_from_storage`, `SimpleDirectoryReader`, and embedding client creation inside route handlers or dependencies.
- **Inspect logs for repeated LlamaIndex build messages.** Common signs:
  - `ValueError: No existing data found in storage context`
  - `RuntimeError: Failed to initialize embedding model`
  - repeated “loading” messages on every first hit after scaling
- **Measure worker-level cold starts.** Hit each pod/worker directly. If only the first request to each instance is slow, your problem is process-level initialization, not query logic.
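That last check is easy to script. Here is a sketch of a helper that compares a single worker's first-request latency to its warm median; the `fetch` callable is a placeholder you would replace with a real HTTP call to one specific pod:

```python
import time
from statistics import median

def cold_start_report(fetch, n: int = 5) -> dict:
    # Call `fetch` n times against a single worker/pod and compare the
    # first latency to the median of the rest. A large gap points at
    # process-level initialization rather than query logic.
    latencies = []
    for _ in range(n):
        t0 = time.perf_counter()
        fetch()
        latencies.append(time.perf_counter() - t0)
    return {"first_s": latencies[0], "warm_median_s": median(latencies[1:])}

# Hypothetical fetch -- swap in a real call to one pod, e.g.
#   lambda: urllib.request.urlopen("http://pod-0:8000/search?q=ping").read()
print(cold_start_report(lambda: time.sleep(0.001)))
```

If `first_s` is orders of magnitude above `warm_median_s` on every fresh instance, the fix belongs in startup, not in the query path.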
## Prevention

- Build indexes in an offline ingestion job, then persist them with `storage_context.persist()`.
- Load shared resources at app startup, not in route handlers.
- Keep embedding and vector store clients as long-lived singletons per process.
- Add a warmup endpoint if your platform scales aggressively:

  ```python
  @app.get("/healthz")
  def healthz():
      return {"ok": True}
  ```
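A bare health check keeps the pod alive but initializes nothing. For a real warmup, exercise the query path once before traffic arrives. A minimal sketch; `index` and `probe_query` are assumptions standing in for the process-level index built at startup and any cheap throwaway query:

```python
def warmup(index, probe_query: str = "warmup") -> None:
    # One throwaway query forces lazy components (embedding client,
    # vector store connection, tokenizer) to initialize before real traffic.
    engine = index.as_query_engine()
    engine.query(probe_query)

# Call warmup(index) at the end of the startup hook, or expose it behind
# a /warmup route that your platform hits before routing traffic.
```

The throwaway query costs one round trip at startup, which is exactly where you want to pay it.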
If you want stable latency under autoscaling, treat LlamaIndex setup like database migrations: do it once, outside the hot path. The query endpoint should read from prebuilt state, not assemble it under load.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.