How to Fix 'cold start latency when scaling' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-22
Tags: cold-start-latency-when-scaling, llamaindex, python

When people say they’re seeing “cold start latency when scaling” in LlamaIndex, they usually mean the first request after a worker spins up is slow enough to miss an SLO. In practice, this shows up when your app creates indexes, loads models, or rebuilds vector stores inside the request path.

The pattern is simple: it works fine on a warm process, then falls apart when autoscaling adds new pods, new Uvicorn workers, or fresh Lambda invocations.

The Most Common Cause

The #1 cause is initializing heavy LlamaIndex objects inside the request handler instead of at startup or behind a shared cache.

That means things like VectorStoreIndex.from_documents(), StorageContext.from_defaults(), Settings.embed_model = ..., or even load_index_from_storage() are getting called per request. On a cold worker, that turns every first hit into a mini bootstrap job.

Broken vs fixed pattern

  • Broken: build the index on every request. Fixed: build it once at startup.
  • Broken: recreate embedding/model clients repeatedly. Fixed: reuse global/shared instances.
  • Broken: cold workers pay the full load cost. Fixed: warm workers serve queries immediately.

# broken.py
from fastapi import FastAPI
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

app = FastAPI()

@app.get("/search")
def search(q: str):
    docs = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(docs)  # expensive on every request
    query_engine = index.as_query_engine()
    return {"answer": str(query_engine.query(q))}

# fixed.py
from fastapi import FastAPI
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

app = FastAPI()
index = None

@app.on_event("startup")
def startup():
    global index
    docs = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(docs)  # build once per process

@app.get("/search")
def search(q: str):
    query_engine = index.as_query_engine()
    return {"answer": str(query_engine.query(q))}

If you’re using persisted storage, the better pattern is to load from disk or object storage during startup:

from llama_index.core import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

That avoids recomputing embeddings and rebuilding the index on every scale event.
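
Putting those two pieces together, here is a minimal serving sketch (assuming ./storage was already produced by a separate ingestion job) that loads the persisted index once per process at startup and keeps the request path read-only:

# serve.py -- sketch; assumes ./storage exists from a prior ingestion run
from fastapi import FastAPI
from llama_index.core import StorageContext, load_index_from_storage

app = FastAPI()
index = None

@app.on_event("startup")
def load_index():
    global index
    # load the prebuilt index from disk; no re-embedding, no directory scan
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)

@app.get("/search")
def search(q: str):
    query_engine = index.as_query_engine()
    return {"answer": str(query_engine.query(q))}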

Other Possible Causes

1) Embedding model initialization is happening too late

If your embedding model client is created lazily during the first query, you’ll see a spike on the first request to each new worker.

from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

# do this at startup, not inside the handler
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

If you set Settings.embed_model inside a route handler, each cold worker pays client setup cost and sometimes network auth overhead too.
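
If you also want to absorb the first network and auth round trip before traffic arrives, one option is a throwaway warm-up call at startup. A sketch, assuming the OpenAIEmbedding setup above and that you are willing to spend one tiny embedding request:

from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# throwaway call at startup: pays client construction and auth cost once,
# before the first real query reaches this worker
Settings.embed_model.get_text_embedding("warmup")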

2) You’re using SimpleDirectoryReader in production hot paths

SimpleDirectoryReader is fine for local development. It is not fine if you call it per request against a large directory tree.

docs = SimpleDirectoryReader("./data").load_data()

Fix it by preloading documents at startup or by persisting the parsed index:

# one-time ingestion job
docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)
index.storage_context.persist("./storage")

3) Your vector store is remote and connection setup is expensive

Pinecone, Qdrant, Weaviate, Postgres pgvector — all of them can add cold-start latency if you create clients repeatedly.

from qdrant_client import QdrantClient

client = QdrantClient(url="http://qdrant:6333")  # create once

If that client gets instantiated inside your route handler or dependency factory per request, scaling will amplify connection setup time and auth handshakes.
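
A sketch of the long-lived pattern with Qdrant, assuming the llama-index-vector-stores-qdrant integration and a hypothetical "docs" collection: create the client and vector store once at module level and attach the index to them, so the request path never rebuilds a connection.

from qdrant_client import QdrantClient
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# created once per process and reused by every request
client = QdrantClient(url="http://qdrant:6333")
vector_store = QdrantVectorStore(client=client, collection_name="docs")
index = VectorStoreIndex.from_vector_store(vector_store)

def search(q: str):
    # the hot path only runs the query; no client or index construction here
    return str(index.as_query_engine().query(q))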

4) You’re running multiple Uvicorn/Gunicorn workers without warming them

Each worker has its own memory space. If you scale from 1 to 8 workers, each one may need to load the index independently.

gunicorn app:app -k uvicorn.workers.UvicornWorker --workers 8

That’s not wrong, but it means your startup path must be cheap. If you see logs like:

  • Loading index from storage...
  • Embedding documents...
  • Building index...

during first traffic on each worker, that’s your cold start tax.
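
If you cannot make startup cheap, you can at least move the cost off the first user request by warming each worker yourself. A sketch, assuming the index is loaded in an earlier startup hook as shown above; it runs one throwaway retrieval (embedding plus vector store lookup, no LLM call) before the worker takes traffic:

@app.on_event("startup")
def warm_worker():
    # one cheap retrieval exercises the embedding model and vector store
    # so the first real request doesn't pay for it
    retriever = index.as_retriever(similarity_top_k=1)
    retriever.retrieve("warmup")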

How to Debug It

  1. Time your startup separately from your request path
    Add timestamps around index loading and query execution.

    import time
    
    t0 = time.time()
    # load index here
    print("startup seconds:", time.time() - t0)
    
  2. Check whether initialization happens per request
    Search for VectorStoreIndex.from_documents, load_index_from_storage, SimpleDirectoryReader, and embedding client creation inside route handlers or dependencies.

  3. Inspect logs for repeated LlamaIndex build messages
    Common signs:

    • ValueError: No existing data found in storage context
    • RuntimeError: Failed to initialize embedding model
    • repeated “loading” messages on every first hit after scaling
  4. Measure worker-level cold starts
    Hit each pod/worker directly. If only the first request to each instance is slow, your problem is process-level initialization, not query logic.
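
    A quick way to see this is to time two back-to-back calls per instance; only the first should be slow. A sketch with hypothetical pod addresses, using the requests library:

    import time
    import requests

    # hypothetical direct addresses for two pods/workers
    workers = ["http://10.0.0.11:8000", "http://10.0.0.12:8000"]

    for base in workers:
        for attempt in (1, 2):
            t0 = time.time()
            requests.get(f"{base}/search", params={"q": "warmup probe"}, timeout=60)
            print(base, "request", attempt, "took", round(time.time() - t0, 2), "seconds")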

Prevention

  • Build indexes in an offline ingestion job, then persist them with storage_context.persist().
  • Load shared resources at app startup, not in route handlers.
  • Keep embedding and vector store clients as long-lived singletons per process.
  • Add a warmup/readiness endpoint if your platform scales aggressively, and have it confirm the index is actually loaded before the instance takes traffic:
    @app.get("/healthz")
    def healthz():
        # only reports ready once startup has loaded the index
        return {"ok": index is not None}

If you want stable latency under autoscaling, treat LlamaIndex setup like database migrations: do it once, outside the hot path. The query endpoint should read from prebuilt state, not assemble it under load.

