How to Fix 'cold start latency during development' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-22
Tags: cold-start-latency-during-development, llamaindex, python

What this error usually means

“Cold start latency during development” in LlamaIndex is not usually a single Python exception. It’s the symptom you see when your first query takes too long because indexes, embeddings, or LLM clients are initialized on demand instead of ahead of time.

It shows up most often during local development, notebook work, or API startup when VectorStoreIndex.from_documents(...), embedding model loading, or remote model calls happen inside the request path.

The Most Common Cause

The #1 cause is rebuilding your index on every request.

That means you are calling VectorStoreIndex.from_documents() inside a function that runs per request, so LlamaIndex re-reads documents, re-chunks them, and re-embeds them every time. The first call feels like a “cold start,” but every later call pays the same price because nothing is cached or persisted.

Broken pattern                       | Fixed pattern
Build index inside request handler   | Build once at startup and reuse
Recompute embeddings every call      | Persist index to disk or vector store
Instantiate Settings.llm repeatedly  | Configure once globally
# BROKEN
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

def answer_question(question: str):
    documents = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    query_engine = index.as_query_engine()
    return query_engine.query(question)

# Every request reloads docs and rebuilds embeddings.
# FIXED
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, load_index_from_storage
from pathlib import Path

PERSIST_DIR = "./storage"

def build_or_load_index():
    if Path(PERSIST_DIR).exists():
        storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        return load_index_from_storage(storage_context)

    documents = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
    return index

index = build_or_load_index()
query_engine = index.as_query_engine()

def answer_question(question: str):
    return query_engine.query(question)

If you are using FastAPI, Flask, or a background worker, keep the index in module scope or initialize it during app startup. Do not hide it inside the route handler unless you want repeated cold starts.
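
For example, with FastAPI you can do the expensive work once in a lifespan handler so every route reuses the same engine. This is a minimal sketch, assuming the build_or_load_index() helper from above lives in the same module (the /ask route and the state dict are illustrative names, not part of LlamaIndex):

from contextlib import asynccontextmanager
from fastapi import FastAPI

state = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Runs once at startup: build or load the index and keep the engine around.
    index = build_or_load_index()
    state["query_engine"] = index.as_query_engine()
    yield
    state.clear()  # runs once at shutdown

app = FastAPI(lifespan=lifespan)

@app.get("/ask")
def ask(question: str):
    # The handler only runs the query; nothing expensive is rebuilt here.
    return {"answer": str(state["query_engine"].query(question))}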

Other Possible Causes

1) Embedding model initialization is happening lazily

If you use OpenAI embeddings or another provider without warming the client up, the first embedding call can be slow.

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

If this is created inside a request path, move it to app startup.
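
If you want that setup cost paid at startup instead of on the first request, one option is a throwaway embedding call right after configuration. A sketch, not a required step; the input string is arbitrary and the call does hit the provider once:

# Force client setup and the first network round trip now, not on the
# first real request. The text itself does not matter.
Settings.embed_model.get_text_embedding("warmup")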

2) You are hitting an LLM provider with no connection reuse

Creating a new client for every query increases latency. This is common with OpenAI, AzureOpenAI, or local HTTP-based models wrapped through LlamaIndex.

from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)

Set it once. If you create OpenAI(...) per request, you pay setup cost every time.
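
The same warm-up idea applies to the LLM client if you want connection setup out of the request path. This is optional and spends a few tokens, so treat it as a sketch rather than a required step:

# Optional warm-up: one tiny completion so connection setup happens at
# startup instead of on the first user query. Costs a few tokens.
Settings.llm.complete("ping")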

3) Your vector store is remote and not warmed up

When using Pinecone, Qdrant, Weaviate, or Postgres-backed stores, the first query may pay network and connection setup costs.

from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="docs")

Keep the client global and avoid recreating it in handlers.
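
If the collection already holds embedded documents, you can attach to it at startup instead of re-ingesting. A sketch, assuming the "docs" collection was populated earlier and Settings.embed_model matches the model used to build it:

from llama_index.core import VectorStoreIndex

# Reuse the existing collection: no re-reading, re-chunking, or re-embedding.
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
query_engine = index.as_query_engine()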

4) You are running under auto-reload during development

Framework reloaders restart the process on file changes. That means your “warm” objects disappear and every change looks like a cold start.

uvicorn app:app --reload

That is fine for development, but expect slower first requests after each code change. If startup time matters during debugging, test without reload:

uvicorn app:app

How to Debug It

  1. Measure where time is spent. Add timing around document loading, indexing, embedding, and querying.

    import time
    
    start = time.perf_counter()
    documents = SimpleDirectoryReader("./data").load_data()
    print("load_data:", time.perf_counter() - start)
    
    start = time.perf_counter()
    index = VectorStoreIndex.from_documents(documents)
    print("from_documents:", time.perf_counter() - start)
    
  2. Check whether indexing happens more than once. Add a log line before from_documents(...). If it prints on every request, that’s your problem.

  3. Verify persistence. If you expect a cached index but still see slow starts, confirm your storage directory exists and contains persisted state.

    from pathlib import Path
    print(Path("./storage").exists())
    
  4. Isolate provider latency. Temporarily swap your LLM/embedding model for a local stub or mock. If latency disappears, the bottleneck is external API initialization or network calls; see the sketch below.
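
    LlamaIndex ships MockLLM and MockEmbedding for exactly this kind of isolation test. A minimal sketch (the embed_dim value is an assumption and should match your real embedding model):

    from llama_index.core import MockEmbedding, Settings
    from llama_index.core.llms import MockLLM

    # These mocks make no network calls; if queries are fast with them,
    # the latency comes from the external provider, not from LlamaIndex.
    Settings.llm = MockLLM(max_tokens=256)
    Settings.embed_model = MockEmbedding(embed_dim=1536)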

Prevention

  • Build indexes at startup and persist them with StorageContext.persist(...).
  • Configure Settings.llm and Settings.embed_model once in one module.
  • Avoid putting SimpleDirectoryReader(...).load_data() or VectorStoreIndex.from_documents(...) inside request handlers.
  • Treat --reload as a dev convenience, not a performance baseline.

If you want predictable startup behavior in LlamaIndex Python apps, make initialization explicit. Cold starts are almost always self-inflicted by lazy setup in the wrong place.

