How to Fix 'cold start latency during development' in LlamaIndex (Python)
What this error usually means
"Cold start latency during development" in LlamaIndex is not usually a single Python exception. It's the symptom you see when your first query takes too long because indexes, embeddings, or LLM clients are being initialized on demand instead of ahead of time.
It shows up most often during local development, notebook work, or API startup when VectorStoreIndex.from_documents(...), embedding model loading, or remote model calls happen inside the request path.
The Most Common Cause
The #1 cause is rebuilding your index on every request.
That means you are calling VectorStoreIndex.from_documents() inside a function that runs per request, so LlamaIndex re-reads documents, re-chunks them, and re-embeds them every time. The first call feels like a “cold start,” and every subsequent call is still expensive if the process restarts often.
| Broken pattern | Fixed pattern |
|---|---|
| Build index inside request handler | Build once at startup and reuse |
| Recompute embeddings every call | Persist index to disk or vector store |
| Instantiate `Settings.llm` repeatedly | Configure once globally |
```python
# BROKEN
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

def answer_question(question: str):
    documents = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    query_engine = index.as_query_engine()
    return query_engine.query(question)

# Every request reloads docs and rebuilds embeddings.
```
```python
# FIXED
from pathlib import Path

from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"

def build_or_load_index():
    # Reuse the persisted index if it exists; build and persist it otherwise.
    if Path(PERSIST_DIR).exists():
        storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        return load_index_from_storage(storage_context)
    documents = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
    return index

# Built once at import time, reused by every request.
index = build_or_load_index()
query_engine = index.as_query_engine()

def answer_question(question: str):
    return query_engine.query(question)
```
If you are using FastAPI, Flask, or a background worker, keep the index in module scope or initialize it during app startup. Do not hide it inside the route handler unless you want repeated cold starts.
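If keeping the engine in module scope is awkward (for example when several modules need it), a cached accessor gives the same build-once behavior. This is a framework-agnostic sketch: `get_query_engine` is a hypothetical name, and a plain `object()` stands in for the real query engine so the example is self-contained.

```python
from functools import lru_cache

build_count = 0  # instrumentation for this sketch only

@lru_cache(maxsize=1)
def get_query_engine():
    """Build (or load) the index exactly once per process."""
    global build_count
    build_count += 1
    # In a real app this would return build_or_load_index().as_query_engine();
    # a stand-in object keeps the sketch dependency-free.
    return object()

# Every caller gets the same cached engine; the expensive build runs once.
first = get_query_engine()
second = get_query_engine()
```

Route handlers then call `get_query_engine()` freely: only the first call pays the construction cost.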
Other Possible Causes
1) Embedding model initialization is happening lazily
If you use OpenAI embeddings or another provider without warming the client up, the first embedding call can be slow.
```python
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
```
If this is created inside a request path, move it to app startup.
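A cheap way to pay the lazy-initialization cost before users arrive is a throwaway call at startup. The sketch below simulates the pattern with a stub (`LazyEmbedder` is illustrative, not LlamaIndex API); with a real model you would issue one dummy embedding call, e.g. something like `Settings.embed_model.get_text_embedding("warm-up")`, during app startup.

```python
import time

class LazyEmbedder:
    """Stand-in for an embedding client that is expensive to initialize."""

    def __init__(self):
        self._client = None

    def embed(self, text: str) -> list[float]:
        if self._client is None:   # lazy init happens on the first call
            time.sleep(0.05)       # simulate client/connection setup
            self._client = object()
        return [0.0] * 8           # dummy vector

embedder = LazyEmbedder()

# Warm-up at startup: the throwaway call absorbs the setup cost...
start = time.perf_counter()
embedder.embed("warm-up")
cold = time.perf_counter() - start

# ...so real requests see only steady-state latency.
start = time.perf_counter()
embedder.embed("real query")
warm = time.perf_counter() - start
```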
2) You are hitting an LLM provider with no connection reuse
Creating a new client for every query increases latency. This is common with OpenAI, AzureOpenAI, or local HTTP-based models wrapped through LlamaIndex.
```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
```
Set it once. If you create OpenAI(...) per request, you pay setup cost every time.
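The effect is easy to demonstrate without any network calls. In this sketch, `RemoteClient` is an illustrative stand-in for an LLM client whose constructor does expensive setup (TLS handshakes, config parsing); it simply counts how many times that setup runs.

```python
class RemoteClient:
    """Stand-in for an LLM/HTTP client with expensive setup (illustrative)."""

    setups = 0

    def __init__(self):
        RemoteClient.setups += 1  # where connection/config setup would happen

    def complete(self, prompt: str) -> str:
        return "ok"

# Broken: a fresh client per request pays setup cost every time.
for _ in range(3):
    RemoteClient().complete("hi")
per_request_setups = RemoteClient.setups

# Fixed: one module-scoped client, reused by every request.
RemoteClient.setups = 0
client = RemoteClient()
for _ in range(3):
    client.complete("hi")
reused_setups = RemoteClient.setups
```

Configuring `Settings.llm` once at import time is the module-scoped variant of the "fixed" half.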
3) Your vector store is remote and not warmed up
When using Pinecone, Qdrant, Weaviate, or Postgres-backed stores, the first query may pay network and connection setup costs.
```python
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="docs")
```
Keep the client global and avoid recreating it in handlers.
4) You are running under auto-reload during development
Framework reloaders restart the process on file changes. That means your “warm” objects disappear and every change looks like a cold start.
```shell
uvicorn app:app --reload
```
That is fine for development, but expect slower first requests after each code change. If startup time matters during debugging, test without reload:
```shell
uvicorn app:app
```
How to Debug It
- Measure where time is spent. Add timing around document loading, indexing, embedding, and querying.

  ```python
  import time

  start = time.perf_counter()
  documents = SimpleDirectoryReader("./data").load_data()
  print("load_data:", time.perf_counter() - start)

  start = time.perf_counter()
  index = VectorStoreIndex.from_documents(documents)
  print("from_documents:", time.perf_counter() - start)
  ```

- Check whether indexing happens more than once. Add a log line before `from_documents(...)`. If it prints on every request, that's your problem.

- Verify persistence. If you expect a cached index but still see slow starts, confirm your storage directory exists and contains persisted state.

  ```python
  from pathlib import Path

  print(Path("./storage").exists())
  ```

- Isolate provider latency. Temporarily swap your LLM/embedding model for a local stub or mock. If latency disappears, the bottleneck is external API initialization or network calls.
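Those per-stage timing probes can be wrapped in a small reusable helper. In this sketch the `time.sleep` calls are stand-ins for the real LlamaIndex stages (`load_data`, `from_documents`), so the example runs without any dependencies.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, results: dict):
    """Record how long a pipeline stage takes under the given name."""
    start = time.perf_counter()
    yield
    results[stage] = time.perf_counter() - start

timings: dict[str, float] = {}

with timed("load_data", timings):
    time.sleep(0.02)   # stand-in for SimpleDirectoryReader(...).load_data()

with timed("from_documents", timings):
    time.sleep(0.05)   # stand-in for VectorStoreIndex.from_documents(...)

# The slowest stage is where to focus optimization effort.
slowest = max(timings, key=timings.get)
```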
Prevention
- Build indexes at startup and persist them with `StorageContext.persist(...)`.
- Configure `Settings.llm` and `Settings.embed_model` once in one module.
- Avoid putting `SimpleDirectoryReader(...).load_data()` or `VectorStoreIndex.from_documents(...)` inside request handlers.
- Treat `--reload` as a dev convenience, not a performance baseline.
If you want predictable startup behavior in LlamaIndex Python apps, make initialization explicit. Cold starts are almost always self-inflicted by lazy setup in the wrong place.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.