How to Fix 'cold start latency' in LlamaIndex (Python)
What “cold start latency” means in LlamaIndex
In LlamaIndex, cold start latency usually shows up when your first query is slow because the index, embeddings, vector store client, or model client is being initialized on demand. It’s most common in serverless apps, notebooks that rebuild state every run, or APIs that create a fresh StorageContext on each request.
You’ll typically see it as slow first-token time, long request durations, or logs around VectorStoreIndex, Settings.embed_model, StorageContext.from_defaults(), or load_index_from_storage().
The Most Common Cause
The #1 cause is rebuilding the index and embedding everything at request time.
If you call VectorStoreIndex.from_documents() inside your endpoint or handler, LlamaIndex has to load, chunk, embed, and persist your documents before it can answer. That's not a bug; that's your code doing expensive startup work on the critical path.
Broken vs fixed pattern
| Broken | Fixed |
|---|---|
| Build index per request | Build once, reuse across requests |
| Embed documents during query path | Precompute embeddings during ingestion |
| Create new client objects every call | Keep clients global or cached |
```python
# ❌ Broken: cold start happens on every request
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

def answer_query(query: str):
    docs = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(docs)  # expensive: loads and embeds everything
    query_engine = index.as_query_engine()
    return query_engine.query(query)
```
```python
# ✅ Fixed: build once, then reuse persisted storage
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"

def build_index_once():
    # Run this offline or at deploy time, not per request
    docs = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(docs)
    index.storage_context.persist(persist_dir=PERSIST_DIR)

def load_index():
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    return load_index_from_storage(storage_context)

# Module scope: runs once per process, not per request
index = load_index()
query_engine = index.as_query_engine()

def answer_query(query: str):
    return query_engine.query(query)
```
If you’re using FastAPI, do the load in startup code or dependency injection, not inside the route handler.
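One way to keep the load out of the route handler is a cached loader called from a FastAPI lifespan hook or a `Depends()` provider. A minimal sketch of the pattern (the stub fallback and `PERSIST_DIR` layout are assumptions so the sketch runs even without an index on disk):

```python
# Sketch: cache the loaded query engine so it is built at most once per
# process. Call get_query_engine() from a FastAPI lifespan handler or a
# Depends() provider; request handlers then reuse the cached object.
# The stub fallback exists only so this sketch runs without an index on
# disk; PERSIST_DIR mirrors the layout used above.
from functools import lru_cache

PERSIST_DIR = "./storage"

@lru_cache(maxsize=1)
def get_query_engine():
    try:
        from llama_index.core import StorageContext, load_index_from_storage
        ctx = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        return load_index_from_storage(ctx).as_query_engine()
    except Exception:
        class _StubEngine:  # placeholder when no index is available
            def query(self, question: str) -> str:
                return f"stub answer to: {question}"
        return _StubEngine()

engine = get_query_engine()
assert get_query_engine() is engine  # second call reuses the first load
```

The first call pays the load cost; every later call returns the same object instantly, which is exactly the behavior you want behind an API.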
Other Possible Causes
1) Embedding model initialization on the hot path
Some setups create a new embedding client per request. That adds network setup and auth overhead before any retrieval happens.
```python
# Bad: a new embedding client per request
from llama_index.embeddings.openai import OpenAIEmbedding

def get_embedding():
    return OpenAIEmbedding(model="text-embedding-3-small")
```

```python
# Good: configure the embedding model once, globally
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
```
2) Using a remote vector store with no warm connection
If you’re on Pinecone, Qdrant Cloud, Weaviate Cloud, or Postgres/pgvector over TLS, the first connection can be slow. Creating the client repeatedly makes it worse.
```python
# Bad: client recreated every call
def get_vector_store():
    from llama_index.vector_stores.qdrant import QdrantVectorStore
    from qdrant_client import QdrantClient

    # New client per call: repeated connection and auth setup
    # (the URL here is illustrative)
    return QdrantVectorStore(
        client=QdrantClient(url="http://localhost:6333"),
        collection_name="docs",
    )
```

```python
# Good: initialize once at process startup
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")  # illustrative URL
vector_store = QdrantVectorStore(client=client, collection_name="docs")
```
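The cost difference is easy to see with a toy simulation. `FakeClient` below is a stand-in for a real client such as `qdrant_client.QdrantClient`, and the counter stands in for the TLS handshake and auth work paid in the constructor:

```python
# Generic illustration of why client reuse matters. FakeClient is a
# stand-in for a real vector store client; the counter simulates the
# connection/auth setup paid once per construction.
class FakeClient:
    handshakes = 0

    def __init__(self):
        FakeClient.handshakes += 1  # pretend TLS handshake + auth happen here

# Per-request construction: pays setup on every call.
def get_store_per_request():
    return FakeClient()

# Module-scope construction: pays setup exactly once.
shared_client = FakeClient()

def get_store_shared():
    return shared_client

for _ in range(3):
    get_store_per_request()
    get_store_shared()

# 1 shared handshake + 3 per-request handshakes = 4 total
assert FakeClient.handshakes == 4
```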
3) Auto-loading documents from disk each query
A lot of people accidentally keep ingestion code next to query code. If you call SimpleDirectoryReader.load_data() in the request path, you pay filesystem I/O every time.
```python
# Bad: filesystem I/O on every query
from llama_index.core import SimpleDirectoryReader

docs = SimpleDirectoryReader("./data").load_data()
```
Move that into an offline ingestion job and persist the result.
4) Chunking and metadata extraction are too heavy
Large chunk overlap, recursive parsing, OCR, or metadata extractors can make initial indexing feel broken. The symptom still looks like latency even though the root issue is preprocessing cost.
```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=2048, chunk_overlap=512)  # heavy for large corpora
```
Try smaller chunks and fewer extractors first.
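As a back-of-envelope check (my own arithmetic, not a LlamaIndex API): with chunk size c and overlap o, each chunk advances by c − o new tokens, so overlapping chunks re-embed the corpus roughly c / (c − o) times over.

```python
# Rough cost model (own arithmetic, not a LlamaIndex API): overlapping
# chunks re-embed shared tokens, so a corpus is embedded roughly
# chunk_size / (chunk_size - overlap) times over.
def embed_multiplier(chunk_size: int, overlap: int) -> float:
    return chunk_size / (chunk_size - overlap)

heavy = embed_multiplier(2048, 512)  # ~1.33x: a third more embedding work
light = embed_multiplier(512, 64)    # ~1.14x
```

This only models overlap; per-chunk extractor and OCR costs scale with chunk count on top of it.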
How to Debug It
- Measure where time is spent. Add timing around ingestion, index load, embedding init, and query execution. If `from_documents()` takes seconds but `query()` is fast after that, you've found the issue.
- Check whether you are rebuilding state. Search your code for:
  - `VectorStoreIndex.from_documents()`
  - `SimpleDirectoryReader.load_data()`
  - `StorageContext.from_defaults()` without persistence reuse

  If these appear inside a route handler or job loop, that's likely the cause.
- Confirm persistence is actually working. If you call `persist()` but still see cold starts every launch, verify the directory exists and contains files like:
  - `docstore.json`
  - `index_store.json`
  - vector store files, depending on backend
- Look at your initialization logs. Real LlamaIndex stack traces often point at classes like:
  - `VectorStoreIndex`
  - `StorageContext`
  - `BaseEmbedding`
  - `load_index_from_storage()`

  If startup logs show embedding calls before the first query returns, you're initializing too late.
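For the timing step, a minimal harness like this works. The two stubs are placeholders only; swap them for your real `load_index()` and `query_engine.query()` calls:

```python
# Minimal timing harness; the stubs below are stand-ins for your real
# index load and query calls. If the load dominates and queries are
# fast afterwards, the cold start lives in initialization, not retrieval.
import time

def timed(label: str, fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result

# Stand-ins for illustration only:
def load_index_stub():
    time.sleep(0.05)  # pretend this is load_index_from_storage()
    return object()

def query_stub(index, question: str) -> str:
    return f"answer to: {question}"

index = timed("index load", load_index_stub)
answer = timed("query", query_stub, index, "what is cold start?")
```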
Prevention
- Build ingestion as a separate pipeline from serving. Persist indexes once; never rebuild them per request unless the data changed.
- Initialize shared objects at process startup. Keep `Settings.embed_model`, vector store clients, and loaded indexes in module scope or app startup hooks.
- Add a startup benchmark. Track cold start time separately from query latency so regressions are obvious before production users see them.
If you want this fixed properly in production, treat LlamaIndex as two systems: ingestion and serving. Cold start latency usually means they got mixed together.
By Cyprian Aarons, AI Consultant at Topiax.