How to Fix 'cold start latency' in LlamaIndex (Python)
What “cold start latency” means in LlamaIndex
In LlamaIndex, cold start latency usually shows up when your first query is slow because the index, embeddings, vector store client, or model client is being initialized on demand. It’s most common in serverless apps, notebooks that rebuild state every run, or APIs that create a fresh StorageContext on each request.
You’ll typically see it as slow first-token time, long request durations, or logs around VectorStoreIndex, Settings.embed_model, StorageContext.from_defaults(), or load_index_from_storage().
The Most Common Cause
The #1 cause is rebuilding the index and embedding everything at request time.
If you call VectorStoreIndex.from_documents() inside your endpoint or handler, LlamaIndex has to load, chunk, embed, and persist your documents before it can answer. That's not a bug; that's your code doing expensive startup work on the critical path.
Broken vs fixed pattern
| Broken | Fixed |
|---|---|
| Build index per request | Build once, reuse across requests |
| Embed documents during query path | Precompute embeddings during ingestion |
| Create new client objects every call | Keep clients global or cached |
```python
# ❌ Broken: cold start happens on every request
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

def answer_query(query: str):
    docs = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(docs)  # expensive: loads and embeds everything
    query_engine = index.as_query_engine()
    return query_engine.query(query)
```
```python
# ✅ Fixed: build once, then reuse persisted storage
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"

def build_index_once():
    # Run this offline or at deploy time, not per request
    docs = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(docs)
    index.storage_context.persist(persist_dir=PERSIST_DIR)

def load_index():
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    return load_index_from_storage(storage_context)

# Module scope: runs once per process, not per request
index = load_index()
query_engine = index.as_query_engine()

def answer_query(query: str):
    return query_engine.query(query)
```
If you’re using FastAPI, do the load in startup code or dependency injection, not inside the route handler.
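One way to keep the load out of the route handler is a cached loader called from a FastAPI lifespan hook or a `Depends()` provider. A minimal sketch of the pattern (the stub fallback and `PERSIST_DIR` layout are assumptions so the sketch runs even without an index on disk):

```python
# Sketch: cache the loaded query engine so it is built at most once per
# process. Call get_query_engine() from a FastAPI lifespan handler or a
# Depends() provider; request handlers then reuse the cached object.
# The stub fallback exists only so this sketch runs without an index on
# disk; PERSIST_DIR mirrors the layout used above.
from functools import lru_cache

PERSIST_DIR = "./storage"

@lru_cache(maxsize=1)
def get_query_engine():
    try:
        from llama_index.core import StorageContext, load_index_from_storage
        ctx = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        return load_index_from_storage(ctx).as_query_engine()
    except Exception:
        class _StubEngine:  # placeholder when no index is available
            def query(self, question: str) -> str:
                return f"stub answer to: {question}"
        return _StubEngine()

engine = get_query_engine()
assert get_query_engine() is engine  # second call reuses the first load
```

The first call pays the load cost; every later call returns the same object instantly, which is exactly the behavior you want behind an API.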
Other Possible Causes
1) Embedding model initialization on the hot path
Some setups create a new embedding client per request. That adds network setup and auth overhead before any retrieval happens.
```python
# Bad: a new embedding client per request
from llama_index.embeddings.openai import OpenAIEmbedding

def get_embedding():
    return OpenAIEmbedding(model="text-embedding-3-small")
```

```python
# Good: configure the embedding model once, globally
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
```
2) Using a remote vector store with no warm connection
If you’re on Pinecone, Qdrant Cloud, Weaviate Cloud, or Postgres/pgvector over TLS, the first connection can be slow. Creating the client repeatedly makes it worse.
```python
# Bad: client recreated every call
def get_vector_store():
    from llama_index.vector_stores.qdrant import QdrantVectorStore
    from qdrant_client import QdrantClient

    # New client per call: repeated connection and auth setup
    # (the URL here is illustrative)
    return QdrantVectorStore(
        client=QdrantClient(url="http://localhost:6333"),
        collection_name="docs",
    )
```

```python
# Good: initialize once at process startup
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")  # illustrative URL
vector_store = QdrantVectorStore(client=client, collection_name="docs")
```
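The cost difference is easy to see with a toy simulation. `FakeClient` below is a stand-in for a real client such as `qdrant_client.QdrantClient`, and the counter stands in for the TLS handshake and auth work paid in the constructor:

```python
# Generic illustration of why client reuse matters. FakeClient is a
# stand-in for a real vector store client; the counter simulates the
# connection/auth setup paid once per construction.
class FakeClient:
    handshakes = 0

    def __init__(self):
        FakeClient.handshakes += 1  # pretend TLS handshake + auth happen here

# Per-request construction: pays setup on every call.
def get_store_per_request():
    return FakeClient()

# Module-scope construction: pays setup exactly once.
shared_client = FakeClient()

def get_store_shared():
    return shared_client

for _ in range(3):
    get_store_per_request()
    get_store_shared()

# 1 shared handshake + 3 per-request handshakes = 4 total
assert FakeClient.handshakes == 4
```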
3) Auto-loading documents from disk each query
A lot of people accidentally keep ingestion code next to query code. If you call SimpleDirectoryReader.load_data() in the request path, you pay filesystem I/O every time.
```python
# Bad: filesystem I/O on every query
from llama_index.core import SimpleDirectoryReader

docs = SimpleDirectoryReader("./data").load_data()
```
Move that into an offline ingestion job and persist the result.
4) Chunking and metadata extraction are too heavy
Large chunk overlap, recursive parsing, OCR, or metadata extractors can make initial indexing feel broken. The symptom still looks like latency even though the root issue is preprocessing cost.
```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=2048, chunk_overlap=512)  # heavy for large corpora
```
Try smaller chunks and fewer extractors first.
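As a back-of-envelope check (my own arithmetic, not a LlamaIndex API): with chunk size c and overlap o, each chunk advances by c − o new tokens, so overlapping chunks re-embed the corpus roughly c / (c − o) times over.

```python
# Rough cost model (own arithmetic, not a LlamaIndex API): overlapping
# chunks re-embed shared tokens, so a corpus is embedded roughly
# chunk_size / (chunk_size - overlap) times over.
def embed_multiplier(chunk_size: int, overlap: int) -> float:
    return chunk_size / (chunk_size - overlap)

heavy = embed_multiplier(2048, 512)  # ~1.33x: a third more embedding work
light = embed_multiplier(512, 64)    # ~1.14x
```

This only models overlap; per-chunk extractor and OCR costs scale with chunk count on top of it.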
How to Debug It
- Measure where time is spent. Add timing around ingestion, index load, embedding init, and query execution. If `from_documents()` takes seconds but `query()` is fast after that, you've found the issue.
- Check whether you are rebuilding state. Search your code for:
  - `VectorStoreIndex.from_documents()`
  - `SimpleDirectoryReader.load_data()`
  - `StorageContext.from_defaults()` without persistence reuse

  If these appear inside a route handler or job loop, that's likely the cause.
- Confirm persistence is actually working. If you call `persist()` but still see cold starts every launch, verify the directory exists and contains files like:
  - `docstore.json`
  - `index_store.json`
  - vector store files, depending on backend
- Look at your initialization logs. Real LlamaIndex stack traces often point at classes like:
  - `VectorStoreIndex`
  - `StorageContext`
  - `BaseEmbedding`
  - `load_index_from_storage()`

  If startup logs show embedding calls before the first query returns, you're initializing too late.
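For the timing step, a minimal harness like this works. The two stubs are placeholders only; swap them for your real `load_index()` and `query_engine.query()` calls:

```python
# Minimal timing harness; the stubs below are stand-ins for your real
# index load and query calls. If the load dominates and queries are
# fast afterwards, the cold start lives in initialization, not retrieval.
import time

def timed(label: str, fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result

# Stand-ins for illustration only:
def load_index_stub():
    time.sleep(0.05)  # pretend this is load_index_from_storage()
    return object()

def query_stub(index, question: str) -> str:
    return f"answer to: {question}"

index = timed("index load", load_index_stub)
answer = timed("query", query_stub, index, "what is cold start?")
```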
Prevention
- Build ingestion as a separate pipeline from serving. Persist indexes once; never rebuild them per request unless the data changed.
- Initialize shared objects at process startup. Keep `Settings.embed_model`, vector store clients, and loaded indexes in module scope or app startup hooks.
- Add a startup benchmark. Track cold start time separately from query latency so regressions are obvious before production users see them.
If you want this fixed properly in production, treat LlamaIndex as two systems: ingestion and serving. Cold start latency usually means they got mixed together.
By Cyprian Aarons, AI Consultant at Topiax.