How to Fix 'cold start latency' in LlamaIndex (TypeScript)
If you’re seeing cold start latency in a LlamaIndex TypeScript app, it usually means your first query is paying the full initialization cost: model client setup, index loading, embedding generation, or remote fetches. In practice, this shows up on the first request after deploy, after a serverless cold start, or when you recreate the index on every API call.
The fix is usually not “optimize the model.” It’s “stop rebuilding expensive objects per request.”
The Most Common Cause
The #1 cause is creating your Settings, OpenAI, VectorStoreIndex, or retriever inside the request handler. That forces LlamaIndex to initialize everything on every call, which makes the first request slow and triggers the exact symptom people describe as cold start latency.
Here’s the broken pattern:
```ts
// app/api/chat.ts
import { OpenAI, VectorStoreIndex } from "llamaindex";

export async function POST(req: Request) {
  const body = await req.json();

  // Recreated on every request: client setup is paid in the hot path
  const llm = new OpenAI({
    model: "gpt-4o-mini",
    apiKey: process.env.OPENAI_API_KEY!,
  });

  // Re-ingests and re-embeds the documents on every call
  const index = await VectorStoreIndex.fromDocuments(body.documents);
  const queryEngine = index.asQueryEngine();
  const response = await queryEngine.query({ query: body.question });

  return Response.json({ answer: response.toString() });
}
```
And here’s the fixed pattern:
```ts
// lib/llama.ts
import { OpenAI, Settings, VectorStoreIndex } from "llamaindex";

// Runs once per process, at module load
Settings.llm = new OpenAI({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_API_KEY!,
});

// Cache the promise so concurrent first requests share one build
let indexPromise: Promise<VectorStoreIndex> | null = null;

export function getIndex() {
  if (!indexPromise) {
    // Placeholder: load real documents or restore from storage here
    indexPromise = VectorStoreIndex.fromDocuments([]);
  }
  return indexPromise;
}
```
```ts
// app/api/chat.ts
import { getIndex } from "@/lib/llama";

export async function POST(req: Request) {
  const body = await req.json();

  // Reuses the process-wide index; nothing expensive is built here
  const index = await getIndex();
  const queryEngine = index.asQueryEngine();
  const response = await queryEngine.query({ query: body.question });

  return Response.json({ answer: response.toString() });
}
```
The important part is that OpenAI, Settings, and any expensive index construction happen once per process, not once per request. In serverless environments, you still get cold starts across instances, but you stop making them worse.
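One optional refinement, assuming you keep the getIndex() helper above: kick off the build at module load, so the first real request awaits a promise that is already in flight instead of starting the work itself. A minimal sketch:

```ts
// At the bottom of lib/llama.ts (optional eager warm-up).
// getIndex() caches the promise, so a request that arrives mid-build
// simply awaits the same in-flight work.
void getIndex().catch((err) => {
  console.error("index warm-up failed", err);
  indexPromise = null; // let the next request retry the build
});
```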
Other Possible Causes
Rebuilding embeddings on every query
If you call fromDocuments() or embed raw text during each request, you’re doing ingestion work in the hot path.
```ts
// bad: re-embeds every document inside the request path
const index = await VectorStoreIndex.fromDocuments(docs);
```
Instead, build once during ingestion and load later.
```ts
// good: restore a previously persisted index
import { storageContextFromDefaults, VectorStoreIndex } from "llamaindex";

const storageContext = await storageContextFromDefaults({ persistDir: "./storage" });
const index = await VectorStoreIndex.init({ storageContext });
```
Not persisting the index
If your app creates an in-memory index and never persists it, every deploy starts from zero.
In LlamaIndex.TS, building the index against a disk-backed storage context writes it to disk:

```ts
// During ingestion: persistDir makes the stores write under ./storage
const storageContext = await storageContextFromDefaults({ persistDir: "./storage" });
const index = await VectorStoreIndex.fromDocuments(documents, { storageContext });
```
Then restore it on startup:
```ts
const storageContext = await storageContextFromDefaults({ persistDir: "./storage" });
const index = await VectorStoreIndex.init({ storageContext });
```
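For completeness, here is a minimal ingestion sketch. SimpleDirectoryReader and the ./data folder are assumptions (stand-ins for your own document source), and depending on your llamaindex version the reader may live in a separate readers package:

```ts
// scripts/ingest.ts -- run once per deploy, before traffic arrives.
import {
  SimpleDirectoryReader,
  storageContextFromDefaults,
  VectorStoreIndex,
} from "llamaindex";

async function main() {
  // "./data" is an assumed location for your source documents
  const documents = await new SimpleDirectoryReader().loadData({
    directoryPath: "./data",
  });

  // Disk-backed storage context: index files land under ./storage
  const storageContext = await storageContextFromDefaults({
    persistDir: "./storage",
  });
  await VectorStoreIndex.fromDocuments(documents, { storageContext });

  console.log("index built and persisted to ./storage");
}

main().catch(console.error);
```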
Creating a new client for every call
A fresh OpenAI, Anthropic, or vector store client per request adds connection setup overhead.
```ts
// bad: a fresh client (and connection setup) on every invocation
import { OpenAI } from "llamaindex";

export async function handler() {
  const llm = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
  // ...
}
```
Use a module-level singleton instead:
```ts
// good: one client per process, reused across requests
import { OpenAI } from "llamaindex";

export const llm = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY!,
});
```
Serverless runtime constraints
If you’re on Vercel, AWS Lambda, or Cloudflare-style runtimes, startup time can dominate your p95. The code may be correct but still slow because the platform keeps freezing and recreating execution contexts.
A typical symptom is this sequence in logs:
- first request after idle is slow
- subsequent requests are fast
- no code changes fix it completely
In that case, pre-warm the function or move heavy initialization out of the request path.
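If your platform supports scheduled pings (a cron job, an uptime monitor), a tiny warm-up endpoint keeps instances hot. This is a sketch under the same assumptions as above: the route path is hypothetical, and getIndex() is the helper from lib/llama.ts.

```ts
// app/api/warm.ts -- hypothetical warm-up route; ping it on a schedule.
import { getIndex } from "@/lib/llama";

export async function GET() {
  const start = Date.now();
  await getIndex(); // resolves instantly once the cached promise settles
  return Response.json({ warmed: true, ms: Date.now() - start });
}
```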
How to Debug It
- Measure where time is spent. Add timestamps around initialization and query execution:

```ts
// Assumes getIndex() and a query engine from the earlier examples
console.time("init");
const index = await getIndex();
console.timeEnd("init");

console.time("query");
const result = await queryEngine.query({ query });
console.timeEnd("query");
```

- Check whether objects are recreated. Log once at module load and once inside the handler:

```ts
console.log("module loaded");

export async function POST() {
  console.log("handler executed");
}
```

If initialization logs appear on every request, you found the issue.

- Inspect persistence. If you expect a stored index but don't see files under your persist directory, you're rebuilding from scratch. Look for:
  - ./storage/docstore.json
  - ./storage/index_store.json
  - ./storage/vector_store.json
- Look for repeated embedding calls. If your logs show repeated calls to embedModel.getTextEmbedding() or similar methods during queries, ingestion is leaking into runtime.
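To keep these timings consistent, a small generic helper works; this is plain TypeScript, not a LlamaIndex API:

```ts
// Generic helper: time any async step and log its duration.
async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    console.log(`${label}: ${(performance.now() - start).toFixed(1)} ms`);
  }
}

// Usage:
// const index = await timed("init", () => getIndex());
// const result = await timed("query", () => queryEngine.query({ query }));
```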
Prevention
- Initialize Settings, model clients, and retrievers at module scope.
- Persist indexes during ingestion and load them at startup.
- Keep document embedding and indexing out of API handlers.
- In serverless apps, assume cold starts happen and design for fast bootstrap.
- Add timing logs around init vs. query so regressions show up immediately.
If you want a simple rule: anything that touches documents, embeddings, or storage should happen before traffic hits the endpoint. Anything inside the handler should be limited to reading input and running retrieval/query logic.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.