# How to Fix "Cold Start Latency in Production" in LlamaIndex (TypeScript)
If you’re seeing cold start latency in production with LlamaIndex in TypeScript, you’re usually not dealing with a single “bug.” You’re looking at an initialization problem: the first request is paying the full cost of loading models, building indexes, creating vector clients, or opening network connections.
This shows up most often in serverless APIs, edge functions, and containerized services that spin up frequently. The symptom is simple: first request is slow, later requests are fine.
## The Most Common Cause
The #1 cause is rebuilding your `StorageContext`, `VectorStoreIndex`, or query engine on every request instead of reusing them across invocations.
In LlamaIndex TypeScript, this often looks like creating a new `OpenAIEmbedding` or `PineconeVectorStore`, or calling `VectorStoreIndex.fromDocuments()`, inside the handler. That forces cold initialization every time.
### Wrong pattern vs. right pattern
| Wrong | Right |
|---|---|
| Build index inside request handler | Initialize once at module scope or reuse a cached promise |
| Create new vector client per request | Reuse singleton client |
| Load documents on every call | Prebuild index during startup/deploy |
```ts
// WRONG: pays the full cold start cost on every request
import {
  Settings,
  VectorStoreIndex,
  storageContextFromDefaults,
} from "llamaindex";
import { OpenAIEmbedding } from "@llamaindex/openai";
import { PineconeVectorStore } from "@llamaindex/pinecone";

export async function POST(req: Request) {
  const docs = await loadDocuments(); // expensive every time

  const vectorStore = new PineconeVectorStore({
    apiKey: process.env.PINECONE_API_KEY!,
    indexName: "support-prod",
  });

  Settings.embedModel = new OpenAIEmbedding({
    model: "text-embedding-3-small",
  });

  // Re-embeds and re-upserts the whole corpus on every single call
  const storageContext = await storageContextFromDefaults({ vectorStore });
  const index = await VectorStoreIndex.fromDocuments(docs, { storageContext });

  const engine = index.asQueryEngine();
  const result = await engine.query({ query: "What is our refund policy?" });
  return Response.json({ answer: result.toString() });
}
```
```ts
// RIGHT: initialize once at module scope and reuse across requests
import { Settings, VectorStoreIndex } from "llamaindex";
import { OpenAIEmbedding } from "@llamaindex/openai";
import { PineconeVectorStore } from "@llamaindex/pinecone";

// Created once per process, shared by every invocation
const vectorStore = new PineconeVectorStore({
  apiKey: process.env.PINECONE_API_KEY!,
  indexName: "support-prod",
});

Settings.embedModel = new OpenAIEmbedding({
  model: "text-embedding-3-small",
});

// Cache the promise so concurrent first requests share a single init
let indexPromise: Promise<VectorStoreIndex> | null = null;

function getIndex() {
  if (!indexPromise) {
    // Attach to the already-populated Pinecone index; nothing is rebuilt
    indexPromise = VectorStoreIndex.fromVectorStore(vectorStore);
  }
  return indexPromise;
}

export async function POST(req: Request) {
  const index = await getIndex();
  const engine = index.asQueryEngine();
  const result = await engine.query({ query: "What is our refund policy?" });
  return Response.json({ answer: result.toString() });
}
```
If you’re using Next.js route handlers, Lambda, or any runtime that reuses containers, this pattern sharply cuts the first-hit penalty. On serverless platforms with frequent cold starts, it also avoids reconnecting to Pinecone, Postgres, or Redis on every request within a warm container.
## Other Possible Causes

### 1) Embedding model initialization on the hot path

If you instantiate `OpenAIEmbedding` inside the request path, you pay the setup cost repeatedly.
```ts
// Bad
export async function POST() {
  const embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });
  // ...
}

// Good
const embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });
```
### 2) Loading documents from disk or S3 during each request

This becomes brutal when `SimpleDirectoryReader` or custom loaders run per request.
```ts
// Bad: reads the directory on every request
const docs = await new SimpleDirectoryReader().loadData("./data");

// Good: kick off loading once at module scope and await the cached promise
const docsPromise = new SimpleDirectoryReader().loadData("./data");
```
If the corpus changes rarely, build the index offline and ship only retrieval at runtime, as in the sketch below.
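As a sketch of that offline step (assuming the same Pinecone index as the pattern above; the `SimpleDirectoryReader` import path varies across llamaindex versions), a hypothetical `scripts/build-index.ts` run at deploy time could look like this:

```ts
// scripts/build-index.ts (hypothetical): embed and upsert once at deploy time
import {
  Settings,
  VectorStoreIndex,
  storageContextFromDefaults,
} from "llamaindex";
import { OpenAIEmbedding } from "@llamaindex/openai";
import { PineconeVectorStore } from "@llamaindex/pinecone";
import { SimpleDirectoryReader } from "@llamaindex/readers/directory";

Settings.embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });

async function main() {
  // Document loading and embedding happen here, not under user traffic
  const docs = await new SimpleDirectoryReader().loadData("./data");
  const vectorStore = new PineconeVectorStore({
    apiKey: process.env.PINECONE_API_KEY!,
    indexName: "support-prod",
  });
  const storageContext = await storageContextFromDefaults({ vectorStore });
  await VectorStoreIndex.fromDocuments(docs, { storageContext });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

At runtime, the handler then only needs to attach with `VectorStoreIndex.fromVectorStore(vectorStore)`, as shown earlier.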
### 3) Creating a fresh LLM client every time

A lot of teams create a new `OpenAI()` or other provider client inside handlers. That adds connection setup and can trigger extra DNS/TLS overhead.
```ts
// Bad
export async function POST() {
  const llm = new OpenAI({ model: "gpt-4o-mini" });
  // ...
}

// Good
const llm = new OpenAI({ model: "gpt-4o-mini" });
```
### 4) Using an external vector store without connection pooling

If your app hits PostgreSQL/pgvector or a remote vector DB without pooling or keep-alives, the first query will be slow and subsequent ones may still suffer.
```ts
// Example fix for pg-based stores: one shared pool for the process lifetime
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,
});
```
For managed stores like Pinecone, keep the client singleton alive for the process lifetime.
## How to Debug It
- **Measure where the time goes.** Add timing around document loading, embedding init, vector store init, and query execution (see the timing sketch after this list). If `loadData()` or `init()` takes most of the time, you found it.
- **Check whether initialization happens per request.** Search for `new VectorStoreIndex`, `fromDocuments`, `new OpenAIEmbedding`, and `new PineconeVectorStore` inside handlers. Anything inside `POST()`, `GET()`, or Lambda entrypoints is suspect.
- **Look for repeated cold-start logs.** In serverless logs, check whether each slow request starts with module bootstrap messages. If every invocation logs `Initializing StorageContext...`, `Loading documents...`, and `Building VectorStoreIndex...`, then you are rebuilding state too often.
- **Test module-scope caching.** Move all expensive setup outside the handler. If latency drops on the second call but not the first, your issue is startup cost rather than query logic.
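To put numbers on the first step, here is a minimal timing sketch. The `timed()` helper is hypothetical, and `getIndex()` refers to the cached-promise function from the pattern above:

```ts
// Hypothetical helper: logs how long each phase of the request takes
async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    console.log(`${label}: ${(performance.now() - start).toFixed(0)}ms`);
  }
}

export async function POST(req: Request) {
  // On a cold start, "index init" dominates; on warm hits it should be near 0ms
  const index = await timed("index init", () => getIndex());
  const engine = index.asQueryEngine();
  const result = await timed("query", () =>
    engine.query({ query: "What is our refund policy?" }),
  );
  return Response.json({ answer: result.toString() });
}
```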
## Prevention

- Build indexes offline whenever possible.
- Keep LlamaIndex clients and vector store connections at module scope.
- Cache promises for one-time initialization instead of recreating objects per request.
- Avoid loading documents dynamically unless you truly need live ingestion.
If you want predictable production latency with LlamaIndex TypeScript, treat index construction as deployment work, not request work. The runtime should query an already-warmed retrieval stack, not assemble one under user traffic.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.