How to Fix 'cold start latency when scaling' in LlamaIndex (TypeScript)

By Cyprian Aarons · Updated 2026-04-22

Cold start latency when scaling usually means your LlamaIndex TypeScript service is paying initialization cost on the first request after a scale-up event. In practice, this shows up when a new pod, Lambda, or serverless instance has to load embeddings, connect to a vector store, build indexes, or instantiate an LLM client before it can answer traffic.

The problem is often reported alongside slow time-to-first-token, request timeouts, or logs showing RetrieverQueryEngine.query() taking several seconds on the first call after deployment. If you’re seeing this only when traffic increases or after idle periods, you’re not dealing with a query bug — you’re dealing with startup work happening on the hot path.

The Most Common Cause

The #1 cause is rebuilding heavy LlamaIndex objects per request instead of reusing them across requests.

That means creating OpenAIEmbedding, VectorStoreIndex, StorageContext, or even loading documents inside the handler. When the instance scales out, every new worker repeats that setup and your first query pays the full cost.

Broken vs fixed

Broken pattern → Fixed pattern

  • Build index inside the request handler → Build once at module startup or during app boot
  • Reconnect to vector DB every request → Reuse a singleton client/index
  • Load docs and embed on demand → Precompute and persist the index

// ❌ Broken: expensive setup runs on every request
import { Document, OpenAI, Settings, VectorStoreIndex } from "llamaindex";

export async function handler(req: Request) {
  // A new LLM client is configured on every request
  Settings.llm = new OpenAI({ model: "gpt-4o-mini" });

  const docs = [
    new Document({ text: await req.text() }),
  ];

  // fromDocuments parses, chunks, and embeds every document: the most
  // expensive work you can put inside a request handler
  const index = await VectorStoreIndex.fromDocuments(docs);

  const queryEngine = index.asQueryEngine();
  const response = await queryEngine.query({ query: "Summarize this" });

  return Response.json({ answer: response.toString() });
}
// ✅ Fixed: initialize once and reuse
import { OpenAI, Settings, VectorStoreIndex, storageContextFromDefaults } from "llamaindex";

// Module scope: runs once per instance, not once per request
Settings.llm = new OpenAI({ model: "gpt-4o-mini" });

let queryEnginePromise: Promise<ReturnType<VectorStoreIndex["asQueryEngine"]>> | null = null;

async function getQueryEngine() {
  if (!queryEnginePromise) {
    // Start initialization once; concurrent requests await the same promise
    queryEnginePromise = (async () => {
      const storageContext = await storageContextFromDefaults({
        persistDir: "./storage", // or wire up an external vector store here
      });

      const index = await VectorStoreIndex.init({ storageContext });

      return index.asQueryEngine();
    })();
  }

  return queryEnginePromise;
}

export async function handler(req: Request) {
  const queryEngine = await getQueryEngine();
  const response = await queryEngine.query({
    query: await req.text(),
  });

  return Response.json({ answer: response.toString() });
}

The key point: VectorStoreIndex.fromDocuments() is not something you want in a request path for production traffic. If you need ingestion, do it in a separate job and persist the result.

Other Possible Causes

1. Lazy embedding model initialization

If your embedding model is created inside the route, the first request after scale-up will block while credentials load and network connections establish.

// ❌
app.post("/query", async (req) => {
  // New client per request: credentials and connections set up on the hot path
  const embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });
  // ...
});
// ✅
import { OpenAIEmbedding } from "llamaindex";

// Created once at module load and reused by every request
const embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });

app.post("/query", async (req) => {
  // reuse embedModel
});
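
If your handlers rely on the global configuration rather than passing the model around, the same rule applies: assign it once at module scope. A minimal sketch, assuming a llamaindex version that exposes the global Settings object:

import { OpenAIEmbedding, Settings } from "llamaindex";

// Set once at module load; indexes and query engines created afterwards reuse
// this embedding client instead of constructing a new one per request.
Settings.embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });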

2. No persisted storage context

If you rely on in-memory indexes, every new instance starts empty and rebuilds state.

// ❌ ephemeral memory only: every new instance re-embeds from scratch
const index = await VectorStoreIndex.fromDocuments(docs);
// ✅ persist during ingestion and reload at startup
import { VectorStoreIndex, storageContextFromDefaults } from "llamaindex";

// ingestion: a persistDir-backed storage context writes the index to disk
const ingestContext = await storageContextFromDefaults({ persistDir: "./storage" });
await VectorStoreIndex.fromDocuments(docs, { storageContext: ingestContext });

// later, on startup: reload instead of re-embedding
const storageContext = await storageContextFromDefaults({
  persistDir: "./storage",
});
const index = await VectorStoreIndex.init({ storageContext });

3. Vector DB connection pooling is missing

A fresh connection for each request can look like “cold start latency” when scaling because every pod opens its own TCP/TLS session.

// ❌ client created per request
app.post("/query", async () => {
  const pineconeClient = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
});
// ✅ singleton client shared across requests
import { Pinecone } from "@pinecone-database/pinecone";

const pineconeClient = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
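
On the LlamaIndex side, the same singleton idea applies to the vector store and the index wrapper around it. A rough sketch, assuming the PineconeVectorStore integration and a PINECONE_INDEX environment variable for the index name; the exact import path and constructor options vary between llamaindex versions:

import { PineconeVectorStore, VectorStoreIndex } from "llamaindex";

// Built once at module scope; reads PINECONE_API_KEY from the environment
const vectorStore = new PineconeVectorStore({
  indexName: process.env.PINECONE_INDEX!,
});

// Wraps an already-populated external index, so nothing is parsed or embedded
// at boot; only the connection is established.
const indexPromise = VectorStoreIndex.fromVectorStore(vectorStore);

export async function queryHandler(req: Request) {
  const index = await indexPromise;
  const response = await index.asQueryEngine().query({ query: await req.text() });
  return Response.json({ answer: response.toString() });
}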

4. Large document ingestion happening synchronously

If you ingest PDFs or HTML during the user request, scaling amplifies the delay.

// ❌ parse + chunk + embed in the request path
const docs = await new SimpleDirectoryReader().loadData({ directoryPath: "./docs" });
await VectorStoreIndex.fromDocuments(docs);

Move that work to a background worker or cron job, then query the persisted index from your API.
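
A minimal sketch of such a job, assuming source documents live in a local ./docs directory and the index is persisted to the ./storage directory used earlier (import paths can differ slightly between llamaindex versions):

// ingest.ts: run as a cron job, CI step, or background worker, never in the API
import { SimpleDirectoryReader, VectorStoreIndex, storageContextFromDefaults } from "llamaindex";

async function main() {
  // Parse and chunk source documents outside the request path
  const docs = await new SimpleDirectoryReader().loadData({ directoryPath: "./docs" });

  // Embedding happens here, once, and the result is persisted to disk
  const storageContext = await storageContextFromDefaults({ persistDir: "./storage" });
  await VectorStoreIndex.fromDocuments(docs, { storageContext });

  console.log("Index persisted to ./storage");
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});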

How to Debug It

  1. Check where startup ends and request handling begins

    • Add timestamps around app boot, index creation, and first query (see the timing sketch after this list).
    • If VectorStoreIndex.fromDocuments() appears in request logs, that’s your problem.
  2. Look for repeated object creation

    • Search for new OpenAI(...), new OpenAIEmbedding(...), VectorStoreIndex.fromDocuments(...), and storageContextFromDefaults(...) inside handlers.
    • Anything inside a route is suspect unless it’s cheap and stateless.
  3. Inspect cold vs warm behavior

    • Hit the same endpoint twice.
    • If first request takes 8–20 seconds and second takes <500ms, you have cold-start work being done lazily.
  4. Trace external calls

    • Enable logging around vector DB queries and LLM calls.
    • A delay before the first RetrieverQueryEngine.query() usually means initialization, not retrieval.
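
A minimal way to capture those timestamps, reusing the getQueryEngine() helper from the fixed pattern above (the log labels are illustrative):

// At module scope: kick off initialization eagerly and time it
const bootStart = Date.now();
const warmup = getQueryEngine().then(() => {
  console.log(`init finished after ${Date.now() - bootStart}ms`);
});

export async function handler(req: Request) {
  const requestStart = Date.now();
  await warmup; // only the first request(s) after scale-up actually wait here
  console.log(`waited ${Date.now() - requestStart}ms for initialization`);

  const queryEngine = await getQueryEngine();
  const response = await queryEngine.query({ query: await req.text() });
  console.log(`request served in ${Date.now() - requestStart}ms`);

  return Response.json({ answer: response.toString() });
}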

Prevention

  • Initialize LlamaIndex clients and engines at module scope or application boot.
  • Persist indexes and storage context; do not rebuild embeddings on demand.
  • Keep ingestion out of user-facing requests. Use workers for document parsing, chunking, and embedding.
  • Add a warmup endpoint or startup probe if you deploy on Kubernetes or serverless platforms that scale from zero (a sketch follows this list).
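
For example, a warmup route that forces initialization before real traffic arrives, assuming the getQueryEngine() helper from the fixed pattern above; point your platform’s startup or readiness probe at it:

// Called by the orchestrator (e.g. a Kubernetes startupProbe or a scheduled ping)
// until it returns 200, so the first user request never pays initialization cost.
export async function warmupHandler(_req: Request): Promise<Response> {
  try {
    // Forces index loading and the vector DB connection; add a cheap test query
    // here if you also want the LLM and embedding clients warmed up.
    await getQueryEngine();
    return new Response("ok", { status: 200 });
  } catch (err) {
    return new Response(`not ready: ${String(err)}`, { status: 503 });
  }
}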

If you’re still seeing cold start latency when scaling after moving initialization out of the handler, check whether your platform is recycling instances too aggressively. In most TypeScript LlamaIndex deployments, though, the fix is simple: stop building your retrieval stack on the hot path.

