How to Fix 'cold start latency during development' in LlamaIndex (TypeScript)

By Cyprian Aarons · Updated 2026-04-22
Tags: cold-start-latency-during-development, llamaindex, typescript

When you see cold start latency during development in a LlamaIndex TypeScript app, it usually means your first request is doing too much work at runtime. In practice, this shows up when you initialize models, load documents, build indexes, or create the query engine inside the request path instead of once at startup.

The result is slow first-token response, timeouts in local dev, and sometimes repeated re-initialization on every hot reload. In LlamaIndex TS, the usual suspects are VectorStoreIndex.fromDocuments(), OpenAIEmbedding, OpenAI, and index.asQueryEngine() being created inside handlers.

The Most Common Cause

The #1 cause is rebuilding the index on every request.

That means your app is paying the embedding + indexing cost repeatedly. In TypeScript, this often happens in an Express route, Next.js API route, or serverless handler.

Broken pattern                               | Fixed pattern
Build index inside request handler           | Build once at startup or cache it
Recreate OpenAI / OpenAIEmbedding each call  | Reuse singleton instances
Call fromDocuments() per request             | Persist or memoize the index

Broken code

import express from "express";
import { Document } from "llamaindex";
import { OpenAI, OpenAIEmbedding } from "@llamaindex/openai";
import { VectorStoreIndex } from "llamaindex";

const app = express();

app.get("/ask", async (req, res) => {
  const llm = new OpenAI({ model: "gpt-4o-mini" });
  const embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });

  const docs = [
    new Document({ text: "ACME insurance policy terms..." }),
  ];

  // fromDocuments() re-embeds and re-indexes the documents on every call
  const index = await VectorStoreIndex.fromDocuments(docs, {
    embedModel,
  });

  const queryEngine = index.asQueryEngine({ llm });
  const response = await queryEngine.query({
    query: "What does the policy cover?",
  });

  res.json({ answer: response.toString() });
});

app.listen(3000);

Fixed code

import express from "express";
import { Document } from "llamaindex";
import { OpenAI, OpenAIEmbedding } from "@llamaindex/openai";
import { VectorStoreIndex } from "llamaindex";

const app = express();

// Created once at module load and reused by every request
const llm = new OpenAI({ model: "gpt-4o-mini" });
const embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });

let indexPromise: Promise<VectorStoreIndex> | null = null;

// Memoize the index build so concurrent requests share a single in-flight promise
async function getIndex() {
  if (!indexPromise) {
    const docs = [
      new Document({ text: "ACME insurance policy terms..." }),
    ];

    indexPromise = VectorStoreIndex.fromDocuments(docs, {
      embedModel,
    });
  }

  return indexPromise;
}

app.get("/ask", async (req, res) => {
  const index = await getIndex();
  const queryEngine = index.asQueryEngine({ llm });

  const response = await queryEngine.query({
    query: "What does the policy cover?",
  });

  res.json({ answer: response.toString() });
});

app.listen(3000);

This fixes the cold-start spike because the expensive part runs once. If you need to refresh content, invalidate the cached promise explicitly instead of rebuilding per request.
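
If you do need a refresh path, here is a small sketch that reuses the indexPromise cache and getIndex() helper above (the /refresh route is purely illustrative):

// Drop the cached promise so the next call to getIndex() rebuilds the index once.
function invalidateIndex() {
  indexPromise = null;
}

// Illustrative refresh endpoint: rebuild explicitly instead of on every request.
app.post("/refresh", async (_req, res) => {
  invalidateIndex();
  await getIndex(); // optional: rebuild eagerly so the next /ask request is warm
  res.json({ ok: true });
});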

Other Possible Causes

1) Hot reload is recreating everything on file change

In Next.js dev mode or Vite-based servers, module reload can rerun initialization code. If your singleton lives at module scope but gets re-executed on every edit, you still pay the startup cost repeatedly.

// bad: runs again on every reload
export const llm = new OpenAI({ model: "gpt-4o-mini" });
export const embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });

Use a global cache in dev:

const g = globalThis as typeof globalThis & {
  llm?: OpenAI;
};

export const llm =
  g.llm ?? (g.llm = new OpenAI({ model: "gpt-4o-mini" }));
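
The same trick extends to the embedding model and the memoized index promise. A sketch, assuming the getIndex()-style cache from the fixed code above:

import { Document, VectorStoreIndex } from "llamaindex";
import { OpenAI, OpenAIEmbedding } from "@llamaindex/openai";

// Stash every expensive singleton on globalThis so dev-server reloads reuse them.
const g = globalThis as typeof globalThis & {
  llm?: OpenAI;
  embedModel?: OpenAIEmbedding;
  indexPromise?: Promise<VectorStoreIndex>;
};

export const llm = g.llm ?? (g.llm = new OpenAI({ model: "gpt-4o-mini" }));
export const embedModel =
  g.embedModel ??
  (g.embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" }));

export async function getIndex() {
  if (!g.indexPromise) {
    g.indexPromise = VectorStoreIndex.fromDocuments(
      [new Document({ text: "ACME insurance policy terms..." })],
      { embedModel },
    );
  }
  return g.indexPromise;
}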

2) You are loading documents synchronously at startup

If you read large files or fetch remote data before serving requests, the app looks like it has “cold start latency” even though the real issue is ingestion work that belongs in a separate step.

// bad
const docs = await loadAllPoliciesFromS3();
const index = await VectorStoreIndex.fromDocuments(docs);

Move ingestion to a separate build step or background job:

// better: load a prebuilt, persisted index instead of ingesting at startup
// (storageContextFromDefaults + VectorStoreIndex.init is the LlamaIndex.TS
// equivalent of Python's StorageContext.from_defaults / load_index_from_storage)
const storageContext = await storageContextFromDefaults({
  persistDir: "./storage",
});
const index = await VectorStoreIndex.init({ storageContext });
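
The matching build step can be a one-off script run via npm rather than at server startup. A sketch; the API names follow recent LlamaIndex.TS releases, so check them against your installed version:

// scripts/build-index.ts - run once, outside the request path
import { Document, VectorStoreIndex, storageContextFromDefaults } from "llamaindex";

const storageContext = await storageContextFromDefaults({
  persistDir: "./storage",
});

const docs = [new Document({ text: "ACME insurance policy terms..." })];

// Building with a persisting storage context writes the index to ./storage.
await VectorStoreIndex.fromDocuments(docs, { storageContext });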

3) Embedding/model clients are misconfigured and retrying

A bad API key, rate limit issue, or network timeout can look like latency. LlamaIndex will often surface this through underlying SDK errors such as 429 Too Many Requests, ETIMEDOUT, or fetch failures during embedding.

const embedModel = new OpenAIEmbedding({
  model: "text-embedding-3-small",
  apiKey: process.env.OPENAI_API_KEY,
});

Check that the key exists and that your local environment can reach the provider. In dev containers and WSL setups, DNS and proxy issues are common.
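
A cheap guard is to fail fast at startup when the key is missing, instead of letting the first embedding call hang on retries:

// Fail fast at startup rather than timing out on the first embedding call.
if (!process.env.OPENAI_API_KEY) {
  throw new Error("OPENAI_API_KEY is not set - check your .env file or shell environment");
}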

4) Your chunking settings produce too many chunks

If your splitter creates too many chunks, indexing slows down quickly. That’s especially visible with large PDFs and legal documents.

// bad: tiny chunks multiply the number of embedding calls
const splitter = new SentenceSplitter({
  chunkSize: 128,
  chunkOverlap: 64,
});

For development, use larger chunks while testing retrieval behavior:

// better for dev: fewer, larger chunks mean fewer embedding calls
const splitter = new SentenceSplitter({
  chunkSize: 512,
  chunkOverlap: 50,
});

Smaller chunks increase embedding calls and memory pressure. That makes cold start worse even when everything else is correct.
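
To see how many chunks a setting produces before paying for embeddings, you can run the splitter on its own. A quick check, where longPolicyText is a placeholder for your real document text:

import { SentenceSplitter } from "llamaindex";

const longPolicyText = "ACME insurance policy terms..."; // placeholder

const splitter = new SentenceSplitter({ chunkSize: 512, chunkOverlap: 50 });
const chunks = splitter.splitText(longPolicyText);

// Each chunk becomes at least one embedding call during indexing.
console.log(`chunk count: ${chunks.length}`);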

How to Debug It

  1. Time each stage separately

    • Log timestamps around document loading, embedding, indexing, and querying (see the timing sketch after this list).
    • If fromDocuments() dominates, you found the issue.
  2. Check whether initialization runs more than once

    • Add a log at module scope and inside handlers.
    • If you see repeated logs in dev mode, hot reload is recreating state.
  3. Inspect which class is slowing down

    • Look for delays around OpenAIEmbedding, VectorStoreIndex, or asQueryEngine().
    • If requests hang before any query executes, it’s usually indexing or client setup.
  4. Run with a persisted index

    • Save the index to disk once.
    • Restart the app and load from storage.
    • If latency disappears, your problem was rebuild-on-startup.
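
A minimal timing harness for step 1, reusing the llm and embedModel singletons from the fixed code above (the console.time labels are just for illustration):

// Time each stage so you can see which one dominates the first request.
console.time("load-docs");
const docs = [new Document({ text: "ACME insurance policy terms..." })];
console.timeEnd("load-docs");

console.time("build-index");
const index = await VectorStoreIndex.fromDocuments(docs, { embedModel });
console.timeEnd("build-index");

console.time("query");
const response = await index.asQueryEngine({ llm }).query({
  query: "What does the policy cover?",
});
console.timeEnd("query");

console.log(response.toString());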

Prevention

  • Keep embeddings and indexes out of request handlers.
  • Persist indexes for anything beyond toy data.
  • Use singleton clients for OpenAI and OpenAIEmbedding, especially in dev servers with hot reload.
  • Add timing logs around ingestion so you catch regressions before they hit production.

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

