# How to Fix "Cold Start Latency in Production" in LlamaIndex (TypeScript)
If you’re seeing cold start latency in production with LlamaIndex in TypeScript, you’re usually not dealing with a single “bug.” You’re looking at an initialization problem: the first request is paying the full cost of loading models, building indexes, creating vector clients, or opening network connections.
This shows up most often in serverless APIs, edge functions, and containerized services that spin up frequently. The symptom is simple: first request is slow, later requests are fine.
## The Most Common Cause
The #1 cause is rebuilding your `StorageContext`, `VectorStoreIndex`, or query engine on every request instead of reusing them across invocations.
In LlamaIndex TypeScript, this often looks like creating a new `OpenAIEmbedding` or `PineconeVectorStore`, or calling `VectorStoreIndex.fromDocuments()`, inside the handler. That forces cold initialization every time.
### Wrong pattern vs. right pattern
| Wrong | Right |
|---|---|
| Build index inside request handler | Initialize once at module scope or reuse a cached promise |
| Create new vector client per request | Reuse singleton client |
| Load documents on every call | Prebuild index during startup/deploy |
```ts
// WRONG: pays the full cold start cost on every request
import {
  Settings,
  VectorStoreIndex,
  storageContextFromDefaults,
} from "llamaindex";
import { OpenAIEmbedding } from "@llamaindex/openai";
import { PineconeVectorStore } from "@llamaindex/pinecone";

export async function POST(req: Request) {
  const docs = await loadDocuments(); // expensive every time

  const vectorStore = new PineconeVectorStore({
    apiKey: process.env.PINECONE_API_KEY!,
    indexName: "support-prod",
  });

  Settings.embedModel = new OpenAIEmbedding({
    model: "text-embedding-3-small",
  });

  // Re-embeds and re-upserts the whole corpus on every single call
  const storageContext = await storageContextFromDefaults({ vectorStore });
  const index = await VectorStoreIndex.fromDocuments(docs, { storageContext });

  const engine = index.asQueryEngine();
  const result = await engine.query({ query: "What is our refund policy?" });
  return Response.json({ answer: result.toString() });
}
```
```ts
// RIGHT: initialize once at module scope and reuse across requests
import { Settings, VectorStoreIndex } from "llamaindex";
import { OpenAIEmbedding } from "@llamaindex/openai";
import { PineconeVectorStore } from "@llamaindex/pinecone";

// Created once per process, shared by every invocation
const vectorStore = new PineconeVectorStore({
  apiKey: process.env.PINECONE_API_KEY!,
  indexName: "support-prod",
});

Settings.embedModel = new OpenAIEmbedding({
  model: "text-embedding-3-small",
});

// Cache the promise so concurrent first requests share a single init
let indexPromise: Promise<VectorStoreIndex> | null = null;

function getIndex() {
  if (!indexPromise) {
    // Attach to the already-populated Pinecone index; nothing is rebuilt
    indexPromise = VectorStoreIndex.fromVectorStore(vectorStore);
  }
  return indexPromise;
}

export async function POST(req: Request) {
  const index = await getIndex();
  const engine = index.asQueryEngine();
  const result = await engine.query({ query: "What is our refund policy?" });
  return Response.json({ answer: result.toString() });
}
```
If you’re using Next.js route handlers, Lambda, or any runtime that reuses containers, this pattern sharply cuts the first-hit penalty. On serverless platforms with frequent cold starts, it also avoids reconnecting to Pinecone, Postgres, or Redis on every request within a warm container.
## Other Possible Causes

### 1) Embedding model initialization on the hot path

If you instantiate `OpenAIEmbedding` inside the request path, you pay the setup cost repeatedly.
```ts
// Bad
export async function POST() {
  const embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });
  // ...
}

// Good
const embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });
```
### 2) Loading documents from disk or S3 during each request

This becomes brutal when `SimpleDirectoryReader` or custom loaders run per request.
```ts
// Bad: reads the directory on every request
const docs = await new SimpleDirectoryReader().loadData("./data");

// Good: kick off loading once at module scope and await the cached promise
const docsPromise = new SimpleDirectoryReader().loadData("./data");
```
If the corpus changes rarely, build the index offline and ship only retrieval at runtime, as in the sketch below.
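As a sketch of that offline step (assuming the same Pinecone index as the pattern above; the `SimpleDirectoryReader` import path varies across llamaindex versions), a hypothetical `scripts/build-index.ts` run at deploy time could look like this:

```ts
// scripts/build-index.ts (hypothetical): embed and upsert once at deploy time
import {
  Settings,
  VectorStoreIndex,
  storageContextFromDefaults,
} from "llamaindex";
import { OpenAIEmbedding } from "@llamaindex/openai";
import { PineconeVectorStore } from "@llamaindex/pinecone";
import { SimpleDirectoryReader } from "@llamaindex/readers/directory";

Settings.embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });

async function main() {
  // Document loading and embedding happen here, not under user traffic
  const docs = await new SimpleDirectoryReader().loadData("./data");
  const vectorStore = new PineconeVectorStore({
    apiKey: process.env.PINECONE_API_KEY!,
    indexName: "support-prod",
  });
  const storageContext = await storageContextFromDefaults({ vectorStore });
  await VectorStoreIndex.fromDocuments(docs, { storageContext });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

At runtime, the handler then only needs to attach with `VectorStoreIndex.fromVectorStore(vectorStore)`, as shown earlier.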
### 3) Creating a fresh LLM client every time

A lot of teams create a new `OpenAI()` or other provider client inside handlers. That adds connection setup and can trigger extra DNS/TLS overhead.
```ts
// Bad
export async function POST() {
  const llm = new OpenAI({ model: "gpt-4o-mini" });
  // ...
}

// Good
const llm = new OpenAI({ model: "gpt-4o-mini" });
```
### 4) Using an external vector store without connection pooling

If your app hits PostgreSQL/pgvector or a remote vector DB without pooling or keep-alives, the first query will be slow and subsequent ones may still suffer.
```ts
// Example fix for pg-based stores: one shared pool for the process lifetime
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,
});
```
For managed stores like Pinecone, keep the client singleton alive for the process lifetime.
## How to Debug It
- **Measure where the time goes.** Add timing around document loading, embedding init, vector store init, and query execution (see the timing sketch after this list). If `loadData()` or `init()` takes most of the time, you found it.
- **Check whether initialization happens per request.** Search for `new VectorStoreIndex`, `fromDocuments`, `new OpenAIEmbedding`, and `new PineconeVectorStore` inside handlers. Anything inside `POST()`, `GET()`, or Lambda entrypoints is suspect.
- **Look for repeated cold-start logs.** In serverless logs, check whether each slow request starts with module bootstrap messages. If every invocation logs `Initializing StorageContext...`, `Loading documents...`, and `Building VectorStoreIndex...`, then you are rebuilding state too often.
- **Test module-scope caching.** Move all expensive setup outside the handler. If latency drops on the second call but not the first, your issue is startup cost rather than query logic.
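To put numbers on the first step, here is a minimal timing sketch. The `timed()` helper is hypothetical, and `getIndex()` refers to the cached-promise function from the pattern above:

```ts
// Hypothetical helper: logs how long each phase of the request takes
async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    console.log(`${label}: ${(performance.now() - start).toFixed(0)}ms`);
  }
}

export async function POST(req: Request) {
  // On a cold start, "index init" dominates; on warm hits it should be near 0ms
  const index = await timed("index init", () => getIndex());
  const engine = index.asQueryEngine();
  const result = await timed("query", () =>
    engine.query({ query: "What is our refund policy?" }),
  );
  return Response.json({ answer: result.toString() });
}
```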
## Prevention

- Build indexes offline whenever possible.
- Keep LlamaIndex clients and vector store connections at module scope.
- Cache promises for one-time initialization instead of recreating objects per request.
- Avoid loading documents dynamically unless you truly need live ingestion.
If you want predictable production latency with LlamaIndex TypeScript, treat index construction as deployment work, not request work. The runtime should query an already-warmed retrieval stack, not assemble one under user traffic.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.