How to Fix 'cold start latency in production' in LangChain (TypeScript)

By Cyprian Aarons · Updated 2026-04-21

When people say they’re seeing “cold start latency in production” with LangChain TypeScript, they usually mean the first request after deploy or idle time is much slower than the rest. In practice, this shows up as slow first-token time, long Lambda/Cloud Run startup, or a chain that feels fine locally but stalls under real traffic.

The root cause is usually not LangChain itself. It’s almost always how the app initializes models, embeddings, vector stores, or serverless containers.

The Most Common Cause

The #1 cause is creating LLMs, embeddings, retrievers, or vector stores inside the request path instead of initializing them once and reusing them.

This is especially bad in serverless environments because every cold boot repeats expensive setup: SDK auth, network handshakes, model client creation, and vector index loading.

Broken vs fixed pattern

  • Broken: initializes ChatOpenAI and MemoryVectorStore per request. Fixed: initializes them once at module scope and reuses them.
  • Broken: rebuilds embeddings on every call. Fixed: reuses the embeddings client.
  • Broken: the first request hits RunnableSequence, ChatOpenAI, and retriever setup all at once. Fixed: keeps the request path thin.
// ❌ Broken: everything happens inside the handler
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { RunnableSequence } from "@langchain/core/runnables";

export async function POST(req: Request) {
  const body = await req.json();

  const llm = new ChatOpenAI({
    model: "gpt-4o-mini",
    temperature: 0,
  });

  const embeddings = new OpenAIEmbeddings();
  const store = await MemoryVectorStore.fromTexts(
    ["policy A", "policy B"],
    [{ id: 1 }, { id: 2 }],
    embeddings
  );

  const chain = RunnableSequence.from([
    async (input: string) => {
      const docs = await store.similaritySearch(input, 2);
      const context = docs.map((d) => d.pageContent).join("\n");
      return `Answer using this context:\n${context}\n\nQuestion: ${input}`;
    },
    llm,
  ]);

  const result = await chain.invoke(body.question);
  return Response.json({ result });
}
// ✅ Fixed: initialize once and reuse across requests
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { RunnableSequence } from "@langchain/core/runnables";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const embeddings = new OpenAIEmbeddings();

const storePromise = MemoryVectorStore.fromTexts(
  ["policy A", "policy B"],
  [{ id: 1 }, { id: 2 }],
  embeddings
);

export async function POST(req: Request) {
  const body = await req.json();
  const store = await storePromise;

  const chain = RunnableSequence.from([
    async (input: string) => {
      const docs = await store.similaritySearch(input, 2);
      const context = docs.map((d) => d.pageContent).join("\n");
      return `Answer using this context:\n${context}\n\nQuestion: ${input}`;
    },
    llm,
  ]);

  const result = await chain.invoke(body.question);
  return Response.json({ result });
}

If you’re on AWS Lambda, Vercel Functions, or Cloud Run, this difference is huge. Module-scope initialization lets warm invocations skip repeated setup.
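
The same idea carries over to a bare Lambda handler. Here is a minimal sketch, assuming a scheduled ping keeps containers warm; the warmer flag and event shape are illustrative assumptions, not a Lambda or LangChain convention.

// Clients live at module scope; warm invocations reuse them.
// The `warmer` flag is an assumed event field set by a scheduled ping (e.g. an EventBridge rule).
import { ChatOpenAI } from "@langchain/openai";

// Constructed once per container; warm invocations skip this entirely
const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

export const handler = async (event: { warmer?: boolean; question?: string }) => {
  // Scheduled warm-up ping: return immediately without touching the model
  if (event.warmer) {
    return { statusCode: 204 };
  }

  const result = await llm.invoke(event.question ?? "");
  return { statusCode: 200, body: JSON.stringify({ answer: result.content }) };
};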

Other Possible Causes

1. Expensive prompt assembly on every request

If you’re reading large templates from disk or generating prompts dynamically, that work adds up.

// ❌ Reads and builds inside the handler on every request
export async function POST(req: Request) {
  const template = await fs.promises.readFile("prompt.txt", "utf8");
  const prompt = PromptTemplate.fromTemplate(template);
  // ...
}
// ✅ Load once at module scope and reuse
const template = fs.readFileSync("prompt.txt", "utf8");
const prompt = PromptTemplate.fromTemplate(template);

The fix is boring but effective: load static assets at startup.

2. Cold vector store / retriever hydration

A common LangChain stack uses VectorStoreRetriever. If you rebuild the index or reload documents on each invocation, the first call will crawl.

// ❌ Hydrates documents every time
const docs = await loadDocs();
const store = await Chroma.fromDocuments(docs, embeddings);
const retriever = store.asRetriever();
// ✅ Build once during boot
const retrieverPromise = (async () => {
  const docs = await loadDocs();
  const store = await Chroma.fromDocuments(docs, embeddings);
  return store.asRetriever();
})();

If your production data changes often, rebuild out of band and swap the index atomically.
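
In a long-running server process, that swap can be as simple as the sketch below. It reuses the article's loadDocs() placeholder, and the rebuild interval is an arbitrary example value; the point is that the live retriever reference changes in a single assignment, so in-flight requests never see a half-built index.

// Build once at boot, then rebuild out of band and swap the reference atomically.
// loadDocs() is the same placeholder loader as above; the interval is illustrative.
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings();

async function buildRetriever() {
  const docs = await loadDocs();
  const store = await MemoryVectorStore.fromDocuments(docs, embeddings);
  return store.asRetriever();
}

let retriever = await buildRetriever(); // first build happens during boot, not per request

setInterval(async () => {
  try {
    // Requests keep using the old index until the new one is fully built
    retriever = await buildRetriever();
  } catch (err) {
    console.error("index rebuild failed; keeping the previous index", err);
  }
}, 10 * 60 * 1000);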

3. Network handshakes with remote model providers

ChatOpenAI, Anthropic clients, and other provider SDKs can add noticeable latency on first use if you create them lazily inside handlers.

// ❌ Client created only when traffic arrives
export async function handler() {
  const llm = new ChatOpenAI({ model: "gpt-4o-mini" });
}
// ✅ Create client at module load
const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

Also check whether your runtime has outbound DNS delays or VPC egress issues. Those look like LangChain slowness but aren’t.

4. Using streaming without warming the path

If your first token is delayed but the rest flows normally, your issue may be stream setup rather than generation time.

const stream = await llm.stream(messages); // first token delayed by upstream setup

Make sure you’re measuring:

  • time to handler start
  • time to invoke
  • time to first token
  • total completion time
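
A rough sketch that captures the last three of those numbers around a streaming call, assuming a module-scope llm like the one created earlier (the log field names are illustrative):

// Rough timing around a streaming call; `llm` is the module-scope ChatOpenAI from above.
export async function timedStream(question: string) {
  const handlerStart = Date.now();

  const stream = await llm.stream(question); // time to invoke
  const streamReady = Date.now();

  let firstTokenAt: number | null = null;
  let text = "";

  for await (const chunk of stream) {
    if (firstTokenAt === null) firstTokenAt = Date.now(); // time to first token
    text += typeof chunk.content === "string" ? chunk.content : "";
  }

  console.log({
    setupMs: streamReady - handlerStart,
    firstTokenMs: (firstTokenAt ?? Date.now()) - handlerStart,
    totalMs: Date.now() - handlerStart,
  });

  return text;
}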

How to Debug It

  1. Measure startup separately from inference

    • Add timestamps around module init and inside the route handler (a minimal timing sketch follows this list).
    • If init is slow but requests are fast afterward, it’s a cold-start problem.
  2. Log which LangChain classes are created per request

    • Look for repeated creation of:
      • ChatOpenAI
      • OpenAIEmbeddings
      • MemoryVectorStore
      • RunnableSequence
      • ConversationalRetrievalQAChain
    • Anything expensive should usually live outside the handler.
  3. Disable everything except one hop

    • Call the model directly with a minimal prompt.
    • Then add retriever logic.
    • Then add document loading.
    • The step that makes latency jump is your culprit.
  4. Check runtime-specific cold start behavior

    • Serverless functions may freeze containers after idle periods.
    • Edge runtimes can have different networking constraints.
    • Container platforms may scale to zero under low traffic.
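
For steps 1 and 2, a minimal sketch like this separates module-scope init from per-request work; the log labels are illustrative, not part of any LangChain API.

// Separate module-init time from request time.
import { ChatOpenAI } from "@langchain/openai";

const initStart = Date.now();

const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
// ...embeddings, vector store, and chain construction also belong up here...

console.log("module init ms:", Date.now() - initStart);

export async function POST(req: Request) {
  const requestStart = Date.now();
  const { question } = await req.json();

  const result = await llm.invoke(question);

  console.log("request ms:", Date.now() - requestStart);
  return Response.json({ answer: result.content });
}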

Prevention

  • Initialize LangChain clients and retrievers at module scope whenever possible.
  • Prebuild vector indexes and prompt assets during deploy or background jobs (see the sketch after this list).
  • Add latency metrics for:
    • container start
    • chain construction
    • retrieval time
    • first token time
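
One way to prebuild is to embed and persist the index in a deploy-time script, then only load it at runtime. Here is a sketch assuming the HNSWLib integration from @langchain/community; the file paths and the loadDocs() helper are illustrative.

// build-index.ts — run at deploy time, not in the request path
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings();
const docs = await loadDocs(); // the same placeholder loader as above
const store = await HNSWLib.fromDocuments(docs, embeddings);
await store.save("./vector-index");

// At runtime, load the prebuilt index instead of re-embedding documents on boot
const runtimeStore = await HNSWLib.load("./vector-index", new OpenAIEmbeddings());
const retriever = runtimeStore.asRetriever();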

If you treat LangChain like a per-request factory instead of a reusable runtime component, you’ll keep paying cold-start tax forever. Keep initialization out of the hot path and your production latency will stop looking random.


By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
