How to Fix 'cold start latency' in LangChain (TypeScript)
When people say “cold start latency” in LangChain, they usually mean the first request is slow because models, embeddings, vector stores, or tool clients are being initialized on demand. In TypeScript, this shows up most often in serverless functions, API routes, or short-lived workers where every invocation starts from zero.
The fix is usually not in LangChain itself. It’s in how you initialize and reuse your chain, model client, and retriever.
The Most Common Cause
The #1 cause is creating the LLM chain inside the request handler instead of reusing a warmed instance. That forces LangChain to rebuild prompt templates, instantiate ChatOpenAI, and sometimes reconnect to external services on every call.
Here’s the broken pattern:
```ts
// app/api/chat/route.ts
import { ChatOpenAI } from "@langchain/openai";
import { PromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";

export async function POST(req: Request) {
  const { message } = await req.json();

  const llm = new ChatOpenAI({
    apiKey: process.env.OPENAI_API_KEY,
    model: "gpt-4o-mini",
  });

  const prompt = PromptTemplate.fromTemplate(
    "Answer the user clearly: {message}"
  );

  const chain = prompt.pipe(llm).pipe(new StringOutputParser());
  const result = await chain.invoke({ message });

  return Response.json({ result });
}
```
And here’s the fixed pattern:
```ts
// lib/chat-chain.ts
import { ChatOpenAI } from "@langchain/openai";
import { PromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";

const llm = new ChatOpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  model: "gpt-4o-mini",
});

const prompt = PromptTemplate.fromTemplate(
  "Answer the user clearly: {message}"
);

export const chatChain = prompt.pipe(llm).pipe(new StringOutputParser());
```

```ts
// app/api/chat/route.ts
import { chatChain } from "@/lib/chat-chain";

export async function POST(req: Request) {
  const { message } = await req.json();
  const result = await chatChain.invoke({ message });
  return Response.json({ result });
}
```
The difference is simple: build once, reuse many times. In serverless environments, this reduces cold-start work and avoids repeated client setup.
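If you would rather not construct the chain at import time (for example, when environment variables are only available at runtime, or you want tests to import the module without touching OpenAI), a lazy singleton gives the same reuse. This is a minimal sketch, not part of the original code: `getChatChain` and `buildChain` are illustrative names.

```ts
// lib/chat-chain.ts (lazy variant)
import { ChatOpenAI } from "@langchain/openai";
import { PromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";

let chain: ReturnType<typeof buildChain> | null = null;

function buildChain() {
  const llm = new ChatOpenAI({
    apiKey: process.env.OPENAI_API_KEY,
    model: "gpt-4o-mini",
  });
  const prompt = PromptTemplate.fromTemplate(
    "Answer the user clearly: {message}"
  );
  return prompt.pipe(llm).pipe(new StringOutputParser());
}

// Built on first use, then reused by every later request in the same container.
export function getChatChain() {
  if (!chain) {
    chain = buildChain();
  }
  return chain;
}
```

The handler then calls `getChatChain().invoke({ message })`; construction still happens once per container, just deferred to the first request instead of module load.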
Other Possible Causes
1. You are loading large files or embeddings at request time
If you do this inside the handler, your first request pays for disk I/O and parsing.
```ts
// bad: reads (and then chunks/embeds) the corpus on every request
import fs from "node:fs";

export async function POST() {
  const docs = await fs.promises.readFile("./data/policies.md", "utf8");
  // chunk + embed here
}
```
Move it to startup or precompute it during deployment.
```ts
// better: kick off the read once at module scope
import fs from "node:fs";

const docsPromise = fs.promises.readFile("./data/policies.md", "utf8");

export async function POST() {
  const docs = await docsPromise;
  // chunk + embed here (ideally also done once; see the build-time sketch below)
}
```
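For the "precompute during deployment" option, the chunking and embedding can run in a build script so the handler only loads the result. A sketch of that idea, assuming the `@langchain/textsplitters` and `@langchain/openai` packages; the script name, output path, and chunking parameters are illustrative.

```ts
// scripts/build-embeddings.ts (run in CI/CD, not on the request path)
import fs from "node:fs";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";

async function main() {
  const text = await fs.promises.readFile("./data/policies.md", "utf8");

  // Chunk the document once, at build time.
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 100,
  });
  const chunks = await splitter.splitText(text);

  // Embed every chunk once and persist the vectors next to the text.
  const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
  const vectors = await embeddings.embedDocuments(chunks);

  await fs.promises.writeFile(
    "./data/policies.index.json",
    JSON.stringify({ chunks, vectors })
  );
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

At request time the route reads `policies.index.json` (or a hosted index) instead of re-embedding anything.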
2. Your vector store is recreated on every invocation
This is common with MemoryVectorStore, PineconeStore, or PGVectorStore when initialization happens inside the route.
```ts
// bad
const store = await MemoryVectorStore.fromTexts(texts, metadata, embeddings);
const retriever = store.asRetriever();
```
Reuse a singleton or load from a persistent index instead.
```ts
// better: build the retriever once and reuse the promise
let retrieverPromise: ReturnType<typeof createRetriever> | null = null;

export function getRetriever() {
  if (!retrieverPromise) {
    retrieverPromise = createRetriever();
  }
  return retrieverPromise;
}
```
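`createRetriever` is whatever builds your store. One hedged sketch, assuming a local markdown file and `MemoryVectorStore` (swap in Pinecone, pgvector, or another persistent index in production); the file path and chunk size are placeholders.

```ts
// lib/retriever.ts: called once per container via getRetriever()
import fs from "node:fs";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

export async function createRetriever() {
  const text = await fs.promises.readFile("./data/policies.md", "utf8");

  const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000 });
  const chunks = await splitter.splitText(text);

  // The expensive part: embedding the chunks. Runs once, not per request.
  const store = await MemoryVectorStore.fromTexts(
    chunks,
    chunks.map((_, i) => ({ id: i })),
    new OpenAIEmbeddings()
  );

  return store.asRetriever();
}
```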
3. You are creating a new OpenAI client for every call
Even if the model call is fast, repeated instantiation adds overhead and can make latency look worse than it is.
```ts
// bad
export async function ask(question: string) {
  const llm = new ChatOpenAI({ model: "gpt-4o-mini" });
  return llm.invoke(question);
}
```

```ts
// better
const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

export async function ask(question: string) {
  return llm.invoke(question);
}
```
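If the reason you construct a client inside the handler is that the model varies per request, you can still reuse clients by keeping one per model name. A small sketch; the cache itself is plain TypeScript and `getClient` is an illustrative helper, not a LangChain API.

```ts
import { ChatOpenAI } from "@langchain/openai";

const clients = new Map<string, ChatOpenAI>();

// Returns a cached client for the model, creating it only on first use.
function getClient(model: string): ChatOpenAI {
  let client = clients.get(model);
  if (!client) {
    client = new ChatOpenAI({ model });
    clients.set(model, client);
  }
  return client;
}

export async function ask(question: string, model = "gpt-4o-mini") {
  return getClient(model).invoke(question);
}
```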
4. Your runtime is forcing full cold starts
If you deploy to serverless and pull Node-only dependencies into an edge runtime (or otherwise misconfigure the runtime), you can hit slower boot paths or repeated bundling work.
Check your runtime config:
```ts
// app/api/chat/route.ts (Next.js route segment config)
export const runtime = "nodejs";
export const maxDuration = 30;
```
If you accidentally mix edge constraints with Node dependencies like filesystem access or native DB drivers, initialization gets slower or fails outright with errors like:
- `Error [ERR_MODULE_NOT_FOUND]`
- `Cannot read properties of undefined (reading 'invoke')`
- `LangChainError: Failed to initialize vector store`
How to Debug It
- Measure where time is spent
  - Add timestamps around model creation, retriever creation, and `.invoke()` (see the timing sketch after this list).
  - If startup dominates before the first token comes back, it's an initialization issue.
- Log object creation once
  - If you see `new ChatOpenAI()` or `fromTexts()` running on every request, that's your problem.
  - In production logs, this often shows up as repeated warmup messages.
- Isolate LangChain pieces
  - Temporarily remove retrieval and tools.
  - Test only `ChatOpenAI` + `PromptTemplate`.
  - If latency drops sharply, the issue is in embeddings, the vector store, or tooling rather than the LLM call itself.
- Check deployment behavior
  - On Vercel, AWS Lambda, Cloudflare Workers, or similar platforms, compare first-hit latency vs subsequent hits.
  - If only the first request is slow after idle periods, you're dealing with cold starts outside LangChain too.
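A crude but effective way to do the first step is to bracket each phase with `performance.now()` and log the deltas. A sketch of instrumenting the earlier handler; the import paths and the retrieval step are illustrative, so adapt them to your own modules.

```ts
// app/api/chat/route.ts: hypothetical instrumentation of the earlier handler
import { chatChain } from "@/lib/chat-chain";
import { getRetriever } from "@/lib/retriever";

export async function POST(req: Request) {
  const { message } = await req.json();

  const t0 = performance.now();
  await getRetriever(); // setup work (skip if you don't use retrieval)
  const t1 = performance.now();

  const result = await chatChain.invoke({ message });
  const t2 = performance.now();

  console.log(
    `setup: ${(t1 - t0).toFixed(0)}ms, invoke: ${(t2 - t1).toFixed(0)}ms`
  );
  return Response.json({ result });
}
```

If the `setup` number dwarfs `invoke` on the first hit and collapses on later hits, the problem is initialization, not the model.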
Prevention
- Initialize chains at module scope: create `ChatOpenAI`, prompts, retrievers, and parsers once per container lifecycle.
- Prebuild retrieval assets: generate embeddings and indexes during CI/CD or background jobs instead of on request paths.
- Cache expensive clients: use singleton patterns for database pools, vector stores, and external SDK clients (see the caching sketch below).
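For that last point, a common serverless trick is to hang the singleton off `globalThis` so it survives module re-evaluation (for example Next.js hot reloads in development) as well as warm invocations. A sketch under those assumptions; the `__vectorStorePromise` key and the example texts are placeholders.

```ts
// lib/vector-store.ts: cache an expensive client on globalThis
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

// Survives dev-mode module reloads and is reused across warm invocations.
const g = globalThis as typeof globalThis & {
  __vectorStorePromise?: Promise<MemoryVectorStore>;
};

export function getVectorStore(): Promise<MemoryVectorStore> {
  if (!g.__vectorStorePromise) {
    g.__vectorStorePromise = MemoryVectorStore.fromTexts(
      ["example document"], // replace with your real texts or a loaded index
      [{ id: 0 }],
      new OpenAIEmbeddings()
    );
  }
  return g.__vectorStorePromise;
}
```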
| Pattern | Result |
|---|---|
| Create chain inside handler | Slow first request on every cold start |
| Create chain at module scope | Faster warm invocations |
| Build embeddings on demand | High latency and wasted compute |
| Precompute embeddings/indexes | Predictable response times |
If you’re seeing “cold start latency” in a LangChain TypeScript app, start by moving all expensive initialization out of the request path. In practice that fixes most cases before you need to touch anything else.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.