How to Fix 'cold start latency when scaling' in LangChain (TypeScript)

By Cyprian Aarons · Updated 2026-04-22
Tags: cold-start-latency-when-scaling, langchain, typescript

When you see “cold start latency when scaling” in a LangChain TypeScript app, it usually means your first requests after a scale-up are slow because the runtime is rebuilding too much state on demand. In practice, this shows up when your service auto-scales, a new container comes online, and LangChain has to initialize models, retrievers, vector stores, or chains during the request path.

This is not a LangChain bug by itself. It’s usually a deployment pattern problem: expensive initialization happening inside the hot path instead of at process startup or in a warm pool.

The Most Common Cause

The #1 cause is creating LangChain objects inside every request.

A lot of people do this in serverless handlers or HTTP routes:

  • instantiate ChatOpenAI
  • reconnect to vector stores
  • rebuild RunnableSequence
  • re-load prompts and embeddings

That works locally. Under scale, it causes cold-start spikes because every new instance pays the full setup cost before serving traffic.

Broken vs fixed pattern

Broken                           Fixed
Build chain per request          Build once, reuse across requests
Reconnect embeddings every call  Initialize clients at module scope
Load vector store on demand      Warm it during startup

// ❌ Broken: expensive setup runs on every request
import { ChatOpenAI } from "@langchain/openai";
import { RunnableSequence } from "@langchain/core/runnables";
import { PromptTemplate } from "@langchain/core/prompts";

export async function POST(req: Request) {
  const { question } = await req.json();

  const model = new ChatOpenAI({
    apiKey: process.env.OPENAI_API_KEY!,
    model: "gpt-4o-mini",
  });

  const prompt = PromptTemplate.fromTemplate(
    "Answer the question briefly: {question}"
  );

  const chain = RunnableSequence.from([
    prompt,
    model,
  ]);

  const answer = await chain.invoke({ question });
  return Response.json({ answer });
}

// ✅ Fixed: initialize once at module scope
import { ChatOpenAI } from "@langchain/openai";
import { RunnableSequence } from "@langchain/core/runnables";
import { PromptTemplate } from "@langchain/core/prompts";

const model = new ChatOpenAI({
  apiKey: process.env.OPENAI_API_KEY!,
  model: "gpt-4o-mini",
});

const prompt = PromptTemplate.fromTemplate(
  "Answer the question briefly: {question}"
);

const chain = RunnableSequence.from([
  prompt,
  model,
]);

export async function POST(req: Request) {
  const { question } = await req.json();
  const answer = await chain.invoke({ question });
  return Response.json({ answer });
}

If you’re using Next.js route handlers or AWS Lambda, this matters even more. Module-scope objects can be reused across warm invocations, while request-local construction pays the full setup cost on every single invocation.
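For Lambda specifically, the shape is the same: build at module scope, invoke in the handler. A minimal sketch, assuming the chain from the fixed example above is exported from a hypothetical ./chain module and that @types/aws-lambda is installed:

// Sketch: the chain is built once per container at module load;
// warm invocations skip initialization entirely.
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";
import { chain } from "./chain"; // hypothetical module exporting the module-scope chain

export async function handler(
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> {
  const { question } = JSON.parse(event.body ?? "{}");
  // Only the first request on a fresh container pays setup cost.
  const answer = await chain.invoke({ question });
  return { statusCode: 200, body: JSON.stringify({ answer }) };
}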

Other Possible Causes

1. Rebuilding embeddings or vector stores on each request

If you call new OpenAIEmbeddings() and re-open Pinecone, Supabase, or pgvector connections per request, scaling will punish you.

// ❌ Bad: store rebuilt inside every request handler
const store = await MemoryVectorStore.fromTexts(texts, metadatas, embeddings);

// ✅ Better: start building once at module scope and share the promise
const storePromise = MemoryVectorStore.fromTexts(texts, metadatas, embeddings);

export async function getStore() {
  return storePromise;
}

For production vector DBs, create the client once and reuse it.
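As a sketch of that pattern with Pinecone (the index name "docs" and the search helper are placeholders; assumes @pinecone-database/pinecone and @langchain/pinecone):

// Sketch: one client and one store promise per process, shared by all requests.
import { Pinecone } from "@pinecone-database/pinecone";
import { PineconeStore } from "@langchain/pinecone";
import { OpenAIEmbeddings } from "@langchain/openai";

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const embeddings = new OpenAIEmbeddings();

// Started once at module load; every handler awaits the same promise.
const storePromise = PineconeStore.fromExistingIndex(embeddings, {
  pineconeIndex: pinecone.index("docs"), // placeholder index name
});

export async function search(query: string) {
  const store = await storePromise;
  return store.similaritySearch(query, 4);
}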

2. Lazy-loading large prompt/tool registries during traffic

If your agent loads dozens of tools only when the first user hits it, that first pod will look broken.

// ❌ Bad: tool registry built on demand
export async function buildAgent() {
  const tools = await loadAllToolsFromDb();
  return createReactAgent({ llm, tools });
}

// ✅ Better: warm at startup
const toolsPromise = loadAllToolsFromDb();

export async function buildAgent() {
  const tools = await toolsPromise;
  return createReactAgent({ llm, tools });
}

3. Excessive retries hiding real latency

LangChain retries can make cold starts look worse than they are. If you see errors like “Error [LangChainError]: Failed to invoke chain” followed by repeated attempts, check your retry settings.

const model = new ChatOpenAI({
  apiKey: process.env.OPENAI_API_KEY!,
  model: "gpt-4o-mini",
  maxRetries: 0, // disable retries while diagnosing so failures surface immediately
});

Use retries for transient failures, but don’t let them mask initialization problems.
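One way to keep both, assuming a hypothetical DEBUG_COLD_START environment flag: retries off while profiling cold starts, a modest cap otherwise.

// Sketch: env-gated retries so profiling sees raw latency.
const model = new ChatOpenAI({
  apiKey: process.env.OPENAI_API_KEY!,
  model: "gpt-4o-mini",
  maxRetries: process.env.DEBUG_COLD_START ? 0 : 2,
});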

4. Serverless bundle too large

If your Lambda or edge function imports heavy SDKs at the top level, startup time goes up before LangChain even runs.

// ❌ Heavy imports in hot path
import * as fs from "fs";
import * as pdfjs from "pdfjs-dist";
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";

Split ingestion jobs from query-time handlers. Don’t ship parsing libraries into the request path unless you need them there.
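One way to do that split, sketched below: keep the heavy parser behind a dynamic import so only the ingestion path loads it (ingestPdf is a hypothetical ingestion entry point).

// Sketch: pdfjs-dist loads only when ingestion actually runs,
// so query-path cold starts never pay for it.
export async function ingestPdf(data: Uint8Array) {
  const pdfjs = await import("pdfjs-dist"); // deferred, not top-level
  const doc = await pdfjs.getDocument({ data }).promise;
  return doc.numPages;
}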

How to Debug It

  1. Measure where time is spent. Add timestamps around object creation and invocation (see the sketch after the bullets below).

    • If new ChatOpenAI() is cheap but vectorStore init is slow, you found your issue.
    • If the first invoke() is slow but later ones are fast, that’s classic cold start behavior.
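    A rough sketch of that measurement, assuming the model and chain from the fixed example above:

    const t0 = performance.now();
    const model = new ChatOpenAI({ model: "gpt-4o-mini" });
    console.log(`model init: ${(performance.now() - t0).toFixed(1)}ms`);

    const t1 = performance.now();
    await chain.invoke({ question: "warmup" }); // first call pays any cold-start cost
    console.log(`first invoke: ${(performance.now() - t1).toFixed(1)}ms`);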
  2. Log initialization once per process. Use a module-level log:

    console.log("Initializing agent runtime");
    

    If this appears on every request, your platform is spinning up fresh instances too often or your code is rebuilding state inside handlers.

  3. Check for per-request constructors. Search for these inside route handlers:

    • new ChatOpenAI(...)
    • createRetrievalChain(...)
    • Chroma.fromDocuments(...)
    • PineconeStore.fromExistingIndex(...)

    Anything expensive belongs outside the handler unless it truly changes per request; when it genuinely varies, cache instances by key, as in the sketch below.
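    A sketch of that caching, with a hypothetical per-tenant model as the thing that varies:

    // Cache per-key instances instead of constructing in every request.
    const modelsByTenant = new Map<string, ChatOpenAI>();

    function getModel(tenantId: string): ChatOpenAI {
      let model = modelsByTenant.get(tenantId);
      if (!model) {
        model = new ChatOpenAI({ model: "gpt-4o-mini" });
        modelsByTenant.set(tenantId, model);
      }
      return model;
    }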

  4. Inspect deployment behavior. Look at:

    • autoscaling events
    • Lambda concurrent executions
    • container restarts
    • memory pressure causing evictions

    If latency spikes only when replicas increase from zero or one to many, this is not a LangChain issue alone. It’s a runtime warmup problem.
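If warmup is the culprit, you can force initialization before a new replica receives traffic. A sketch of a readiness handler, reusing the shared promises from the examples above (the route shape and the ./runtime module are assumptions):

// Sketch: a readiness endpoint that blocks until shared state is warm.
// Point the platform's readiness probe here so a fresh replica only
// receives traffic after initialization finishes.
import { storePromise, toolsPromise } from "./runtime"; // hypothetical module

export async function GET() {
  await Promise.all([storePromise, toolsPromise]); // forces warm state
  return new Response("ok");
}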

Prevention

  • Initialize LangChain clients and chains at module scope whenever possible.
  • Separate ingestion/indexing jobs from online query handlers.
  • Keep request-time code limited to parsing input and calling .invoke() or .stream().

If you want stable latency under scale, treat LangChain objects like infrastructure, not temporary variables. Build them once, reuse them hard, and keep expensive setup out of the request path.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

