How to Fix 'cold start latency in production' in LangChain (TypeScript)
When people say they’re seeing “cold start latency in production” with LangChain TypeScript, they usually mean the first request after deploy or idle time is much slower than the rest. In practice, this shows up as slow first-token time, long Lambda/Cloud Run startup, or a chain that feels fine locally but stalls under real traffic.
The root cause is usually not LangChain itself. It’s almost always how the app initializes models, embeddings, vector stores, or serverless containers.
The Most Common Cause
The #1 cause is creating LLMs, embeddings, retrievers, or vector stores inside the request path instead of initializing them once and reusing them.
This is especially bad in serverless environments because every cold boot repeats expensive setup: SDK auth, network handshakes, model client creation, and vector index loading.
Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Initializes `ChatOpenAI` and `MemoryVectorStore` per request | Initializes once at module scope and reuses |
| Rebuilds embeddings on every call | Reuses the embeddings client |
| First request hits `RunnableSequence`, `ChatOpenAI`, and retriever setup all at once | Keeps the request path thin |
```ts
// ❌ Broken: everything happens inside the handler
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { RunnableSequence } from "@langchain/core/runnables";

export async function POST(req: Request) {
  const body = await req.json();

  const llm = new ChatOpenAI({
    model: "gpt-4o-mini",
    temperature: 0,
  });

  const embeddings = new OpenAIEmbeddings();

  const store = await MemoryVectorStore.fromTexts(
    ["policy A", "policy B"],
    [{ id: 1 }, { id: 2 }],
    embeddings
  );

  const chain = RunnableSequence.from([
    async (input: string) => {
      const docs = await store.similaritySearch(input, 2);
      return docs.map((d) => d.pageContent).join("\n");
    },
    llm,
  ]);

  const result = await chain.invoke(body.question);
  return Response.json({ result });
}
```
```ts
// ✅ Fixed: initialize once and reuse across requests
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { RunnableSequence } from "@langchain/core/runnables";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const embeddings = new OpenAIEmbeddings();

const storePromise = MemoryVectorStore.fromTexts(
  ["policy A", "policy B"],
  [{ id: 1 }, { id: 2 }],
  embeddings
);

export async function POST(req: Request) {
  const body = await req.json();
  const store = await storePromise;

  const chain = RunnableSequence.from([
    async (input: string) => {
      const docs = await store.similaritySearch(input, 2);
      return docs.map((d) => d.pageContent).join("\n");
    },
    llm,
  ]);

  const result = await chain.invoke(body.question);
  return Response.json({ result });
}
```
If you’re on AWS Lambda, Vercel Functions, or Cloud Run, this difference is huge. Module-scope initialization lets warm invocations skip repeated setup.
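If your runtime doesn't support module-scope top-level `await`, a memoized getter achieves the same thing. A minimal sketch, where the `Client` type and the timeout are stand-ins for real LangChain client setup:

```ts
// Hypothetical stand-in for an expensive LangChain client (auth, index load).
type Client = { createdAt: number };

let clientPromise: Promise<Client> | undefined;

function getClient(): Promise<Client> {
  // Memoize the promise itself: the expensive work runs once per container,
  // and concurrent cold-start requests share the same in-flight init.
  clientPromise ??= (async () => {
    await new Promise((resolve) => setTimeout(resolve, 50)); // fake setup cost
    return { createdAt: Date.now() };
  })();
  return clientPromise;
}

export async function handler(): Promise<Client> {
  return getClient(); // warm invocations reuse the cached instance
}
```

Because the promise is cached rather than its result, simultaneous requests during a cold start all await the same initialization instead of each triggering their own.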
Other Possible Causes
1. Expensive prompt assembly on every request
If you’re reading large templates from disk or generating prompts dynamically, that work adds up.
```ts
// ❌ Reads and builds on every request (inside the handler)
const template = await fs.promises.readFile("prompt.txt", "utf8");
const prompt = PromptTemplate.fromTemplate(template);
```

```ts
// ✅ Load once at module scope, reuse the parsed template
import fs from "node:fs";
import { PromptTemplate } from "@langchain/core/prompts";

const promptPromise = fs.promises
  .readFile("prompt.txt", "utf8")
  .then((text) => PromptTemplate.fromTemplate(text));
```
The fix is boring but effective: load static assets at startup.
2. Cold vector store / retriever hydration
A common LangChain stack uses VectorStoreRetriever. If you rebuild the index or reload documents on each invocation, the first call will crawl.
```ts
// ❌ Hydrates documents on every request (inside the handler)
const docs = await loadDocs();
const store = await Chroma.fromDocuments(docs, embeddings);
const retriever = store.asRetriever();
```

```ts
// ✅ Build once during boot
import { Chroma } from "@langchain/community/vectorstores/chroma";

const retrieverPromise = (async () => {
  const docs = await loadDocs();
  const store = await Chroma.fromDocuments(docs, embeddings);
  return store.asRetriever();
})();
```
If your production data changes often, rebuild out of band and swap the index atomically.
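That out-of-band rebuild can be sketched as a single mutable reference that requests always read. `buildRetriever` here is a hypothetical stand-in for loading documents into a vector store:

```ts
// Hypothetical retriever; in a real app this would wrap a vector store.
type Retriever = { version: number; search: (q: string) => string[] };

async function buildRetriever(version: number): Promise<Retriever> {
  // Stand-in for loading docs and embedding them into an index.
  return { version, search: (q) => [`match for "${q}" (index v${version})`] };
}

// Boot builds the first index; requests always read the current reference.
let retrieverPromise: Promise<Retriever> = buildRetriever(1);

// Out-of-band refresh: build the new index fully, then swap in one assignment,
// so in-flight requests keep the old index and later ones see the new one.
export async function refreshIndex(version: number): Promise<void> {
  const next = await buildRetriever(version);
  retrieverPromise = Promise.resolve(next);
}

export async function handleQuery(q: string): Promise<string[]> {
  const retriever = await retrieverPromise;
  return retriever.search(q);
}
```

The key design choice is that `refreshIndex` awaits the full build before the swap, so the request path never sees a half-hydrated index.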
3. Network handshakes with remote model providers
ChatOpenAI, Anthropic clients, and other provider SDKs can add noticeable latency on first use if you create them lazily inside handlers.
```ts
// ❌ Client created only when traffic arrives
export async function handler() {
  const llm = new ChatOpenAI({ model: "gpt-4o-mini" });
  // ...
}
```

```ts
// ✅ Create the client at module load
const llm = new ChatOpenAI({ model: "gpt-4o-mini" });
```
Also check whether your runtime has outbound DNS delays or VPC egress issues. Those look like LangChain slowness but aren’t.
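If first-use handshake cost is the problem, one option is to fire a cheap warm-up call at module load. A sketch with a generic `ping` callback (hypothetical; any inexpensive provider call would do in its place):

```ts
// Kick off a warm-up ping at module load and cache the result. Failures are
// swallowed so a flaky warm-up can't crash the boot path.
function warmUp(ping: () => Promise<void>): () => Promise<void> {
  const warmed = ping().catch(() => undefined);
  return async () => {
    await warmed; // every caller awaits the same single ping
  };
}

// Example: count how many times the ping actually runs.
let pings = 0;
export const ready = warmUp(async () => {
  pings += 1; // real code might open a TLS connection here instead
});
export const pingCount = () => pings;
```

Handlers can `await ready()` before their first model call; warm invocations pay nothing because the cached promise is already settled.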
4. Using streaming without warming the path
If your first token is delayed but the rest flows normally, your issue may be stream setup rather than generation time.
```ts
const stream = await llm.stream(messages); // first token delayed by upstream setup
```
Make sure you’re measuring:

- time to handler start
- time to `invoke`
- time to first token
- total completion time
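A minimal way to capture time-to-first-token versus total time, using a fake async stream in place of `llm.stream()`:

```ts
// Fake stream standing in for llm.stream(); swap in the real call.
async function* fakeStream(): AsyncGenerator<string> {
  await new Promise((resolve) => setTimeout(resolve, 30)); // upstream setup
  yield "first token";
  yield "rest of completion";
}

export async function measureStream(stream: AsyncIterable<string>) {
  const start = Date.now();
  let firstTokenMs: number | undefined;
  for await (const _chunk of stream) {
    firstTokenMs ??= Date.now() - start; // record only the first chunk
  }
  return { firstTokenMs: firstTokenMs ?? 0, totalMs: Date.now() - start };
}
```

If `firstTokenMs` dominates `totalMs`, the cost is in setup before generation, not in the model producing tokens.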
How to Debug It
- Measure startup separately from inference
  - Add timestamps around module init and inside the route handler.
  - If init is slow but requests are fast afterward, it’s a cold-start problem.
- Log which LangChain classes are created per request
  - Look for repeated creation of `ChatOpenAI`, `OpenAIEmbeddings`, `MemoryVectorStore`, `RunnableSequence`, and `ConversationalRetrievalQAChain`.
  - Anything expensive should usually live outside the handler.
- Disable everything except one hop
  - Call the model directly with a minimal prompt.
  - Then add retriever logic.
  - Then add document loading.
  - The step that makes latency jump is your culprit.
- Check runtime-specific cold start behavior
  - Serverless functions may freeze containers after idle periods.
  - Edge runtimes can have different networking constraints.
  - Container platforms may scale to zero under low traffic.
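The timestamp idea can be sketched as a tiny wrapper; all names here are illustrative:

```ts
// Record module-init cost once, then log per-request time from the handler.
const bootStart = Date.now();
// ...expensive module-scope initialization would run here...
const bootMs = Date.now() - bootStart;

export function timed<T>(fn: () => Promise<T>): () => Promise<T> {
  return async () => {
    const t0 = Date.now();
    const result = await fn();
    // One log line per request: cold-start cost vs request cost.
    console.log(JSON.stringify({ bootMs, handlerMs: Date.now() - t0 }));
    return result;
  };
}
```

If `bootMs` is large but `handlerMs` is small and stable, you’re looking at a cold-start problem, not a chain problem.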
Prevention
- Initialize LangChain clients and retrievers at module scope whenever possible.
- Prebuild vector indexes and prompt assets during deploy or background jobs.
- Add latency metrics for:
  - container start
  - chain construction
  - retrieval time
  - first token time
If you treat LangChain like a per-request factory instead of a reusable runtime component, you’ll keep paying cold-start tax forever. Keep initialization out of the hot path and your production latency will stop looking random.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.