How to Fix 'cold start latency in production' in LangChain (Python)

By Cyprian Aarons · Updated 2026-04-21

Cold start latency in production usually means your first request is paying the cost of loading models, initializing clients, building chains, or warming up vector stores. In LangChain Python apps, this shows up most often after deploys, cold Lambda starts, or when every request rebuilds the same objects instead of reusing them.

The Most Common Cause

The #1 cause is initializing expensive LangChain objects inside the request path: constructing ChatOpenAI clients, embeddings, vector stores, retrievers, and ConversationalRetrievalChain on every call.

Here’s the broken pattern:

# broken.py
from fastapi import FastAPI
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain_community.vectorstores import FAISS

app = FastAPI()

@app.post("/chat")
def chat(payload: dict):
    llm = ChatOpenAI(model="gpt-4o-mini")  # new client created per request
    embeddings = OpenAIEmbeddings()  # new embeddings client per request
    vectorstore = FAISS.load_local(  # index loaded from disk per request: expensive and in the wrong place
        "index",
        embeddings,
        allow_dangerous_deserialization=True,
    )
    retriever = vectorstore.as_retriever()
    chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever)

    return chain.invoke({"question": payload["question"], "chat_history": []})

And here’s the right pattern:

# fixed.py
from fastapi import FastAPI
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain_community.vectorstores import FAISS

app = FastAPI()

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.load_local(
    "index",
    embeddings,
    allow_dangerous_deserialization=True,
)
retriever = vectorstore.as_retriever()
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever)

@app.post("/chat")
def chat(payload: dict):
    return chain.invoke({"question": payload["question"], "chat_history": []})

If you’re seeing logs and symptoms like:

  • INFO langchain_core.callbacks...
  • RunnableSequence.invoke()
  • FAISS.load_local(...)
  • slow first response only after deploy

then you’re almost certainly building runtime state inside the request path instead of at startup.

Other Possible Causes

1) Cold model provider startup

Even if your code is fine, the first call through ChatOpenAI can be slow: the HTTP client still has to resolve DNS and complete TLS handshakes, and the provider side itself may be cold.

# config snippet
llm = ChatOpenAI(
    model="gpt-4o-mini",
    timeout=30,
    max_retries=2,
)

What helps:

  • Send a warm-up request at startup (see the sketch below)
  • Keep a singleton client
  • Avoid creating new HTTP sessions per request
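
One way to send that warm-up request is FastAPI's lifespan hook. Here's a minimal sketch, assuming the module-level llm from fixed.py; the "ping" prompt and file name are just illustrative:

# warmup.py
from contextlib import asynccontextmanager

from fastapi import FastAPI
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", timeout=30, max_retries=2)

@asynccontextmanager
async def lifespan(app: FastAPI):
    # One cheap call at boot pays for DNS resolution, the TLS handshake,
    # and connection pooling before the first real user request arrives.
    await llm.ainvoke("ping")
    yield

app = FastAPI(lifespan=lifespan)

Doing this in lifespan rather than at import time keeps the call async and ensures it only runs when the server actually starts.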

2) Loading embeddings or vector stores from disk on every request

This is common with FAISS.load_local, Chroma(persist_directory=...), or remote stores initialized inside handlers.

# broken
@app.get("/search")
def search(q: str):
    embeddings = OpenAIEmbeddings()
    store = FAISS.load_local("index", embeddings, allow_dangerous_deserialization=True)
    return store.similarity_search(q)

Fix it by loading once:

embeddings = OpenAIEmbeddings()
store = FAISS.load_local("index", embeddings, allow_dangerous_deserialization=True)

@app.get("/search")
def search(q: str):
    return store.similarity_search(q)
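
If module-level loading is awkward (for example, you don't want tests to hit disk just by importing the module), a cached loader gives the same one-time cost. This is a sketch using functools.lru_cache; get_store is a hypothetical helper, and the route replaces the one above:

from functools import lru_cache

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

@lru_cache(maxsize=1)
def get_store() -> FAISS:
    # Runs once on the first call; every later call reuses the cached store.
    embeddings = OpenAIEmbeddings()
    return FAISS.load_local("index", embeddings, allow_dangerous_deserialization=True)

@app.get("/search")
def search(q: str):
    return get_store().similarity_search(q)

The first request still pays the load, so pair this with a startup warm-up call if the very first hit needs to be fast.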

3) Lazy imports in hot paths

Python imports are not free. If you import LangChain integrations inside a handler, you pay that cost on first hit.

# broken
@app.post("/answer")
def answer(payload: dict):
    from langchain_openai import ChatOpenAI
    from langchain_core.prompts import ChatPromptTemplate
    ...

Move imports to module scope:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

This matters more in serverless deployments where every millisecond counts.
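
To see how much the imports themselves cost, time them in a fresh interpreter (CPython's -X importtime flag gives a per-module breakdown as well). A rough sketch, with module names as examples:

import importlib
import time

# Run in a fresh process; modules already in sys.modules import instantly.
for module_name in ["langchain_openai", "langchain_community.vectorstores"]:
    start = time.perf_counter()
    importlib.import_module(module_name)
    print(f"{module_name}: {time.perf_counter() - start:.3f}s")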


4) Building chains dynamically for every request

If you create a new RunnableSequence, LLMChain, or ConversationalRetrievalChain each time, you add avoidable overhead.

# broken
def build_chain():
    prompt = ChatPromptTemplate.from_template("Answer: {question}")
    return prompt | llm

@app.post("/ask")
def ask(payload: dict):
    chain = build_chain()
    return chain.invoke({"question": payload["question"]})

Prefer one shared chain:

prompt = ChatPromptTemplate.from_template("Answer: {question}")
chain = prompt | llm

@app.post("/ask")
def ask(payload: dict):
    return chain.invoke({"question": payload["question"]})

How to Debug It

  1. Measure startup vs request latency separately

    • Log timestamps during app boot and inside the handler.
    • If boot is fast but first request is slow, your initialization is happening lazily.
  2. Print object creation points

    • Add logs around ChatOpenAI(...), FAISS.load_local(...), and chain construction.
    • If those logs appear on every request, that’s your bug.
  3. Profile the first request only

    • Use cProfile or simple timing wrappers (examples below).
    • Look for slow calls to:
      • load_local
      • embedding generation
      • retriever creation
      • remote API client initialization
  4. Check deployment behavior

    • In Lambda, Cloud Run min instances, Gunicorn workers, or Kubernetes pods, confirm whether each worker reloads state.
    • A “cold start” may actually be “one cold start per worker.”

Example timing wrapper:

import time

start = time.perf_counter()
llm = ChatOpenAI(model="gpt-4o-mini")
print(f"LLM init: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
store = FAISS.load_local("index", embeddings, allow_dangerous_deserialization=True)
print(f"Vector store load: {time.perf_counter() - start:.3f}s")

Prevention

  • Initialize LangChain clients, retrievers, and chains at module load or app startup, not inside request handlers.
  • Reuse embeddings and vector stores across requests; treat them as application singletons.
  • Add startup warm-up calls for production deployments that scale to zero.
  • Put timing logs around chain construction so regressions show up before users do.

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

