How to Fix 'cold start latency during development' in LangChain (Python)
When you see cold start latency during development in a LangChain Python app, it usually means your chain or agent is doing expensive initialization on the first request. In practice, this shows up when you build clients, load models, or compile prompts inside the request path instead of once at startup.
The symptom is usually not a hard crash. It’s a slow first call, timeouts in local dev, or a long pause before Runnable.invoke() returns.
The Most Common Cause
The #1 cause is creating LangChain objects inside the function that handles each request.
That means every call rebuilds the LLM client, embeddings client, retriever, vector store, or agent graph. In LangChain terms, you’re paying the initialization cost for ChatOpenAI, OpenAIEmbeddings, Chroma, FAISS, or create_react_agent() on every request.
Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Build the chain inside the handler | Build once at module load or app startup |
| Recreate ChatOpenAI per request | Reuse a singleton client |
| Reload vector store on every call | Load index once and reuse it |
```python
# broken.py
from fastapi import FastAPI
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

app = FastAPI()

@app.get("/answer")
def answer(q: str):
    llm = ChatOpenAI(model="gpt-4o-mini")  # recreated every request
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a support assistant."),
        ("human", "{question}"),
    ])
    chain = prompt | llm
    return chain.invoke({"question": q})
```
```python
# fixed.py
from fastapi import FastAPI
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

app = FastAPI()

llm = ChatOpenAI(model="gpt-4o-mini")  # created once
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a support assistant."),
    ("human", "{question}"),
])
chain = prompt | llm

@app.get("/answer")
def answer(q: str):
    return chain.invoke({"question": q})
```
If you’re using an agent, the same rule applies:
```python
# broken: agent built per request
from langchain.agents import create_react_agent

@app.get("/agent")
def run_agent(q: str):
    agent = create_react_agent(llm=ChatOpenAI(), tools=tools, prompt=prompt)
    return agent.invoke({"input": q})
```
Build the agent once and reuse it.
Other Possible Causes
1) Loading embeddings or vector stores lazily on first query
This is common with Chroma, FAISS, or any retriever backed by disk.
```python
# bad
def search_docs(query: str):
    vectorstore = FAISS.load_local("index", embeddings=OpenAIEmbeddings())
    retriever = vectorstore.as_retriever()
    return retriever.invoke(query)
```
```python
# good
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.load_local("index", embeddings=embeddings)
retriever = vectorstore.as_retriever()

def search_docs(query: str):
    return retriever.invoke(query)
```
2) Running synchronous code in an async server
If you call blocking LangChain methods inside an async endpoint, your “cold start” can look worse than it is.
```python
@app.get("/chat")
async def chat(q: str):
    return chain.invoke({"question": q})  # blocking call in async route
```
Use async-aware methods where available:
```python
@app.get("/chat")
async def chat(q: str):
    return await chain.ainvoke({"question": q})
```
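When a component only offers a synchronous method, asyncio.to_thread keeps the event loop free while the blocking work runs in a worker thread. A self-contained sketch, with blocking_invoke standing in for a synchronous chain call:

```python
import asyncio
import time

def blocking_invoke(question: str) -> str:
    # Stand-in for a synchronous chain.invoke() call
    time.sleep(0.1)
    return f"answer to {question}"

async def chat(q: str) -> str:
    # Offload the blocking call to a worker thread so the
    # event loop can keep serving other requests meanwhile
    return await asyncio.to_thread(blocking_invoke, q)

async def main() -> None:
    # Two concurrent requests overlap instead of queueing behind each other
    t0 = time.perf_counter()
    results = await asyncio.gather(chat("a"), chat("b"))
    print(results, round(time.perf_counter() - t0, 2))

asyncio.run(main())
```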
3) Recompiling prompts or parsing schemas repeatedly
If you build structured output parsers or Pydantic schemas per request, startup cost gets pushed into runtime.
```python
def handler(q: str):
    parser = PydanticOutputParser(pydantic_object=MySchema)
    prompt = PromptTemplate(template=TEMPLATE, input_variables=["q"])
    ...
```
Move them out of the hot path:
```python
parser = PydanticOutputParser(pydantic_object=MySchema)
prompt = PromptTemplate(template=TEMPLATE, input_variables=["q"])
```
4) Network/auth setup happening at import time
This happens when environment variables are missing or empty, so provider setup fails or retries on first use instead of at startup:

```bash
export OPENAI_API_KEY=""   # empty key: client setup fails on the first call
export LANGCHAIN_TRACING_V2=true
```
Or when you instantiate clients before config is ready:
```python
# bad if env vars are loaded later by your app framework
llm = ChatOpenAI()
```
Load config first, then create clients.
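A minimal sketch of the fail-fast version: validate required variables once at startup, before any client is constructed. The REQUIRED_VARS list and validate_config name are illustrative, not a LangChain API:

```python
import os

REQUIRED_VARS = ("OPENAI_API_KEY",)  # extend for your providers

def validate_config() -> None:
    # Fail fast at startup instead of surfacing auth errors on the first request
    missing = [v for v in REQUIRED_VARS if not os.environ.get(v)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")

# Call validate_config() before constructing ChatOpenAI(), embeddings, etc.
```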
How to Debug It
- Time each stage separately
  - Measure import time, client creation time, and first invocation time.
  - If ChatOpenAI() is fast but .invoke() is slow, the issue is downstream network/model warmup.
  - If object creation is slow, you're rebuilding too much.
- Log object construction
  - Add logs around ChatOpenAI, FAISS.load_local(), Chroma(), and create_react_agent().
  - If those logs appear on every request, you found the bug.
- Check for blocking calls in async routes
  - Search for .invoke() inside async def.
  - Replace with .ainvoke() or move work to a worker thread/process.
- Isolate one component at a time
  - Comment out retrievers, tools, memory, and output parsers.
  - Re-add them until latency jumps.
  - The last component added is usually the culprit.
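To time each stage, a tiny stdlib harness is enough. A sketch with a stand-in SlowClient; in a real app you would wrap ChatOpenAI(), FAISS.load_local(), and the first chain.invoke() the same way:

```python
import time

def timed(label: str, fn, *args):
    # Run fn, print how long it took, and pass its result through
    t0 = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {(time.perf_counter() - t0) * 1000:.1f} ms")
    return result

class SlowClient:
    def __init__(self):
        time.sleep(0.05)  # stand-in for expensive client construction
    def invoke(self, q: str) -> str:
        return f"ok: {q}"

client = timed("construct", SlowClient)
timed("first call", client.invoke, "hello")
timed("warm call", client.invoke, "hello")
```

If "construct" dominates on every request, you are rebuilding objects; if only "first call" is slow, the cost is downstream warmup.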
Prevention
- Initialize LangChain clients and chains once at startup, not inside handlers.
- Keep retrievers, vector stores, and parsers as module-level singletons where practical.
- Use .ainvoke() in async apps and benchmark cold vs warm requests separately.
If you want one rule to remember: anything that loads files, connects to a provider, or compiles an agent should happen before traffic hits your endpoint.
Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist + starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.