How to Fix 'deployment crash when scaling' in LangChain (Python)
When a LangChain deployment crashes only after you scale from one worker to many, the cause is usually simple: your chain or model client is not safe to reuse across concurrent requests, or it depends on local state that disappears once the app runs in multiple processes. This typically shows up in FastAPI, Celery, Gunicorn, Kubernetes, or serverless deployments where the same code works locally but falls apart under parallel load.
The failure mode is often noisy: `RuntimeError: Event loop is closed`, `ValueError: I/O operation on closed file`, `openai.BadRequestError`, or plain worker exits like `Worker exited unexpectedly` / `OOMKilled`. In LangChain terms, the root cause usually traces back to shared mutable state in `LLMChain`, `ChatOpenAI`, vector stores, callback handlers, or memory objects.
The Most Common Cause
The #1 cause is reusing a single LangChain client or chain instance with mutable state across multiple workers or requests.
This breaks most often when people create a global chain with memory, callbacks, or a streaming handler, then scale horizontally. The code looks fine in development because there is only one process and low concurrency.
Broken vs fixed
| Broken pattern | Fixed pattern |
|---|---|
| Global chain/client shared by all requests | Create per-request chain/client or use stateless components |
| Mutable ConversationBufferMemory reused globally | Store memory per session in Redis/DB |
| Streaming callback handler reused across requests | Instantiate handlers inside request scope |
```python
# BROKEN
from fastapi import FastAPI
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain_core.prompts import PromptTemplate

app = FastAPI()

llm = ChatOpenAI(model="gpt-4o-mini")
memory = ConversationBufferMemory(return_messages=True)
prompt = PromptTemplate.from_template(
    "Answer the user question.\nHistory: {history}\nQuestion: {question}"
)
chain = LLMChain(
    llm=llm,
    prompt=prompt,
    memory=memory,
)

@app.post("/chat")
def chat(payload: dict):
    # Under load this can mix state between users / workers.
    return chain.invoke({"question": payload["question"]})
```
```python
# FIXED
from fastapi import FastAPI
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain_core.prompts import PromptTemplate

app = FastAPI()

prompt = PromptTemplate.from_template(
    "Answer the user question.\nHistory: {history}\nQuestion: {question}"
)

def build_chain(session_id: str):
    # Fresh client and memory per request, so nothing mutable is shared.
    # session_id is where you would load persisted history from an
    # external store instead of keeping it in process memory.
    llm = ChatOpenAI(model="gpt-4o-mini")
    memory = ConversationBufferMemory(return_messages=True)
    return LLMChain(llm=llm, prompt=prompt, memory=memory)

@app.post("/chat")
def chat(payload: dict):
    chain = build_chain(payload["session_id"])
    return chain.invoke({"question": payload["question"]})
```
If you need real conversation history, do not keep it in process memory. Put it in Redis, Postgres, DynamoDB, or another external store keyed by session ID.
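A minimal sketch of what "memory as external storage keyed by session ID" looks like. A plain dict stands in for Redis/Postgres here, and the helper names (`get_history`, `append_turn`) are illustrative, not LangChain APIs:

```python
# SESSION_STORE stands in for Redis/Postgres/DynamoDB; in production
# these helpers would be network calls keyed by session_id.
SESSION_STORE: dict[str, list[dict]] = {}

def get_history(session_id: str) -> list[dict]:
    """Load this session's turns from the external store."""
    return SESSION_STORE.get(session_id, [])

def append_turn(session_id: str, role: str, content: str) -> None:
    """Persist one turn; in production this is a Redis LPUSH or SQL INSERT."""
    SESSION_STORE.setdefault(session_id, []).append(
        {"role": role, "content": content}
    )

# Each request rebuilds its prompt from stored history, so workers
# share nothing in process memory and can scale horizontally.
append_turn("sess-1", "user", "What is LangChain?")
append_turn("sess-1", "assistant", "A framework for LLM apps.")
history = get_history("sess-1")
```

Because every worker reads and writes the same external store, it no longer matters which process handles which request.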
Other Possible Causes
1) Event loop mismatch in async deployments
If you call sync LangChain code inside an async app under load, you can hit errors like:
- `RuntimeError: Event loop is closed`
- `RuntimeError: This event loop is already running`

This typically happens when you call `.invoke()` from async routes with blocking clients, or mix sync and async OpenAI calls.
```python
# BAD
@app.post("/ask")
async def ask(payload: dict):
    return chain.invoke({"question": payload["question"]})

# GOOD
@app.post("/ask")
async def ask(payload: dict):
    return await chain.ainvoke({"question": payload["question"]})
```
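If a component only exposes a sync API, you can still keep the event loop responsive by offloading the blocking call to a thread. A small sketch using the standard-library `asyncio.to_thread`; `blocking_call` is a stand-in for a sync-only `chain.invoke()`:

```python
import asyncio
import time

def blocking_call(question: str) -> str:
    # Stand-in for a sync-only chain.invoke() that does network I/O.
    time.sleep(0.1)
    return f"answer to: {question}"

async def ask(question: str) -> str:
    # The blocking work runs in a worker thread, so the event loop
    # keeps serving other requests instead of stalling.
    return await asyncio.to_thread(blocking_call, question)

print(asyncio.run(ask("why did the worker crash?")))
```

This avoids the "already running" / "closed" event loop errors that come from blocking the loop under load.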
2) Worker count exposes hidden rate limits
Scaling often means more concurrent calls to OpenAI or another provider. Then you start seeing:
- `openai.RateLimitError`
- `429 Too Many Requests`
- retries piling up until the worker dies
Fix this by limiting concurrency and adding backoff.
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(wait=wait_exponential(min=1, max=10), stop=stop_after_attempt(5))
def call_llm(question: str):
    return chain.invoke({"question": question})
```
Also cap request fan-out if you use agents or parallel tools.
3) Memory blow-up from large prompts or retrieved context
A single pod may survive locally but crash at scale because every worker loads large embeddings indexes or builds huge prompts. Common symptoms:
- container restarts with `OOMKilled`
- Gunicorn worker timeouts
- `Killed` in logs
Bad pattern:
```python
# Loading everything into memory at startup, in every worker.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.load_local("./big_index", embeddings)
```
Better pattern:
```python
# Load once per pod only if size is acceptable,
# otherwise use a managed vector DB.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
```
If your index is large, move it out of process entirely.
4) Callback handlers writing to shared files/stdout incorrectly
Streaming handlers and custom callbacks can crash when multiple workers write to the same file handle.
```python
# BAD: one shared file handle, written to by every request.
log_file = open("/tmp/langchain.log", "a")
handler = MyStreamingHandler(log_file)
```
Use structured logging per process and avoid sharing open handles across requests.
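A sketch of per-process logging with the standard library: each worker gets its own logger tagged with its PID and writes to stderr, so no file handle is shared across processes. The function name is illustrative:

```python
import logging
import os

def make_worker_logger() -> logging.Logger:
    # One logger per process, identified by PID, writing to stderr.
    # Nothing here is shared across workers, so concurrent writes
    # cannot corrupt each other.
    logger = logging.getLogger(f"langchain.worker.{os.getpid()}")
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter(
                "%(asctime)s pid=%(process)d %(levelname)s %(message)s"
            )
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

log = make_worker_logger()
log.info("token streamed")
```

Your streaming callback handler can then take a logger instead of an open file, and be instantiated inside the request scope.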
How to Debug It
- Check whether the crash only happens with more than one worker.
  - Run with one worker first: `uvicorn app:app --workers 1`
  - Then try 2+ workers: `gunicorn -w 4 -k uvicorn.workers.UvicornWorker app:app`
- Look for shared globals.
  - Search for module-level instances of `ChatOpenAI`, `LLMChain`, `ConversationBufferMemory`, retrievers, vector stores, and callback handlers.
  - If they are global and mutable, assume they are suspect.
- Switch sync calls to async equivalents.
  - Replace `.invoke()` with `.ainvoke()` inside async routes.
  - Replace blocking network/file work with async-safe versions where possible.
  - If the crash disappears, you had an event loop or blocking-call issue.
- Inspect pod and worker logs for resource pressure.
  - Kubernetes: check for `OOMKilled`, readiness probe failures, and restart loops.
  - Gunicorn: check for worker timeout messages.
  - Provider logs: look for 429s and retries stacking up.
Prevention
- Keep LangChain chains stateless at the request boundary.
- Treat memory as external storage keyed by session ID.
- Instantiate request-scoped handlers and clients when they carry mutable state.
- Load heavy assets from managed services instead of process memory when scaling beyond one instance.
- Add load tests before deployment:
  - test one worker vs four workers
  - test 10 concurrent requests vs 100 concurrent requests
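A minimal load-test sketch of the last point, using only the standard library. `fake_endpoint` is a stand-in for an HTTP call to your `/chat` route; in a real test you would point a client like httpx or a tool like locust at a running server:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_endpoint(question: str) -> dict:
    # Stand-in for an HTTP request to the deployed /chat route.
    return {"answer": f"echo: {question}"}

def load_test(concurrency: int, requests: int) -> int:
    """Fire `requests` calls with `concurrency` parallel workers;
    return how many succeeded."""
    ok = 0
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [
            pool.submit(fake_endpoint, f"q{i}") for i in range(requests)
        ]
        for f in futures:
            if "answer" in f.result():
                ok += 1
    return ok

# Compare low vs high concurrency before shipping; a drop in the
# success count at higher concurrency is the signal to investigate.
low = load_test(concurrency=10, requests=10)
high = load_test(concurrency=100, requests=100)
```

The point is to make the single-worker vs multi-worker comparison a repeatable check, not a production incident.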
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.