How to Fix 'deployment crash when scaling' in LangChain (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: deployment-crash-when-scaling, langchain, python

When a LangChain deployment crashes only after you scale from one worker to many, the cause is usually simple: your chain or model client is not safe to reuse across concurrent requests, or it depends on local state that disappears once the app runs in multiple processes. This typically shows up in FastAPI, Celery, Gunicorn, Kubernetes, or serverless deployments where the same code works locally but falls apart under parallel load.

The failure mode is often noisy: RuntimeError: Event loop is closed, ValueError: I/O operation on closed file, openai.BadRequestError, or plain worker exits like Worker exited unexpectedly / OOMKilled. In LangChain terms, the root cause usually comes down to shared mutable state in LLMChain, ChatOpenAI, vector stores, callback handlers, or memory objects.

The Most Common Cause

The #1 cause is reusing a single LangChain client or chain instance with mutable state across multiple workers or requests.

This breaks most often when people create a global chain with memory, callbacks, or a streaming handler, then scale horizontally. The code looks fine in development because there is only one process and low concurrency.

Broken vs fixed

  • Broken: Global chain/client shared by all requests → Fixed: Create a per-request chain/client or use stateless components
  • Broken: Mutable ConversationBufferMemory reused globally → Fixed: Store memory per session in Redis or a database
  • Broken: Streaming callback handler reused across requests → Fixed: Instantiate handlers inside the request scope
# BROKEN
from fastapi import FastAPI
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain_core.prompts import PromptTemplate

app = FastAPI()

llm = ChatOpenAI(model="gpt-4o-mini")
memory = ConversationBufferMemory(return_messages=True)

prompt = PromptTemplate.from_template(
    "Answer the user question.\nHistory: {history}\nQuestion: {question}"
)

chain = LLMChain(
    llm=llm,
    prompt=prompt,
    memory=memory,
)

@app.post("/chat")
def chat(payload: dict):
    # Under load this can mix state between users / workers.
    return chain.invoke({"question": payload["question"]})
# FIXED
from fastapi import FastAPI
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain_core.prompts import PromptTemplate

app = FastAPI()

prompt = PromptTemplate.from_template(
    "Answer the user question.\nHistory: {history}\nQuestion: {question}"
)

def build_chain(session_id: str):
    # Fresh client, memory, and chain per request. session_id is where you
    # would key externally stored history instead of process memory.
    llm = ChatOpenAI(model="gpt-4o-mini")
    memory = ConversationBufferMemory(return_messages=True)
    return LLMChain(llm=llm, prompt=prompt, memory=memory)

@app.post("/chat")
def chat(payload: dict):
    chain = build_chain(payload["session_id"])
    return chain.invoke({"question": payload["question"]})

If you need real conversation history, do not keep it in process memory. Put it in Redis, Postgres, DynamoDB, or another external store keyed by session ID.
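One way to wire that up is the Redis chat history integration from langchain-community. The sketch below assumes a Redis instance at redis://localhost:6379/0 and is only one possible shape for session-keyed storage, not the only option.

from langchain_community.chat_message_histories import RedisChatMessageHistory
from langchain.memory import ConversationBufferMemory

def build_memory(session_id: str) -> ConversationBufferMemory:
    # History lives in Redis keyed by session_id, so any worker can serve
    # any session without holding conversation state in process memory.
    history = RedisChatMessageHistory(
        session_id=session_id,
        url="redis://localhost:6379/0",  # assumed Redis URL
    )
    return ConversationBufferMemory(chat_memory=history, return_messages=True)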

Other Possible Causes

1) Event loop mismatch in async deployments

If you call sync LangChain code inside an async app under load, you can hit errors like:

  • RuntimeError: Event loop is closed
  • RuntimeError: This event loop is already running

This usually happens when blocking .invoke() calls run inside async routes, or when a client bound to one event loop is reused after that loop has been closed or replaced.

# BAD
@app.post("/ask")
async def ask(payload: dict):
    return chain.invoke({"question": payload["question"]})
# GOOD
@app.post("/ask")
async def ask(payload: dict):
    return await chain.ainvoke({"question": payload["question"]})
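If a dependency only has a sync API, another option is to keep the route async and push the blocking call onto a worker thread. The sketch below uses FastAPI's run_in_threadpool helper and reuses the chain variable from the examples above.

from fastapi.concurrency import run_in_threadpool

@app.post("/ask-sync")
async def ask_sync(payload: dict):
    # The blocking chain.invoke runs on a worker thread, so the event loop stays free.
    return await run_in_threadpool(chain.invoke, {"question": payload["question"]})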

2) Worker count exposes hidden rate limits

Scaling often means more concurrent calls to OpenAI or another provider. Then you start seeing:

  • openai.RateLimitError
  • 429 Too Many Requests
  • retries piling up until the worker dies

Fix this by limiting concurrency and adding backoff.

from tenacity import retry, wait_exponential, stop_after_attempt

@retry(wait=wait_exponential(min=1, max=10), stop=stop_after_attempt(5))
def call_llm(question: str):
    return chain.invoke({"question": question})

Also cap request fan-out if you use agents or parallel tools.
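One simple way to cap fan-out is a shared asyncio semaphore around your async LLM calls, sketched below with an assumed limit of 5 in-flight calls per process.

import asyncio

# At most 5 in-flight LLM calls per process; the number is an assumption,
# tune it to your provider's rate limits.
llm_semaphore = asyncio.Semaphore(5)

async def call_llm_limited(question: str):
    async with llm_semaphore:
        return await chain.ainvoke({"question": question})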

3) Memory blow-up from large prompts or retrieved context

A single pod may survive locally but crash at scale because every worker loads a large embeddings index into memory or builds huge prompts. Common symptoms:

  • container restarts with OOMKilled
  • Gunicorn worker timeout
  • Killed in logs

Bad pattern:

# Loading everything into memory at startup.
vectorstore = FAISS.load_local("./big_index", embeddings)

Better pattern:

# Load the index once per process at startup, and only if it comfortably
# fits in the pod's memory limit; keep retrieval narrow so prompts stay small.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

If your index is large, move it out of process entirely.
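For example, here is a sketch of pointing LangChain at a separately hosted Chroma server instead of an in-process FAISS index; the host, port, collection name, and embedding model are assumptions, and any managed vector DB works the same way.

import chromadb
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# The index lives in the Chroma service, not in each worker's memory.
client = chromadb.HttpClient(host="chroma.internal", port=8000)  # assumed host/port
vectorstore = Chroma(
    client=client,
    collection_name="docs",  # assumed collection name
    embedding_function=OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})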

4) Callback handlers writing to shared files/stdout incorrectly

Streaming handlers and custom callbacks can crash when multiple workers write to the same file handle.

# BAD
log_file = open("/tmp/langchain.log", "a")

handler = MyStreamingHandler(log_file)

Use structured logging per process and avoid sharing open handles across requests.
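A request-scoped sketch, assuming the app and build_chain helper from the fixed example above and a hypothetical token-logging handler built on BaseCallbackHandler:

import logging
from langchain_core.callbacks import BaseCallbackHandler

logger = logging.getLogger("langchain_app")

class RequestScopedHandler(BaseCallbackHandler):
    # Hypothetical handler: logs streamed tokens for a single request
    # (on_llm_new_token only fires when streaming is enabled on the model).
    def __init__(self, request_id: str):
        self.request_id = request_id

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        logger.info("request=%s token=%s", self.request_id, token)

@app.post("/chat-stream")
def chat_stream(payload: dict):
    handler = RequestScopedHandler(payload["request_id"])  # new handler per request
    chain = build_chain(payload["session_id"])
    return chain.invoke(
        {"question": payload["question"]},
        config={"callbacks": [handler]},
    )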

How to Debug It

  1. Check whether the crash only happens with more than one worker

    • Run with one worker first: uvicorn app:app --workers 1
    • Then try 2+ workers: gunicorn -w 4 -k uvicorn.workers.UvicornWorker app:app
  2. Look for shared globals

    • Search for module-level instances of:
      • ChatOpenAI
      • LLMChain
      • ConversationBufferMemory
      • retrievers / vector stores / callback handlers
    • If they are global and mutable, assume they are suspect.
  3. Switch sync calls to async equivalents

    • Replace .invoke() with .ainvoke() inside async routes.
    • Replace blocking network/file work with async-safe versions where possible.
    • If the crash disappears, you had an event loop / blocking issue.
  4. Inspect pod and worker logs for resource pressure

    • Kubernetes:
      • check for OOMKilled, readiness probe failures, restart loops
    • Gunicorn:
      • check for worker timeout messages
    • Provider logs:
      • look for 429s and retries stacking up

Prevention

  • Keep LangChain chains stateless at the request boundary.
    • Treat memory as external storage keyed by session ID.
  • Instantiate request-scoped handlers and clients when they carry mutable state.
  • Load heavy assets from managed services instead of process memory when scaling beyond one instance.
  • Add load tests before deployment (a minimal smoke test sketch follows this list):
    • test one worker vs four workers
    • test 10 concurrent requests vs 100 concurrent requests
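
The sketch below is one way to run that comparison; the URL, payload shape, and the httpx dependency are assumptions based on the /chat route above.

import asyncio
import httpx

async def hit(client: httpx.AsyncClient, i: int) -> int:
    r = await client.post(
        "http://localhost:8000/chat",  # assumed local deployment URL
        json={"session_id": f"s{i}", "question": "ping"},
    )
    return r.status_code

async def main(concurrency: int = 100) -> None:
    async with httpx.AsyncClient(timeout=60) as client:
        codes = await asyncio.gather(*(hit(client, i) for i in range(concurrency)))
    # All 200s means the deployment held up; 429s or 5xx here reproduce the
    # scaling failure before it reaches production.
    print({code: codes.count(code) for code in set(codes)})

asyncio.run(main())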

By Cyprian Aarons, AI Consultant at Topiax.
