# How to Fix 'intermittent 500 errors when scaling' in LangChain (Python)
Intermittent 500 errors during scaling usually mean your LangChain app is failing under concurrency, not that the model itself is “randomly broken.” In Python, this shows up when you add more workers, more threads, or more concurrent requests and suddenly start seeing Internal Server Error, RuntimeError, timeouts, or upstream API failures.
In practice, the root cause is usually shared mutable state: one chain, one client, one memory object, or one callback handler being reused across requests without isolation.
## The Most Common Cause
The #1 cause is reusing a non-thread-safe LangChain object across concurrent requests.
A common anti-pattern is creating one global chain and calling it from multiple request handlers at the same time. That can work locally with one user, then fail intermittently once you scale to multiple workers or async requests.
### Broken pattern vs fixed pattern
| Broken | Fixed |
|---|---|
| Shared chain instance reused across requests | Create per-request chain or use stateless components |
| Shared mutable memory | Isolate memory per session/request |
| Unsafe callback handler reuse | Instantiate callbacks per run |
```python
# BROKEN: shared mutable chain across concurrent requests
from fastapi import FastAPI
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory

app = FastAPI()

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationBufferMemory(return_messages=True)
prompt = PromptTemplate.from_template(
    "Answer the question using chat history.\n\nHistory: {history}\nQuestion: {question}"
)
chain = LLMChain(
    llm=llm,
    prompt=prompt,
    memory=memory,
)

@app.post("/ask")
async def ask(payload: dict):
    # Under load this can race on shared memory/state
    result = await chain.ainvoke({"question": payload["question"]})
    return {"answer": result["text"]}
```
```python
# FIXED: build request-scoped state instead of sharing it globally
from fastapi import FastAPI
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory

app = FastAPI()

prompt = PromptTemplate.from_template(
    "Answer the question using chat history.\n\nHistory: {history}\nQuestion: {question}"
)

@app.post("/ask")
async def ask(payload: dict):
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    memory = ConversationBufferMemory(return_messages=True)
    chain = LLMChain(
        llm=llm,
        prompt=prompt,
        memory=memory,
    )
    result = await chain.ainvoke({"question": payload["question"]})
    return {"answer": result["text"]}
```
If you need conversation state, store it outside the chain in Redis/Postgres and load it per request. Do not share a single in-memory `ConversationBufferMemory` instance across users.
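One way to externalize that state is a small store keyed by session ID. The sketch below uses a plain dict standing in for Redis/Postgres, and `load_history`/`save_turn` are illustrative names, not LangChain APIs:

```python
# session_id -> [(role, text), ...]; a dict standing in for Redis/Postgres
store: dict = {}

def load_history(session_id: str) -> str:
    """Render this session's turns into the {history} prompt variable."""
    return "\n".join(f"{role}: {text}" for role, text in store.get(session_id, []))

def save_turn(session_id: str, role: str, text: str) -> None:
    """Persist one turn; with Redis this would be an RPUSH on a per-session key."""
    store.setdefault(session_id, []).append((role, text))
```

Each request then calls `load_history(session_id)` before invoking the chain and `save_turn(...)` after, so no mutable memory object is ever shared between users.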
## Other Possible Causes
### 1) Too much concurrency for your upstream model provider
When scaling horizontally, you may also scale request bursts into rate limits. The symptom is often a 500 in your app wrapping an upstream 429, 503, or timeout.
```python
# Example of unsafe fan-out: unbounded concurrent calls to the provider
import asyncio

results = await asyncio.gather(*[chain.ainvoke({"question": q}) for q in questions])
```
Fix it by limiting concurrency:
```python
import asyncio

sem = asyncio.Semaphore(5)  # at most 5 in-flight model calls

async def safe_call(q):
    async with sem:
        return await chain.ainvoke({"question": q})

results = await asyncio.gather(*[safe_call(q) for q in questions])
```
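If upstream 429s persist even with a semaphore, retries with exponential backoff help absorb bursts. This is a hand-rolled sketch, not a LangChain feature; in real code, catch your provider's `RateLimitError` instead of bare `Exception`:

```python
import asyncio
import random

async def call_with_backoff(make_call, retries=4, base=0.5):
    # make_call is a zero-arg callable returning a fresh coroutine per attempt
    for attempt in range(retries):
        try:
            return await make_call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the real error
            # exponential backoff with jitter to avoid a thundering herd
            await asyncio.sleep(base * (2 ** attempt) + random.uniform(0, 0.1))
```

Usage would look like `await call_with_backoff(lambda: chain.ainvoke({"question": q}))`, ideally inside the semaphore from the previous snippet.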
### 2) Reusing a streaming callback handler across requests

`StreamingStdOutCallbackHandler` and custom handlers can hold state. If one handler instance is shared globally, token streams from different users can collide.
```python
# BAD: global handler reused everywhere
from langchain_core.callbacks import StreamingStdOutCallbackHandler

handler = StreamingStdOutCallbackHandler()
llm = ChatOpenAI(callbacks=[handler])
```
Use a fresh handler per request:

```python
@app.post("/stream")
async def stream(payload: dict):
    # request-scoped handler and model, so token streams never interleave
    handler = StreamingStdOutCallbackHandler()
    llm = ChatOpenAI(streaming=True, callbacks=[handler])
    result = await llm.ainvoke(payload["question"])
    return {"answer": result.content}
```
### 3) Sync code inside async endpoints causing event loop stalls
If your LangChain call path uses blocking I/O inside async def, workers can stall under load and trigger gateway timeouts that look like random 500s.
```python
# BAD: blocking call in async route
@app.post("/ask")
async def ask(payload: dict):
    result = chain.invoke({"question": payload["question"]})  # blocks the event loop
    return result
```
Use the async API:
```python
@app.post("/ask")
async def ask(payload: dict):
    result = await chain.ainvoke({"question": payload["question"]})
    return result
```
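If a dependency only offers a sync API, offload it to a worker thread instead of blocking the loop. A minimal sketch, with `blocking_invoke` standing in for `chain.invoke` or any sync SDK call:

```python
import asyncio

def blocking_invoke(question: str) -> str:
    # stand-in for chain.invoke(...) or any blocking SDK/database call
    return f"answer to {question}"

async def ask(question: str) -> str:
    # runs the blocking call in a thread; the event loop keeps serving requests
    return await asyncio.to_thread(blocking_invoke, question)
```

FastAPI applies a similar trick automatically when you declare a route with plain `def` instead of `async def`: it runs the handler in a threadpool.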
### 4) Client/session objects not safe to share across workers
Some SDK clients are fine to reuse; others are not, especially when wrapped by custom transport logic or patched middleware. If you see errors like:
- `RuntimeError: Event loop is closed`
- `httpx.ReadTimeout`
- `openai.APIConnectionError`
- `langchain_core.exceptions.OutputParserException`
check whether the client was created once at module import time and then reused after process forks.
## How to Debug It
1. **Log the full exception chain.**
   - Don’t stop at `500 Internal Server Error`.
   - Capture the underlying exception type and message.
   - Look for `RateLimitError`, `TimeoutError`, `OutputParserException`, or `RuntimeError`.
2. **Disable concurrency and reproduce.**
   - Run with one worker.
   - Set semaphore limits to 1.
   - If the error disappears, it’s almost certainly shared state or overload.
3. **Remove memory and callbacks first.**
   - Replace `ConversationBufferMemory` with a stateless prompt.
   - Remove custom callbacks and streaming handlers.
   - If stability returns, add components back one by one.
4. **Test process model differences.**
   - Compare local dev vs Gunicorn/Uvicorn with multiple workers.
   - A bug that only appears with `--workers 4` usually points to forked client state or non-isolated globals.
A practical logging pattern:

```python
import logging

logger = logging.getLogger(__name__)

try:
    result = await chain.ainvoke({"question": question})
except Exception:
    logger.exception("LangChain request failed")
    raise
```
That gives you the real stack trace instead of a generic server error.
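To surface the wrapped upstream error (the 429 or timeout hiding behind your 500), you can also walk the exception chain explicitly. A small helper, not part of LangChain:

```python
def exception_chain(exc):
    # follow __cause__ (raise ... from ...) and implicit __context__ links
    chain = []
    while exc is not None:
        chain.append(f"{type(exc).__name__}: {exc}")
        exc = exc.__cause__ or exc.__context__
    return chain
```

Logging `exception_chain(e)` alongside the traceback makes it obvious when the "random 500" is really a rate limit or read timeout two layers down.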
## Prevention
- **Keep chains stateless where possible.**
  - Pass inputs in, get outputs out.
  - Store conversation/session data externally by user ID.
- **Instantiate request-scoped objects inside handlers.**
  - Memory, callbacks, temporary retrievers, and per-request config should not be global unless they are explicitly thread-safe.
- **Put concurrency limits around downstream calls.**
  - Use semaphores, queues, or worker pools.
  - Treat model APIs like any other rate-limited dependency.
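For heavier fan-out, a fixed worker pool bounds concurrency even more explicitly than a semaphore. A minimal asyncio sketch, where `call` is whatever async function hits the model:

```python
import asyncio

async def run_pool(questions, call, worker_count=3):
    """Drain a queue with N workers instead of an unbounded asyncio.gather.

    `call` is any async function, e.g. lambda q: chain.ainvoke({"question": q}).
    """
    queue: asyncio.Queue = asyncio.Queue()
    for q in questions:
        queue.put_nowait(q)
    results = []

    async def worker():
        # no await between empty() and get_nowait(), so this check is race-free
        while not queue.empty():
            results.append(await call(queue.get_nowait()))

    await asyncio.gather(*[worker() for _ in range(worker_count)])
    return results
```

The pool size, not the request volume, now determines peak load on the provider; results arrive in completion order, so re-sort them if ordering matters.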
If you’re seeing intermittent 500s only after scaling LangChain in Python, assume shared state first, not “random infra noise.” Fix isolation before chasing model quality or prompt changes.
## Keep learning

- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist + starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.