How to Fix 'intermittent 500 errors when scaling' in LangChain (Python)

By Cyprian Aarons · Updated 2026-04-21

Intermittent 500 errors during scaling usually mean your LangChain app is failing under concurrency, not that the model itself is “randomly broken.” In Python, this shows up when you add more workers, more threads, or more concurrent requests and suddenly start seeing Internal Server Error, RuntimeError, timeouts, or upstream API failures.

In practice, the root cause is usually shared mutable state: one chain, one client, one memory object, or one callback handler being reused across requests without isolation.

The Most Common Cause

The #1 cause is reusing a non-thread-safe LangChain object across concurrent requests.

A common anti-pattern is creating one global chain and calling it from multiple request handlers at the same time. That can work locally with one user, then fail intermittently once you scale to multiple workers or async requests.

Broken pattern vs fixed pattern

  • Broken: shared chain instance reused across requests. Fixed: create a per-request chain or use stateless components.
  • Broken: shared mutable memory. Fixed: isolate memory per session or request.
  • Broken: unsafe callback handler reuse. Fixed: instantiate callbacks per run.

# BROKEN: shared mutable chain across concurrent requests
from fastapi import FastAPI
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory

app = FastAPI()

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationBufferMemory(return_messages=True)

prompt = PromptTemplate.from_template(
    "Answer the question using chat history.\n\nHistory: {history}\nQuestion: {question}"
)

chain = LLMChain(
    llm=llm,
    prompt=prompt,
    memory=memory,
)

@app.post("/ask")
async def ask(payload: dict):
    # Under load this can race on shared memory/state
    result = await chain.ainvoke({"question": payload["question"]})
    return {"answer": result["text"]}

# FIXED: build request-scoped state instead of sharing it globally
from fastapi import FastAPI
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory

app = FastAPI()

prompt = PromptTemplate.from_template(
    "Answer the question using chat history.\n\nHistory: {history}\nQuestion: {question}"
)

@app.post("/ask")
async def ask(payload: dict):
    # Request-scoped objects: nothing mutable is shared between concurrent requests
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    memory = ConversationBufferMemory(return_messages=True)

    chain = LLMChain(
        llm=llm,
        prompt=prompt,
        memory=memory,
    )

    result = await chain.ainvoke({"question": payload["question"]})
    return {"answer": result["text"]}

If you need conversation state, store it outside the chain in Redis/Postgres and load it per request. Do not share a single in-memory ConversationBufferMemory instance across users.
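As a minimal sketch of that pattern, the handler below rebuilds memory from an external store on every request. The load_history and save_history helpers are hypothetical stand-ins for your Redis or Postgres layer; the app and prompt objects (and the imports) are the ones from the fixed example above, and the /ask route is a variant of that handler.

# Sketch: per-request memory hydrated from an external store.
# load_history/save_history are hypothetical helpers backed by Redis or Postgres.
@app.post("/ask")
async def ask(payload: dict):
    session_id = payload["session_id"]
    memory = ConversationBufferMemory(return_messages=True)

    # Rebuild this session's history in a fresh, request-scoped memory object
    for user_msg, ai_msg in await load_history(session_id):
        memory.chat_memory.add_user_message(user_msg)
        memory.chat_memory.add_ai_message(ai_msg)

    chain = LLMChain(
        llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
        prompt=prompt,
        memory=memory,
    )
    result = await chain.ainvoke({"question": payload["question"]})

    # Persist the new turn so every worker sees it on the next request
    await save_history(session_id, payload["question"], result["text"])
    return {"answer": result["text"]}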

Other Possible Causes

1) Too much concurrency for your upstream model provider

When scaling horizontally, you may also scale request bursts into rate limits. The symptom is often a 500 in your app wrapping an upstream 429, 503, or timeout.

# Example of unsafe fan-out: every question hits the provider at once
import asyncio

results = await asyncio.gather(*[chain.ainvoke({"question": q}) for q in questions])

Fix it by limiting concurrency:

# Cap in-flight upstream calls; tune the limit to your provider's quota
sem = asyncio.Semaphore(5)

async def safe_call(q):
    async with sem:
        return await chain.ainvoke({"question": q})

results = await asyncio.gather(*[safe_call(q) for q in questions])
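If you still see upstream 429s or 503s after capping fan-out, it can also help to make retries and timeouts explicit on the client. A sketch, assuming the max_retries and timeout options exposed by langchain_openai's ChatOpenAI (check the parameter names against your installed version):

# Bounded retries and timeouts keep a transient upstream failure from
# surfacing as an unexplained 500 deep inside a request handler.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    max_retries=3,   # retry transient rate-limit/connection errors a few times
    timeout=30,      # seconds; fail fast instead of hanging the worker
)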

2) Reusing a streaming callback handler across requests

StreamingStdOutCallbackHandler and custom handlers can hold state. If one handler instance is shared globally, token streams from different users can collide.

# BAD: global handler reused everywhere
from langchain_core.callbacks import StreamingStdOutCallbackHandler

handler = StreamingStdOutCallbackHandler()
llm = ChatOpenAI(callbacks=[handler])

Use a fresh handler per request:

@app.post("/stream")
async def stream(payload: dict):
    handler = StreamingStdOutCallbackHandler()
    llm = ChatOpenAI(callbacks=[handler])
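
Another option, if you prefer to keep a single shared model or chain object, is to pass callbacks at invocation time instead of storing them on the model. A sketch using the config argument that LangChain runnables accept; the chain and payload names match the earlier examples:

# Request-scoped handler passed per call; the shared chain holds no handler state
handler = StreamingStdOutCallbackHandler()
result = await chain.ainvoke(
    {"question": payload["question"]},
    config={"callbacks": [handler]},
)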

3) Sync code inside async endpoints causing event loop stalls

If your LangChain call path uses blocking I/O inside async def, workers can stall under load and trigger gateway timeouts that look like random 500s.

# BAD: blocking call in async route
@app.post("/ask")
async def ask(payload: dict):
    result = chain.invoke({"question": payload["question"]})
    return result

Use the async API:

@app.post("/ask")
async def ask(payload: dict):
    result = await chain.ainvoke({"question": payload["question"]})
    return result
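
If part of the call path has no async equivalent (a blocking retriever, a sync-only SDK), one workaround is to push that call onto a worker thread so the event loop keeps serving other requests. A minimal sketch using the standard library's asyncio.to_thread; the route name is illustrative:

import asyncio

@app.post("/ask-blocking")
async def ask_blocking(payload: dict):
    # The blocking invoke runs in a thread, so the event loop is not stalled
    result = await asyncio.to_thread(chain.invoke, {"question": payload["question"]})
    return {"answer": result["text"]}

FastAPI also runs plain def endpoints in its own threadpool, which gives the same isolation for fully synchronous handlers.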

4) Client/session objects not safe to share across workers

Some SDK clients are fine to reuse; others are not, especially when wrapped by custom transport logic or patched middleware. If you see errors like:

  • RuntimeError: Event loop is closed
  • httpx.ReadTimeout
  • openai.APIConnectionError
  • langchain_core.exceptions.OutputParserException

check whether the client was created once at module import time and then reused after process forks.
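
A common fix is to build the client lazily inside each worker process instead of at import time, so workers never inherit connections or event-loop state from the parent process. A rough sketch; the get_llm helper is illustrative, not a LangChain API:

from functools import lru_cache

from langchain_openai import ChatOpenAI

@lru_cache(maxsize=1)
def get_llm() -> ChatOpenAI:
    # Constructed on first use inside each worker process, after any fork,
    # rather than once at module import time in the parent process.
    return ChatOpenAI(model="gpt-4o-mini", temperature=0)

Handlers then call get_llm() instead of touching a module-level client.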

How to Debug It

  1. Log the full exception chain

    • Don’t stop at 500 Internal Server Error.
    • Capture the underlying exception type and message.
    • Look for RateLimitError, TimeoutError, OutputParserException, or RuntimeError.
  2. Disable concurrency and reproduce

    • Run with one worker.
    • Set semaphore limits to 1.
    • If the error disappears, it’s almost certainly shared state or overload.
  3. Remove memory and callbacks first

    • Replace ConversationBufferMemory with a stateless prompt.
    • Remove custom callbacks and streaming handlers.
    • If stability returns, add components back one by one.
  4. Test process model differences

    • Compare local dev vs Gunicorn/Uvicorn with multiple workers.
    • A bug that only appears with --workers 4 usually points to forked client state or non-isolated globals.
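
A quick way to run that comparison locally (assuming your module is named main and exposes app):

uvicorn main:app --workers 1    # baseline: shared-state bugs usually stay hidden here
uvicorn main:app --workers 4    # multiple workers: intermittent failures typically surface here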

A practical logging pattern:

import logging

logger = logging.getLogger(__name__)

# Inside the async request handler:
try:
    result = await chain.ainvoke({"question": question})
except Exception:
    logger.exception("LangChain request failed")
    raise

That gives you the real stack trace instead of a generic server error.
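
To make sure nothing bypasses that logging, you can also register a catch-all exception handler on the app, so every unhandled error records its real type before the generic 500 goes out. A sketch using FastAPI's exception_handler decorator on the app object defined earlier:

from fastapi import Request
from fastapi.responses import JSONResponse

@app.exception_handler(Exception)
async def log_unhandled_errors(request: Request, exc: Exception):
    # Record the underlying exception type and stack trace, then return a plain 500
    logger.error("Unhandled error on %s", request.url.path, exc_info=exc)
    return JSONResponse(status_code=500, content={"detail": "Internal Server Error"})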

Prevention

  • Keep chains stateless where possible.

    • Pass inputs in, get outputs out.
    • Store conversation/session data externally by user ID.
  • Instantiate request-scoped objects inside handlers.

    • Memory, callbacks, temporary retrievers, and per-request config should not be global unless they are explicitly thread-safe.
  • Put concurrency limits around downstream calls.

    • Use semaphores, queues, or worker pools.
    • Treat model APIs like any other rate-limited dependency.

If you’re seeing intermittent 500s only after scaling LangChain in Python, assume shared state first, not “random infra noise.” Fix isolation before chasing model quality or prompt changes.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

