# How to Fix 'streaming response cutoff when scaling' in LangChain (Python)
If you’re seeing streaming response cutoff when scaling in a LangChain Python app, it usually means your stream is getting interrupted before the full model output reaches the client. In practice, this shows up when you move from one local worker to multiple replicas, or when a reverse proxy, timeout, or websocket layer can’t keep the stream alive long enough.
The key thing: this is usually not a LangChain bug per se. It's almost always an infrastructure or streaming-pattern issue around `ChatOpenAI.stream()`, `astream()`, `StreamingStdOutCallbackHandler`, SSE, or websocket delivery.
## The Most Common Cause
The #1 cause is running a streaming endpoint behind multiple app workers without sticky session handling or without a transport that supports long-lived chunked responses.
A common broken pattern is starting the stream in one process and expecting the client to stay attached while load balancing moves traffic or kills the worker.
| Broken pattern | Fixed pattern |
|---|---|
| Stream from a normal HTTP handler behind multiple workers with short timeouts | Use SSE/websockets with timeout-aware infra and one request pinned to one worker |
| Let Gunicorn/Uvicorn default worker behavior handle long streams | Configure worker timeouts and proxy buffering explicitly |
### Broken code

```python
# app.py
from fastapi import FastAPI
from langchain_openai import ChatOpenAI

app = FastAPI()
llm = ChatOpenAI(model="gpt-4o-mini", streaming=True)

@app.get("/chat")
async def chat():
    chunks = []
    async for chunk in llm.astream("Explain RFC 9110 in 3 bullets"):
        chunks.append(chunk.content)
    return {"text": "".join(chunks)}
```
This looks fine, but it is not actually streaming to the client. You buffer everything server-side, and if scaling introduces latency or worker restarts, the client sees truncated output or a timeout.
### Fixed code

```python
# app.py
from fastapi import FastAPI
from sse_starlette.sse import EventSourceResponse
from langchain_openai import ChatOpenAI

app = FastAPI()
llm = ChatOpenAI(model="gpt-4o-mini", streaming=True)

@app.get("/chat")
async def chat():
    async def event_generator():
        async for chunk in llm.astream("Explain RFC 9110 in 3 bullets"):
            if chunk.content:
                yield {"event": "token", "data": chunk.content}
        yield {"event": "done", "data": "[DONE]"}
    return EventSourceResponse(event_generator())
```
That change matters because the response now stays open as a real stream instead of a buffered JSON body. If you're behind Nginx, ALB, Cloudflare, or an API gateway, you also need buffering and idle-timeout settings aligned with that behavior.
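To sanity-check the endpoint, consume the stream with a plain HTTP client and confirm tokens arrive incrementally rather than in one burst at the end. A minimal sketch, assuming `httpx` is installed and the app above is running on port 8000:

```python
# client.py: quick check that tokens arrive incrementally
import httpx

with httpx.stream("GET", "http://localhost:8000/chat", timeout=None) as response:
    for line in response.iter_lines():
        if line:  # SSE frames arrive as "event: ..." / "data: ..." lines
            print(line)
```

If the lines print steadily, streaming works end to end; if they all appear at once, something between the app and the client is buffering.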
## Other Possible Causes
### 1) Proxy buffering is swallowing chunks

Nginx often buffers upstream responses unless you disable it:

```nginx
location /chat {
    proxy_pass http://app;
    proxy_buffering off;
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
}
```
If buffering stays on, the client sees nothing until the whole response finishes; when the worker then times out, it looks to the client like the stream was cut off.
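If you can't edit the Nginx config, Nginx also honors the `X-Accel-Buffering: no` response header, which disables proxy buffering for that response only. A minimal sketch, assuming the `sse-starlette` endpoint from earlier and that extra headers are passed through `EventSourceResponse`:

```python
# Inside the /chat handler from the fixed example above:
return EventSourceResponse(
    event_generator(),
    headers={
        # Tells Nginx not to buffer this particular response.
        "X-Accel-Buffering": "no",
        "Cache-Control": "no-cache",
    },
)
```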
### 2) Worker timeout is too low

Gunicorn and Uvicorn workers can kill long-lived requests:

```bash
gunicorn app:app \
  -k uvicorn.workers.UvicornWorker \
  --workers 4 \
  --timeout 30
```

For LLM streams, `--timeout 30` is often too aggressive. Raise it:

```bash
gunicorn app:app \
  -k uvicorn.workers.UvicornWorker \
  --workers 4 \
  --timeout 180
```
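Even with generous timeouts, a long gap between tokens can still trip an idle timeout somewhere in the chain. One mitigation is a heartbeat: emit a ping event whenever no token has arrived for a while. This is a sketch of the idea; the 15-second interval and the `ping` event name are arbitrary choices, and `sse-starlette` can also send periodic pings on its own (see its `ping` option) before you roll your own:

```python
import asyncio

async def event_generator():
    stream = aiter(llm.astream("Explain RFC 9110 in 3 bullets"))
    pending = asyncio.ensure_future(anext(stream))
    while True:
        # Wait up to 15 s for the next token without cancelling it.
        done, _ = await asyncio.wait({pending}, timeout=15)
        if not done:
            # No token yet: send a heartbeat so proxies and workers
            # see traffic on the socket, then keep waiting.
            yield {"event": "ping", "data": ""}
            continue
        try:
            chunk = pending.result()
        except StopAsyncIteration:
            break
        if chunk.content:
            yield {"event": "token", "data": chunk.content}
        pending = asyncio.ensure_future(anext(stream))
    yield {"event": "done", "data": "[DONE]"}
```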
### 3) You're using sync callbacks in an async stream

Mixing `invoke()` with async endpoints can block the event loop and stall token delivery:

```python
# bad: blocks the event loop until the full completion returns
result = llm.invoke("Write a summary")

# better: yields control back to the loop between tokens
async for chunk in llm.astream("Write a summary"):
    ...
```

If you are using `StreamingStdOutCallbackHandler` inside an async server, make sure it isn't doing blocking I/O on every token.
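If you need per-token side effects in an async server, an async callback handler keeps the event loop unblocked. A minimal sketch using LangChain's `AsyncCallbackHandler`; the handler body is illustrative:

```python
from langchain_core.callbacks import AsyncCallbackHandler
from langchain_openai import ChatOpenAI

class TokenLogger(AsyncCallbackHandler):
    """Handles each token without blocking the event loop."""

    async def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Keep this fast and non-blocking; offload anything slow
        # (disk, network) via asyncio.to_thread or a queue instead.
        print(token, end="", flush=True)

llm = ChatOpenAI(
    model="gpt-4o-mini",
    streaming=True,
    callbacks=[TokenLogger()],
)
```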
### 4) The model call is fine, but your client disconnects early
Browsers, mobile clients, and frontend fetch wrappers can cancel streams on navigation or retry logic.
```javascript
const controller = new AbortController();
// Any retry wrapper or route change that calls controller.abort()
// kills the stream on the client side.
fetch("/chat", { signal: controller.signal });
```
If your frontend aborts after a few seconds, LangChain will look like it “cut off” even though the server was still producing tokens.
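You can see this from the server side by checking for the disconnect instead of streaming blindly. Starlette exposes `request.is_disconnected()`; a minimal sketch extending the fixed `/chat` endpoint from earlier:

```python
from fastapi import FastAPI, Request
from sse_starlette.sse import EventSourceResponse
from langchain_openai import ChatOpenAI

app = FastAPI()
llm = ChatOpenAI(model="gpt-4o-mini", streaming=True)

@app.get("/chat")
async def chat(request: Request):
    async def event_generator():
        async for chunk in llm.astream("Explain RFC 9110 in 3 bullets"):
            # Stop producing once the client has gone away, and log it
            # so "cutoff" reports can be attributed to the client side.
            if await request.is_disconnected():
                print("client disconnected mid-stream")
                return
            if chunk.content:
                yield {"event": "token", "data": chunk.content}
        yield {"event": "done", "data": "[DONE]"}
    return EventSourceResponse(event_generator())
```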
## How to Debug It

- **Check whether LangChain is actually streaming.**
  - Add logging around token emission (see the sketch after this list).
  - If you use callbacks like `StreamingStdOutCallbackHandler`, verify tokens are arriving continuously.
  - If you only see output at the end, your code is buffering.
- **Bypass your proxy/load balancer.**
  - Hit Uvicorn directly on localhost.
  - If the problem disappears, your issue is Nginx/ALB/API Gateway buffering or timeout settings.
- **Inspect server termination logs.**
  - Look for messages like `Worker timeout (pid: ...)`, `ClientDisconnect`, `BrokenPipeError`, or `asyncio.CancelledError`.
  - These usually tell you whether the server died or the client bailed first.
- **Reduce moving parts.**
  - Test with one worker: `uvicorn app:app --workers 1 --timeout-keep-alive 75`
  - Then reintroduce scaling.
  - If it only fails at more than one worker, suspect load-balancing affinity or shared state in your stream handler.
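For the first check, timestamped logging around token emission makes buffering obvious: real streaming prints steadily, while a buffered response dumps everything at the end. A minimal sketch; the logger name and wrapper function are illustrative:

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("stream-debug")

async def debug_stream(prompt: str):
    """Re-yield tokens while logging when each one arrived."""
    start = time.monotonic()
    async for chunk in llm.astream(prompt):
        if chunk.content:
            # If these timestamps all cluster at the very end,
            # something upstream is buffering the stream.
            log.info("%6.2fs %r", time.monotonic() - start, chunk.content)
            yield chunk.content
```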
## Prevention

- Use real streaming transport:
  - SSE for browser clients
  - WebSockets for bidirectional interactions
- Set infra timeouts explicitly:
  - proxy read/send timeouts
  - worker timeouts
  - keep-alive settings
- Keep streaming handlers stateless (see the sketch after this list):
  - no shared mutable buffers across requests
  - no global token accumulators unless they're keyed per request
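To make the last point concrete, here is a hypothetical before/after. The names are illustrative, not a LangChain API; the point is that the buffer's lifetime matches the request:

```python
# bad: one module-level buffer shared by every in-flight request;
# under concurrent requests, tokens interleave or go missing
shared_buffer: list[str] = []

# better: each request owns its buffer
async def handle_chat(prompt: str) -> str:
    buffer: list[str] = []  # lives and dies with this request
    async for chunk in llm.astream(prompt):
        if chunk.content:
            buffer.append(chunk.content)
    return "".join(buffer)
```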
If you want this to survive scaling, treat LangChain as only one layer of the pipeline. The cutoff usually happens in ASGI workers, proxies, or clients, not in `ChatOpenAI` itself.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies

By Cyprian Aarons, AI Consultant at Topiax.