How to Fix 'streaming response cutoff when scaling' in LangChain (TypeScript)
When you hit a streaming response cutoff when scaling, it usually means your stream started fine and then got terminated before the model finished emitting tokens. In LangChain TypeScript, this shows up most often when you move from a single-process dev setup to multiple workers, serverless, or a load balancer that doesn’t preserve long-lived connections.
The key point: this is usually not a LangChain “bug” in the model layer. It’s almost always an infrastructure or streaming lifecycle problem around Runnable.stream(), ChatOpenAI.stream(), SSE, or WebSocket handling.
The Most Common Cause
The #1 cause is that your streaming request is being routed through an execution environment that can’t keep the connection alive long enough, or it’s being buffered by a proxy/load balancer. In practice, this happens with:
- serverless functions with short execution windows
- reverse proxies buffering chunked responses
- multiple app instances without sticky sessions
- request handlers that return before the stream is fully consumed
Here’s the broken pattern I see most often.
| Broken | Fixed |
|---|---|
| Stream is started inside a request handler, but the response lifecycle is not tied to the stream | Keep the HTTP connection open until the stream completes |
| Uses res.json() after starting streaming | Uses SSE/chunked transfer and writes each token as it arrives |
| Works locally, fails behind Nginx / ALB / Vercel | Configures proxy buffering off and uses a runtime that supports long-lived connections |
// Broken: starts streaming but doesn't properly hold the HTTP response open.
import { ChatOpenAI } from "@langchain/openai";

export async function POST(req: Request) {
  const llm = new ChatOpenAI({
    model: "gpt-4o-mini",
    streaming: true,
  });

  const stream = await llm.stream("Write a short summary of risk controls.");

  // This looks like it should work, but in many deployments
  // the response gets cut off because nothing is actually piping
  // tokens to the client.
  return new Response(JSON.stringify({ ok: true }));
}
// Fixed: use SSE-style streaming and keep the connection open.
import { ChatOpenAI } from "@langchain/openai";

export async function POST(req: Request) {
  const llm = new ChatOpenAI({
    model: "gpt-4o-mini",
    streaming: true,
  });

  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      try {
        const stream = await llm.stream("Write a short summary of risk controls.");
        for await (const chunk of stream) {
          controller.enqueue(
            encoder.encode(`data: ${chunk.content ?? ""}\n\n`)
          );
        }
        controller.close();
      } catch (err) {
        controller.error(err);
      }
    },
  });

  return new Response(readable, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache, no-transform",
      Connection: "keep-alive",
    },
  });
}
If you’re behind Nginx, also disable buffering:
location /api/stream {
    proxy_buffering off;
    proxy_cache off;
    chunked_transfer_encoding on;
}
Other Possible Causes
1. Your runtime kills long requests
This is common in serverless environments where the platform enforces execution limits.
// Example symptom:
// The request works for small outputs, then gets cut off at ~10-30 seconds.
export const runtime = "edge"; // or serverless with tight timeout
Fix by moving to a Node runtime with longer request support:
export const runtime = "nodejs";
export const maxDuration = 60;
2. You are creating multiple streams per request
If you call .stream() more than once and only consume one of them, the other can be aborted or dropped.
// Broken
const s1 = await chain.stream(input);
const s2 = await chain.stream(input); // accidental duplicate
for await (const chunk of s1) {
  console.log(chunk);
}
Use one stream per request path:
// Fixed
const stream = await chain.stream(input);
for await (const chunk of stream) {
  console.log(chunk);
}
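If you need the full output as well (to persist or log it), don’t open a second stream; accumulate while you forward the one stream you have. A minimal sketch, reusing the controller and encoder from the SSE handler above and allowing for chains that emit either string chunks or message chunks:
// Forward each token to the client and accumulate the full text,
// all from a single stream. No duplicate model call needed.
let fullText = "";
const stream = await chain.stream(input);
for await (const chunk of stream) {
  const token = typeof chunk === "string" ? chunk : String(chunk.content ?? "");
  fullText += token;
  controller.enqueue(encoder.encode(`data: ${token}\n\n`));
}
// fullText now holds the complete response for logging or persistence.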
3. Proxy or CDN buffering is enabled
Cloudflare, Nginx, and some ALBs buffer responses unless explicitly configured otherwise. Send these response headers on the streaming route:
Cache-Control: no-cache, no-transform
Content-Type: text/event-stream
Connection: keep-alive
X-Accel-Buffering: no
If X-Accel-Buffering is missing behind Nginx, chunks may appear only at the end or get dropped under load.
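In the Web Response pattern from the fixed example above, that means sending the extra header alongside the others (X-Accel-Buffering is an Nginx-specific hint; other proxies simply ignore it):
// Same SSE response as before, with the Nginx no-buffering hint added.
return new Response(readable, {
  headers: {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache, no-transform",
    Connection: "keep-alive",
    "X-Accel-Buffering": "no",
  },
});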
4. You are not handling backpressure correctly
If you enqueue too fast into a slow client connection, some runtimes will terminate the response.
// Broken pattern: firehose writes without respecting downstream speed.
controller.enqueue(encoder.encode(token));
For high-volume streams, batch small chunks:
let buffer = "";
for await (const chunk of stream) {
  buffer += chunk.content ?? "";
  if (buffer.length > 32) {
    controller.enqueue(encoder.encode(`data: ${buffer}\n\n`));
    buffer = "";
  }
}
// Flush whatever is still buffered when the stream ends,
// otherwise the final tokens get silently dropped.
if (buffer.length > 0) {
  controller.enqueue(encoder.encode(`data: ${buffer}\n\n`));
}
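If batching alone isn’t enough, the Web Streams controller also exposes desiredSize, which tells you whether the consumer has drained the internal queue. A rough sketch of pausing on a full queue (the 10 ms poll and the default queuing strategy are arbitrary choices here, not a LangChain feature):
// Pause writes while the ReadableStream's internal queue is still full.
// desiredSize <= 0 means earlier chunks have not been read yet.
for await (const chunk of stream) {
  while ((controller.desiredSize ?? 1) <= 0) {
    await new Promise((resolve) => setTimeout(resolve, 10));
  }
  controller.enqueue(encoder.encode(`data: ${chunk.content ?? ""}\n\n`));
}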
How to Debug It
- Confirm where the cutoff happens
  - Log before and after every major step: request received, model stream started, first token received, last token sent.
  - If you never log "first token received", the problem is upstream in the model or network call.
  - If you log tokens locally but not in production, it’s infrastructure.
- Check for platform timeouts
  - Look at your deployment config: Vercel maxDuration, AWS Lambda timeout, Cloud Run request timeout, API Gateway idle timeout.
  - If output stops at a consistent time boundary, this is likely your answer.
- Bypass proxies temporarily
  - Hit the app directly on localhost or on a raw instance.
  - If it works direct but fails through Nginx/ALB/CDN, buffering is your issue.
  - Add these headers and retest: X-Accel-Buffering: no and Cache-Control: no-cache, no-transform.
- Turn off framework abstractions
  - Test with plain Web Streams before adding LangChain wrappers.
  - Then test ChatOpenAI.stream() on its own (see the sketch after this list).
  - Then test your RunnableSequence/RunnableLambda pipeline.
  - This isolates whether the cutoff comes from LangChain composition or transport.
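A minimal isolation sketch for the ChatOpenAI.stream() step, run as a standalone script with no HTTP layer (assumes OPENAI_API_KEY is set and an ESM context where top-level await works):
// Stream straight from the model to stdout: no HTTP handler, no proxy, no framework.
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });
const stream = await llm.stream("Count from 1 to 20.");
for await (const chunk of stream) {
  process.stdout.write(String(chunk.content ?? ""));
}
If this completes cleanly but your deployed endpoint still cuts off, the problem is in transport, not the model call.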
Prevention
- Use Node runtimes for production streaming unless you have verified edge support end to end.
- Treat streaming as transport code first:
  - set SSE headers explicitly
  - disable proxy buffering
  - keep one request tied to one active stream
- Add an integration test that asserts partial output arrives before completion (a fuller sketch follows the assertions below):
expect(receivedChunks.length).toBeGreaterThan(0);
expect(fullText).toContain("risk");
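A sketch of how those two assertions can be wired up, assuming the route above is reachable at a hypothetical http://localhost:3000/api/stream during the test and a Jest/Vitest-style runner with global it/expect:
it("streams partial output before completion", async () => {
  const res = await fetch("http://localhost:3000/api/stream", { method: "POST" });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();

  // Collect chunks as they arrive so we can assert on partial delivery.
  const receivedChunks: string[] = [];
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    receivedChunks.push(decoder.decode(value, { stream: true }));
  }
  const fullText = receivedChunks.join("");

  expect(receivedChunks.length).toBeGreaterThan(0);
  expect(fullText).toContain("risk");
});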
If the cutoffs only appear after traffic increases, don’t start by digging into LangChain internals. In almost every case I’ve debugged, the fix was in connection handling, runtime limits, or proxy behavior, not in ChatOpenAI, RunnableSequence, or token generation itself.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.