How to Fix 'streaming response cutoff when scaling' in LangChain (TypeScript)

By Cyprian Aarons · Updated 2026-04-21

When you see streaming response cutoff when scaling, it usually means your stream started fine, then got terminated before the model finished emitting tokens. In LangChain TypeScript, this shows up most often when you move from a single-process dev setup to multiple workers, serverless, or a load balancer that doesn’t preserve long-lived connections.

The key point: this is usually not a LangChain “bug” in the model layer. It’s almost always an infrastructure or streaming lifecycle problem around Runnable.stream(), ChatOpenAI.stream(), SSE, or WebSocket handling.

The Most Common Cause

The #1 cause is that your streaming request is being routed through an execution environment that can’t keep the connection alive long enough, or it’s being buffered by a proxy/load balancer. In practice, this happens with:

  • serverless functions with short execution windows
  • reverse proxies buffering chunked responses
  • multiple app instances without sticky sessions
  • request handlers that return before the stream is fully consumed

Here’s the broken pattern I see most often.

Broken: Stream is started inside a request handler, but the response lifecycle is not tied to the stream.
Fixed: Keep the HTTP connection open until the stream completes.

Broken: Uses res.json() after starting streaming.
Fixed: Uses SSE/chunked transfer and writes each token as it arrives.

Broken: Works locally, fails behind Nginx / ALB / Vercel.
Fixed: Configures proxy buffering off and uses a runtime that supports long-lived connections.
// Broken: starts streaming but doesn't properly hold the HTTP response open.
import { ChatOpenAI } from "@langchain/openai";

export async function POST(req: Request) {
  const llm = new ChatOpenAI({
    model: "gpt-4o-mini",
    streaming: true,
  });

  const stream = await llm.stream("Write a short summary of risk controls.");

  // Broken: the stream is created but never consumed or piped to
  // the client. The handler returns immediately, so the client only
  // ever sees this JSON body and the model stream is abandoned.
  return new Response(JSON.stringify({ ok: true }));
}
// Fixed: use SSE-style streaming and keep the connection open.
import { ChatOpenAI } from "@langchain/openai";

export async function POST(req: Request) {
  const llm = new ChatOpenAI({
    model: "gpt-4o-mini",
    streaming: true,
  });

  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      try {
        const stream = await llm.stream("Write a short summary of risk controls.");

        for await (const chunk of stream) {
          // Note: raw newlines inside a token would break SSE framing;
          // JSON-encode chunk.content if your model output is multi-line.
          controller.enqueue(
            encoder.encode(`data: ${chunk.content ?? ""}\n\n`)
          );
        }

        controller.close();
      } catch (err) {
        controller.error(err);
      }
    },
  });

  return new Response(readable, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache, no-transform",
      Connection: "keep-alive",
    },
  });
}

If you’re behind Nginx, also disable buffering:

location /api/stream {
  proxy_buffering off;
  proxy_cache off;
  chunked_transfer_encoding on;
}

Other Possible Causes

1. Your runtime kills long requests

This is common in serverless environments where the platform enforces execution limits.

// Example symptom:
// The request works for small outputs, then gets cut off at ~10-30 seconds.
export const runtime = "edge"; // or serverless with tight timeout

Fix by moving to a Node runtime with longer request support:

export const runtime = "nodejs";
export const maxDuration = 60;

2. You are creating multiple streams per request

If you call .stream() more than once and only consume one of them, the other can be aborted or dropped.

// Broken
const s1 = await chain.stream(input);
const s2 = await chain.stream(input); // accidental duplicate

for await (const chunk of s1) {
  console.log(chunk);
}

Use one stream per request path:

// Fixed
const stream = await chain.stream(input);
for await (const chunk of stream) {
  console.log(chunk);
}
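If you genuinely need the tokens in two places (say, the HTTP response and a logger), consume the stream once and fan it out rather than calling .stream() twice. A minimal sketch over a generic async iterable; fanOut is a hypothetical helper, not a LangChain API:

```typescript
// Hypothetical fanOut helper: one upstream consumer with a side
// channel for each chunk, instead of two competing .stream() calls.
async function* fanOut<T>(
  src: AsyncIterable<T>,
  sink: (chunk: T) => void
): AsyncGenerator<T> {
  for await (const chunk of src) {
    sink(chunk); // side channel: logging, metrics, token counting
    yield chunk; // primary consumer: the HTTP response
  }
}
```

Usage would look like `for await (const chunk of fanOut(await chain.stream(input), logToken)) { ... }`, so the model stream is still drained exactly once.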

3. Proxy or CDN buffering is enabled

Cloudflare, Nginx, and some ALBs buffer responses unless explicitly configured otherwise.

Cache-Control: no-cache, no-transform
Content-Type: text/event-stream
Connection: keep-alive
X-Accel-Buffering: no

Without X-Accel-Buffering: no, Nginx buffers proxied responses by default, so chunks may appear only at the end or the connection may be dropped under load.
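A small helper (hypothetical, not part of LangChain) keeps this header set consistent across every streaming route, so no route silently ships without the anti-buffering headers:

```typescript
// Hypothetical sseHeaders helper: one place that owns the full set of
// headers a streaming SSE response needs behind proxies.
function sseHeaders(): Record<string, string> {
  return {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache, no-transform",
    Connection: "keep-alive",
    // Tells Nginx (and compatible proxies) not to buffer this response.
    "X-Accel-Buffering": "no",
  };
}
```

Then the fixed handler above becomes `return new Response(readable, { headers: sseHeaders() });`.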

4. You are not handling backpressure correctly

If you enqueue too fast into a slow client connection, some runtimes will terminate the response.

// Broken pattern: firehose writes without respecting downstream speed.
controller.enqueue(encoder.encode(token));

For high-volume streams, batch small chunks:

let buffer = "";
for await (const chunk of stream) {
  buffer += chunk.content ?? "";
  if (buffer.length > 32) {
    controller.enqueue(encoder.encode(`data: ${buffer}\n\n`));
    buffer = "";
  }
}
// Flush whatever is left, or the tail of the response is silently lost.
if (buffer.length > 0) {
  controller.enqueue(encoder.encode(`data: ${buffer}\n\n`));
}
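Batching caps write frequency, but the more structural fix is a pull-based ReadableStream: implement pull() instead of pushing everything from start(), and the runtime only asks for the next token when the queue has room, so a slow client throttles consumption automatically. A minimal sketch, using a mock async generator in place of llm.stream() so it is self-contained:

```typescript
// Pull-based SSE stream: pull() runs only when the internal queue has
// capacity, so backpressure propagates to the token source for free.
// mockTokens() stands in for llm.stream(); swap in the real LangChain
// stream's async iterator in production.
async function* mockTokens(): AsyncGenerator<string> {
  for (const t of ["first ", "second ", "third"]) yield t;
}

function sseFromTokens(
  tokens: AsyncIterator<string>
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream<Uint8Array>({
    async pull(controller) {
      const { value, done } = await tokens.next();
      if (done) {
        controller.close();
        return;
      }
      controller.enqueue(encoder.encode(`data: ${value}\n\n`));
    },
  });
}
```

With the real stream, this would be `new Response(sseFromTokens(stream[Symbol.asyncIterator]()), { headers: ... })` using the same SSE headers as the fixed example above.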

How to Debug It

  1. Confirm where the cutoff happens

    • Log before and after every major step:
      • request received
      • model stream started
      • first token received
      • last token sent
    • If you never log first token received, the problem is upstream (model or network).
    • If tokens stream locally but not in production, it’s infrastructure.
  2. Check for platform timeouts

    • Look at your deployment config:
      • Vercel maxDuration
      • AWS Lambda timeout
      • Cloud Run request timeout
      • API Gateway idle timeout
    • If output stops at a consistent time boundary, this is likely your answer.
  3. Bypass proxies temporarily

    • Hit the app directly on localhost or on a raw instance.
    • If it works direct but fails through Nginx/ALB/CDN, buffering is your issue.
    • Add these headers and retest:
      X-Accel-Buffering: no
      Cache-Control: no-cache, no-transform
      
  4. Turn off framework abstractions

    • Test with plain Web Streams before adding LangChain wrappers.
    • Then test ChatOpenAI.stream().
    • Then test your RunnableSequence / RunnableLambda pipeline.
    • This isolates whether the cutoff comes from LangChain composition or transport.
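To make step 1 concrete, here is a minimal milestone tracer (a hypothetical helper, not a LangChain API) that stamps elapsed time on each event, so a consistent cutoff boundary jumps out of the logs:

```typescript
// Hypothetical tracer: stamps elapsed milliseconds on each milestone
// so logs show exactly where in the stream lifecycle the cutoff lands.
function makeTracer(label: string): (event: string) => string {
  const t0 = Date.now();
  return (event: string): string => {
    const line = `[${label}] +${Date.now() - t0}ms ${event}`;
    console.log(line);
    return line;
  };
}
```

Call `trace("request received")`, `trace("first token received")`, `trace("last token sent")` at each milestone and compare the elapsed times against your platform's timeout settings.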

Prevention

  • Use Node runtimes for production streaming unless you have verified edge support end to end.
  • Treat streaming as transport code first:
    • set SSE headers explicitly
    • disable proxy buffering
    • keep one request tied to one active stream
  • Add an integration test that asserts partial output arrives before completion:
expect(receivedChunks.length).toBeGreaterThan(0);
expect(fullText).toContain("risk");
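A self-contained version of that check, run here against a mock SSE body so it needs no model call (collectChunks is a hypothetical helper; in CI, point it at the body of a fetch against your real endpoint):

```typescript
// Hypothetical collectChunks helper: drains a streaming body and
// records each chunk separately, so a test can assert the response
// arrived incrementally rather than as one final buffered blob.
async function collectChunks(
  body: ReadableStream<Uint8Array>
): Promise<string[]> {
  const decoder = new TextDecoder();
  const chunks: string[] = [];
  const reader = body.getReader();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    chunks.push(decoder.decode(value, { stream: true }));
  }
  return chunks;
}
```

Asserting `chunks.length` is greater than 1 is the part that catches buffering regressions: a proxy that coalesces the stream still delivers the full text, but as a single chunk.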

If you’re seeing streaming response cutoff when scaling only after traffic increases, stop looking at LangChain internals first. In almost every case I’ve debugged, the fix was in connection handling, runtime limits, or proxy behavior — not in ChatOpenAI, RunnableSequence, or token generation itself.



By Cyprian Aarons, AI Consultant at Topiax.
