# How to Fix 'streaming response cutoff in production' in LlamaIndex (TypeScript)
When you see a streaming response cutoff in production with LlamaIndex TypeScript, it usually means the stream started successfully but was terminated before the full response was consumed. In practice, this shows up when your serverless function, reverse proxy, client, or Node runtime closes the connection early.
This is almost always a transport or lifecycle problem, not an LLM problem. The model started generating; something in your app stopped reading or stopped forwarding the stream.
## The Most Common Cause
The #1 cause is returning a streaming response from a route handler, then letting the runtime or framework close the request before the stream is fully drained.
With LlamaIndex TS, this often happens when using agent.chat() or queryEngine.query() with streaming enabled, but not properly piping the AsyncIterable/response stream to the HTTP response.
| Broken pattern | Fixed pattern |
|---|---|
| Start streaming and forget to keep the connection open | Pipe every chunk to the client until completion |
```ts
// ❌ Broken: stream is created but not fully forwarded
import { OpenAI } from "llamaindex";
import { createServer } from "http";

const llm = new OpenAI({ model: "gpt-4o-mini" });

createServer(async (req, res) => {
  if (req.url !== "/chat") {
    res.statusCode = 404;
    return res.end();
  }

  const response = await llm.complete({
    prompt: "Write a long explanation of insurance claims handling",
    stream: true,
  });

  // Mistake: reads only the first chunk and never drains the rest of the stream
  const { value: firstChunk } = await response[Symbol.asyncIterator]().next();

  res.setHeader("Content-Type", "text/plain");
  res.end(firstChunk?.delta ?? "");
}).listen(3000);
```
```ts
// ✅ Fixed: drain and forward the whole stream
import { OpenAI } from "llamaindex";
import { createServer } from "http";

const llm = new OpenAI({ model: "gpt-4o-mini" });

createServer(async (req, res) => {
  if (req.url !== "/chat") {
    res.statusCode = 404;
    return res.end();
  }

  const response = await llm.complete({
    prompt: "Write a long explanation of insurance claims handling",
    stream: true,
  });

  res.writeHead(200, {
    "Content-Type": "text/plain; charset=utf-8",
    "Transfer-Encoding": "chunked",
  });

  // Forward every chunk; only end the response after the iterator completes
  for await (const chunk of response) {
    res.write(chunk.delta ?? "");
  }
  res.end();
}).listen(3000);
```
If you are using AgentRunner, ReActAgent, or QueryEngine, the same rule applies: do not just trigger streaming. Consume it all the way through.
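Whatever class produces the stream, the draining pattern is identical, so it can help to centralize it. Here is a minimal sketch, assuming your chunks expose a `delta` string as in the examples above (`forwardStream` is a helper name introduced here, not a LlamaIndex API):

```ts
import type { ServerResponse } from "http";

// Hypothetical helper: drain any chunk stream into an HTTP response, and only
// close the connection once the iterable has been fully consumed.
async function forwardStream<T extends { delta?: string }>(
  stream: AsyncIterable<T>,
  res: ServerResponse
): Promise<void> {
  for await (const chunk of stream) {
    res.write(chunk.delta ?? "");
  }
  res.end();
}
```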
## Other Possible Causes
### 1) Serverless timeout kills the request
On Vercel, AWS Lambda, Cloudflare Workers, or similar platforms, your function may time out before streaming completes.
```ts
export const maxDuration = 10; // Vercel example
// If your model takes longer than this, you'll see truncated output.
```
Fix by increasing timeout limits where possible, reducing token output, or switching to a longer-lived runtime.
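On platforms that support streaming responses, returning a web `Response` backed by a `ReadableStream` generally keeps the function alive until the stream closes. A sketch under those assumptions, using Next.js App Router conventions (the `maxDuration` value and prompt are illustrative):

```ts
import { OpenAI } from "llamaindex";

const llm = new OpenAI({ model: "gpt-4o-mini" });

// Vercel reads this from the route module; it must exceed your longest generation
export const maxDuration = 60;

export async function POST(): Promise<Response> {
  const stream = await llm.complete({
    prompt: "Write a long explanation of insurance claims handling",
    stream: true,
  });

  // Bridge the AsyncIterable into a web ReadableStream the platform can flush
  const encoder = new TextEncoder();
  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      for await (const chunk of stream) {
        controller.enqueue(encoder.encode(chunk.delta ?? ""));
      }
      controller.close();
    },
  });

  return new Response(body, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```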
### 2) Reverse proxy buffering or idle timeout
Nginx, ALB, API Gateway, and some corporate proxies buffer responses or close idle connections.
```nginx
location /api/ {
    proxy_buffering off;
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
}
```
If buffering is on, chunks may not reach the client until too late. If idle timeout is too low, long generations get cut off mid-stream.
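If you cannot change the Nginx config itself, Nginx also honors a per-response `X-Accel-Buffering` header, so you can opt out of buffering from the application side:

```ts
// Disable proxy buffering for this response only (honored by Nginx)
res.writeHead(200, {
  "Content-Type": "text/plain; charset=utf-8",
  "X-Accel-Buffering": "no",
  "Cache-Control": "no-cache",
});
```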
### 3) Client disconnects early

The browser tab closes, a React component unmounts, or your frontend aborts the fetch with an `AbortController`.
```ts
const controller = new AbortController();

fetch("/api/chat", {
  method: "POST",
  signal: controller.signal,
});

// later...
controller.abort(); // cuts off streaming immediately
```
If you see logs like `AbortError: The operation was aborted`, this is likely your issue.
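On the server side you can make these disconnects visible instead of silent. A minimal sketch, where the `fakeStream` generator is a stand-in for any LlamaIndex chunk stream:

```ts
import { createServer } from "http";

// Stand-in for a real LlamaIndex stream: yields a chunk every 100 ms
async function* fakeStream(): AsyncGenerator<{ delta: string }> {
  for (let i = 0; i < 100; i++) {
    await new Promise((resolve) => setTimeout(resolve, 100));
    yield { delta: `chunk ${i} ` };
  }
}

createServer(async (req, res) => {
  let clientGone = false;
  res.on("close", () => {
    if (!res.writableEnded) {
      clientGone = true;
      console.warn("client disconnected before the stream completed");
    }
  });

  res.writeHead(200, { "Content-Type": "text/plain; charset=utf-8" });
  for await (const chunk of fakeStream()) {
    if (clientGone) break; // stop generating output nobody will read
    res.write(chunk.delta);
  }
  if (!clientGone) res.end();
}).listen(3000);
```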
### 4) Misusing LlamaIndex stream APIs
A common mistake is mixing non-streaming and streaming methods. For example, calling a method that returns a final string while expecting token-by-token delivery.
```ts
// ❌ Wrong expectation
const result = await queryEngine.query({
  query: "Summarize policy exclusions",
});
console.log(result.response); // no live stream here
```
Use the streaming variant exposed by your specific class:
```ts
// ✅ Use the streaming path for live tokens
const result = await queryEngine.query({
  query: "Summarize policy exclusions",
  stream: true,
});

for await (const chunk of result) {
  process.stdout.write(chunk.delta ?? "");
}
```
Depending on version and class names in your project, this may be exposed as streamChat, streamComplete, or a response object with an async iterator.
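Because these names vary across versions, a cheap runtime guard can catch the mix-up early (`isAsyncIterable` is a hypothetical helper, not part of LlamaIndex):

```ts
// Confirm a value is actually async-iterable before assuming token-by-token delivery
function isAsyncIterable(value: unknown): value is AsyncIterable<unknown> {
  return typeof value === "object" && value !== null && Symbol.asyncIterator in value;
}

// Usage with the `result` from the previous example:
// if (!isAsyncIterable(result)) {
//   throw new Error("Expected a streaming response, got a final object");
// }
```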
## How to Debug It
- Check whether the stream is actually being drained.
  - Add logs before and after each `for await` loop.
  - If you only see the first chunk, your code is stopping early.
- Differentiate app errors from transport errors.
  - App-side issues look like exceptions in your handler.
  - Transport-side issues often show up as:
    - `AbortError`
    - `ERR_STREAM_PREMATURE_CLOSE`
    - `socket hang up`
    - truncated output with no LlamaIndex exception
- Test outside your framework.
  - Run the same LlamaIndex code in a plain Node script (see the standalone sketch after this list).
  - If it works there but fails behind Next.js/Vercel/Nginx, you have an infrastructure problem.
- Reduce variables.
  - Disable proxies.
  - Remove frontend abort logic.
  - Lower output size.
  - Switch to a short prompt and verify full delivery end-to-end.
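That standalone check might look like this (run it with something like `npx tsx check-stream.ts`; the model and prompt are the ones from the earlier examples):

```ts
import { OpenAI } from "llamaindex";

const llm = new OpenAI({ model: "gpt-4o-mini" });

const stream = await llm.complete({
  prompt: "Write a long explanation of insurance claims handling",
  stream: true,
});

// If the full text prints here but not behind your framework,
// the problem is infrastructure, not LlamaIndex.
let received = "";
for await (const chunk of stream) {
  received += chunk.delta ?? "";
  process.stdout.write(chunk.delta ?? "");
}
console.log(`\n--- done: ${received.length} characters received ---`);
```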
## Prevention
- Always treat LlamaIndex streaming as a full lifecycle concern:
  - start the stream
  - forward every chunk
  - keep the connection open until completion
- Set explicit platform timeouts and document them next to your route handlers:
  - Lambda timeout
  - Vercel `maxDuration`
  - Nginx proxy timeouts
- Add integration tests that assert the full streamed output arrives (a fuller sketch follows below):

```ts
expect(receivedText).toContain("final sentence");
expect(receivedText.length).toBeGreaterThan(1000);
```
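A fuller version of such a test might look like this (assuming the `/chat` server from the fixed example is running locally, and Vitest as the test runner):

```ts
import { expect, test } from "vitest";

test(
  "streams the full response end-to-end",
  async () => {
    const res = await fetch("http://localhost:3000/chat");
    const decoder = new TextDecoder();

    let receivedText = "";
    for await (const chunk of res.body!) {
      receivedText += decoder.decode(chunk as Uint8Array, { stream: true });
    }
    receivedText += decoder.decode(); // flush any trailing multi-byte sequence

    expect(receivedText.length).toBeGreaterThan(1000);
  },
  120_000 // generous timeout: a long generation is the whole point
);
```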
If you build agents for production systems like banking or insurance workflows, this is not a cosmetic bug. A cut-off stream can mean a missing policy clause summary, incomplete claim reasoning, or partial tool output that looks valid until someone audits it.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.