How to Fix 'streaming response cutoff' in LlamaIndex (TypeScript)
What the error means
streaming response cutoff usually means LlamaIndex started a streamed LLM response, but the stream ended before the library got the full text it expected. In TypeScript, this often shows up when you use streaming: true with a provider wrapper that does not fully support token streaming, or when your code consumes the stream incorrectly and closes it early.
You’ll typically hit this during `chat()`, `complete()`, or an agent run using classes like `OpenAI`, `Anthropic`, `AzureOpenAI`, `ReActAgent`, or `QueryEngine` with streaming enabled.
The Most Common Cause
The #1 cause is mismatched streaming handling: you enable streaming, but then treat the result like a normal completed response instead of reading the stream to completion.
Here’s the broken pattern:
```ts
import { OpenAI } from "llamaindex";

const llm = new OpenAI({ model: "gpt-4o-mini", streaming: true });

const response = await llm.complete("Write a summary of this document.");
console.log(response.text); // may trigger cutoff behavior
```

And here’s the fixed version, which reads the stream to completion:

```ts
import { OpenAI } from "llamaindex";

const llm = new OpenAI({ model: "gpt-4o-mini", streaming: true });

const stream = await llm.complete("Write a summary of this document.");

let fullText = "";
for await (const chunk of stream) {
  fullText += chunk.delta ?? "";
}
console.log(fullText);
```
With LlamaIndex TypeScript, `streaming: true` changes the return type. You don’t get a normal final string immediately; you get a stream-like result that must be consumed until completion. If your code exits early, throws inside the loop, or never iterates at all, you can surface errors like:
- `Error: streaming response cutoff`
- `Error: Response stream ended before completion`
- `TypeError: response.text is undefined`
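If you only ever need the final text from a streamed call, it helps to keep the consumption logic in one place so a half-read stream can’t leak into the rest of your code. Here’s a minimal sketch that follows the chunk shape used in the fixed example above (`chunk.delta`); the exact chunk fields can differ between LlamaIndex versions, so treat them as an assumption.

```ts
import { OpenAI } from "llamaindex";

// Collect a streamed completion into a single string.
// Assumes each chunk exposes a `delta` field, as in the fixed example above.
async function collectStream(
  stream: AsyncIterable<{ delta?: string }>
): Promise<string> {
  let fullText = "";
  try {
    for await (const chunk of stream) {
      fullText += chunk.delta ?? "";
    }
  } catch (err) {
    // If the stream dies mid-way, you at least know how much text arrived.
    console.error(`Stream failed after ${fullText.length} characters`, err);
    throw err;
  }
  return fullText;
}

const llm = new OpenAI({ model: "gpt-4o-mini", streaming: true });
const text = await collectStream(
  await llm.complete("Write a summary of this document.")
);
console.log(text);
```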
If you want a non-streaming response, turn streaming off:
```ts
import { OpenAI } from "llamaindex";
const llm = new OpenAI({
model: "gpt-4o-mini",
streaming: false,
});
const response = await llm.complete("Write a summary of this document.");
console.log(response.text);
```
Other Possible Causes
1) The provider does not support streaming correctly
Some model wrappers expose a streaming option, but the backend may not support token-by-token output reliably.
```ts
import { Anthropic } from "llamaindex";

const llm = new Anthropic({
  model: "claude-3-haiku-20240307",
  streaming: true,
});
```
If you see cutoff errors only on one provider, test the same prompt with another supported provider or disable streaming to confirm.
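One quick way to confirm a provider issue is to wrap the streamed call in a fallback that retries the same prompt without streaming. This is a sketch, not built-in library behavior; it reuses the hypothetical `collectStream` helper shown earlier and the constructor options used elsewhere in this article.

```ts
import { Anthropic } from "llamaindex";

// Try the streamed path first; if it fails, retry the same prompt without streaming.
async function completeWithFallback(prompt: string): Promise<string> {
  const streamingLlm = new Anthropic({
    model: "claude-3-haiku-20240307",
    streaming: true,
  });
  try {
    return await collectStream(await streamingLlm.complete(prompt));
  } catch (err) {
    console.warn("Streaming failed, retrying without streaming:", err);
    const plainLlm = new Anthropic({
      model: "claude-3-haiku-20240307",
      streaming: false,
    });
    const response = await plainLlm.complete(prompt);
    return response.text;
  }
}
```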
2) Your HTTP layer is cutting off the connection
Proxy timeouts, serverless function limits, and reverse proxies can terminate the SSE/HTTP stream early.
Common offenders:
```ts
// Next.js / Vercel route timeout example
export const maxDuration = 5;
```
Another common offender is an upstream proxy with a short idle timeout. If your prompt is long or your model is slow, the connection may die before completion.
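On Vercel-style platforms, raising the route’s duration limit is often enough to keep long generations alive. A sketch for a Next.js App Router handler; the 60-second value and the `/api/generate` path are assumptions to adapt to your plan and project.

```ts
// app/api/generate/route.ts (hypothetical path)
// Give the route enough time for a slow generation to finish (seconds).
export const maxDuration = 60;

export async function POST(req: Request) {
  const { prompt } = await req.json();

  // ...run your LlamaIndex call here with `prompt`...

  return new Response(JSON.stringify({ ok: true }), {
    headers: { "Content-Type": "application/json" },
  });
}
```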
3) You are closing over an aborted request signal
If you pass an abort signal from a web request and that request ends, LlamaIndex will stop reading the stream.
```ts
const controller = new AbortController();
setTimeout(() => controller.abort(), 2000);

await llm.complete("Long answer please", {
  signal: controller.signal,
});
```
That can look like a cutoff even though your own code caused it. Remove the abort signal or extend its timeout.
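If you do need a timeout, make it generous and clear it once the call finishes so a stray abort can’t fire mid-generation. A sketch assuming `complete()` accepts a `signal` option as in the snippet above; the two-minute value is an arbitrary choice.

```ts
import { OpenAI } from "llamaindex";

const llm = new OpenAI({ model: "gpt-4o-mini" });

const controller = new AbortController();
// Two minutes instead of two seconds; adjust to your longest expected generation.
const timer = setTimeout(() => controller.abort(), 120_000);

try {
  const response = await llm.complete("Long answer please", {
    signal: controller.signal,
  });
  console.log(response);
} finally {
  // Always clear the timer so it can't abort a later request.
  clearTimeout(timer);
}
```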
4) Token limits are too low for the response
Sometimes the model finishes “early” because max output tokens are too small. The SDK then receives an incomplete generation and reports it as truncated.
```ts
const llm = new OpenAI({
  model: "gpt-4o-mini",
  streaming: true,
  maxTokens: 50,
});
```
Increase output tokens:
```ts
const llm = new OpenAI({
  model: "gpt-4o-mini",
  streaming: true,
  maxTokens: 512,
});
```
How to Debug It
- Turn off streaming first
  - Set `streaming: false`.
  - If the error disappears, your issue is in stream handling or transport, not prompt content.
- Log the actual return type
  - Check whether you’re getting a plain response or a stream object.
  - In TypeScript, inspect the type returned by `complete()` or `chat()` before calling `.text` (see the sketch after this list).
- Test with a minimal prompt
  - Use something tiny like `"Say hello in one sentence."`
  - If small prompts work but longer ones fail, look at token limits or network timeouts.
- Remove infrastructure variables
  - Run locally without proxies.
  - Disable abort signals.
  - Increase serverless duration.
  - If it works locally but fails in production, it’s usually transport-related.
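For the return-type check in the list above, a quick runtime test tells you which shape you actually got. This is a sketch; the exact response shape depends on your LlamaIndex version, so the `.text` access is an assumption.

```ts
import { OpenAI } from "llamaindex";

const llm = new OpenAI({ model: "gpt-4o-mini", streaming: true });
const result = await llm.complete("Say hello in one sentence.");

// A streamed result is async-iterable; a finished response is a plain object.
if (typeof (result as any)[Symbol.asyncIterator] === "function") {
  console.log("Got a stream: consume it with for await...of");
} else {
  console.log("Got a final response:", (result as any).text);
}
```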
Prevention
- Keep streaming and non-streaming paths separate in your codebase.
- Consume streams fully with `for await...of`; don’t treat streamed responses like finished strings.
- Set sane production limits:
  - enough `maxTokens`
  - enough request timeout
  - no aggressive abort controllers on long generations
If you’re wiring LlamaIndex into an API route, make one decision up front:
- Need partial tokens in the UI? Use streaming end-to-end (see the route sketch below).
- Need stable backend processing? Disable streaming and use final responses only.
That one choice avoids most streaming response cutoff incidents I see in TypeScript projects.
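If you pick the streaming path, forward chunks to the client as they arrive instead of buffering the whole answer in the route. A minimal Next.js App Router sketch, assuming the same `chunk.delta` shape used earlier in this article; the route path is hypothetical and error handling is omitted.

```ts
// app/api/chat/route.ts (hypothetical path)
import { OpenAI } from "llamaindex";

export async function POST(req: Request) {
  const { prompt } = await req.json();

  const llm = new OpenAI({ model: "gpt-4o-mini", streaming: true });
  const stream = await llm.complete(prompt);

  // Pipe tokens straight to the client instead of waiting for the full text.
  const encoder = new TextEncoder();
  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      try {
        for await (const chunk of stream) {
          controller.enqueue(encoder.encode(chunk.delta ?? ""));
        }
      } finally {
        controller.close();
      }
    },
  });

  return new Response(body, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```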
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.