How to Fix 'streaming response cutoff' in LlamaIndex (Python)
If you’re seeing streaming response cutoff in LlamaIndex, it usually means the stream ended before the full model response was consumed. In practice, this shows up when you’re using streaming=True, iterating incorrectly, or letting the process exit before the generator finishes.
The error often appears with QueryEngine, ChatEngine, or response.response_gen when you expect a full answer but only get a partial token stream.
The Most Common Cause
The #1 cause is consuming the streaming generator incorrectly, or not consuming it at all.
In LlamaIndex, streaming responses are lazy. If you call a streaming API and then print the object directly, or stop iterating early, you’ll get a cutoff instead of the full response.
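The same pitfall can be reproduced with a plain Python generator — a standalone sketch, no LlamaIndex required, where `token_stream` stands in for the streaming response:

```python
# Standalone sketch of the lazy-generator pitfall (token_stream is a
# stand-in for a streaming response generator, not a LlamaIndex API).
def token_stream():
    for token in ["The ", "answer ", "is ", "42."]:
        yield token

stream = token_stream()

# Printing the object shows a repr like <generator object ...>, not the text
print(stream)

# Only full iteration produces the complete answer
full_text = "".join(stream)
print(full_text)
```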
Broken vs fixed pattern
| Broken | Fixed |
|---|---|
| Returns the stream object without consuming it | Iterates through the stream or uses .response after completion |
| Exits scope before stream finishes | Keeps execution alive until all tokens are read |
```python
# BROKEN
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Summarize the document")

# This prints a StreamingResponse object, not the full text
print(response)

# Or this can trigger cutoff if you stop reading too early
for token in response.response_gen:
    print(token, end="")
    break
```
```python
# FIXED
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Summarize the document")

# Consume the entire generator
for token in response.response_gen:
    print(token, end="", flush=True)

print("\n--- done ---")
```
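If you need both the live token-by-token output and the final string, you can accumulate chunks while streaming. A minimal sketch, with a stub generator standing in for `response.response_gen` (the assumption being that the real generator yields text chunks as strings):

```python
# Stub generator standing in for response.response_gen (assumption:
# the real generator yields text chunks as plain strings).
def response_gen():
    yield from ["Streaming ", "works ", "when ", "fully ", "consumed."]

chunks = []
for token in response_gen():
    print(token, end="", flush=True)  # live token-by-token UX
    chunks.append(token)              # keep each chunk for later

full_answer = "".join(chunks)         # final string once the stream ends
```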
If you want the final string instead of token-by-token output, don’t use streaming:
```python
query_engine = index.as_query_engine(streaming=False)
response = query_engine.query("Summarize the document")
print(response)
```
Other Possible Causes
1) Your app exits before the stream completes
This is common in scripts, FastAPI handlers, notebooks with interrupted cells, and background jobs.
```python
# BAD: process ends too early
response = query_engine.query("What is in this file?")
for token in response.response_gen:
    print(token, end="")
# no flush or lifecycle control
```
Use explicit flushing and keep the process alive until iteration finishes.
```python
# GOOD
for token in response.response_gen:
    print(token, end="", flush=True)
```
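If your handler can bail out of the loop early (an exception, a client disconnect), you can still drain the remaining tokens in a `finally` block so the generator runs to completion. A sketch with a stand-in generator, using `collections.deque` with `maxlen=0` as a fast exhauster:

```python
from collections import deque

# Stand-in for a streaming response generator
def token_stream():
    yield from ["partial ", "output ", "then ", "the ", "rest"]

stream = token_stream()
printed = []
try:
    for i, token in enumerate(stream):
        printed.append(token)
        if i == 1:          # simulate the handler stopping early
            break
finally:
    # Exhaust whatever is left so the stream still completes
    deque(stream, maxlen=0)
```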
2) Timeout from your LLM client or proxy
A reverse proxy, gateway timeout, or SDK timeout can cut off long generations.
```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",
    timeout=120.0,  # seconds; give long generations room to finish
)
```
If you’re behind Nginx, load balancers, or API gateways, check their idle timeout too. The Python code may be fine while infrastructure kills the connection mid-stream.
3) Token limits are too low
A short max_tokens can look like a cutoff when it’s actually just an early stop.
```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",
    max_tokens=128,
)
```
Increase output budget if your prompt expects long answers.
```python
llm = OpenAI(
    model="gpt-4o-mini",
    max_tokens=1024,
)
```
Also check whether your prompt plus retrieved context is crowding out output tokens. In LlamaIndex retrieval pipelines, large context windows can leave very little room for completion.
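As a rough sanity check, the room left for the completion is the context window minus everything you send. The numbers below are illustrative (a real check would count tokens with the model's tokenizer, not estimates):

```python
# Rough output-budget check with illustrative numbers (assumptions:
# an 8192-token context window and hand-estimated prompt sizes).
context_window = 8192            # model's total token limit
system_and_prompt_tokens = 900   # instructions + user query
retrieved_context_tokens = 6800  # large retrieval payload

available_for_output = (
    context_window - system_and_prompt_tokens - retrieved_context_tokens
)
print(available_for_output)  # only 492 tokens left for the answer
```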
4) Async code is being used like sync code
If you call an async streaming API without awaiting it properly, you can get incomplete output or weird shutdown behavior.
```python
# BAD: aquery() returns a coroutine if you forget to await it
response = query_engine.aquery("Explain this policy")
for token in response.response_gen:  # AttributeError: coroutine has no response_gen
    print(token, end="")
```
Make sure your event loop stays open and that you use await/async for consistently.
```python
# GOOD
response = await query_engine.aquery("Explain this policy")
async for token in response.response_gen:
    print(token, end="", flush=True)
```
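The sync-vs-async distinction can be reproduced without LlamaIndex. In this sketch an async generator stands in for the streaming response, and `asyncio.run` keeps the event loop open until the whole stream is consumed:

```python
import asyncio

# Stand-in for an async streaming generator
async def async_token_stream():
    for token in ["async ", "streams ", "need ", "async for"]:
        await asyncio.sleep(0)  # simulate waiting on the next chunk
        yield token

async def main():
    chunks = []
    async for token in async_token_stream():  # requires async for, not for
        chunks.append(token)
    return "".join(chunks)

# asyncio.run keeps the event loop alive until main() finishes
result = asyncio.run(main())
print(result)
```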
How to Debug It
- Confirm whether you're actually streaming
  - Check whether you set `streaming=True` on `as_query_engine()` or `as_chat_engine()`.
  - If not needed, disable streaming and see if the problem disappears.
- Inspect what type you got back
  - In LlamaIndex you may receive a `StreamingResponse`, not a plain string.
  - Print `type(response)` and verify whether you should read from `.response_gen`.
- Test with a minimal prompt
  - Use a tiny query like `"Say hello"` to rule out prompt length or retrieval issues.
  - If small prompts work but long ones fail, look at token limits and context size.
- Check infrastructure timeouts
  - Compare local script behavior with Docker, FastAPI, Celery, Kubernetes, or serverless.
  - If the cutoff only happens in production, suspect a proxy timeout or request lifecycle termination.
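The type check above can also be written defensively, so one helper handles both a streaming and a plain response. The attribute name `response_gen` matches LlamaIndex's `StreamingResponse`; the stub class here is only for illustration:

```python
class FakeStreamingResponse:
    """Stub standing in for LlamaIndex's StreamingResponse (illustration only)."""
    def __init__(self, tokens):
        self.response_gen = iter(tokens)

def read_full_text(response):
    # Streaming responses expose a generator; plain responses are string-like
    if hasattr(response, "response_gen"):
        return "".join(response.response_gen)
    return str(response)

print(read_full_text(FakeStreamingResponse(["a ", "b ", "c"])))  # a b c
print(read_full_text("plain answer"))                            # plain answer
```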
Prevention
- Use streaming only when you need token-by-token UX; otherwise keep `streaming=False` and return full text.
- Always consume `StreamingResponse.response_gen` fully before ending the request.
- Set explicit timeouts and token budgets in your LLM config so truncation is predictable instead of accidental.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.