How to Fix 'streaming response cutoff' in LlamaIndex (Python)
If you’re seeing streaming response cutoff in LlamaIndex, it usually means the stream ended before the full model response was consumed. In practice, this shows up when you’re using streaming=True, iterating incorrectly, or letting the process exit before the generator finishes.
The error often appears with QueryEngine, ChatEngine, or response.response_gen when you expect a full answer but only get a partial token stream.
The Most Common Cause
The #1 cause is consuming the streaming generator incorrectly, or not consuming it at all.
In LlamaIndex, streaming responses are lazy. If you call a streaming API and then print the object directly, or stop iterating early, you’ll get a cutoff instead of the full response.
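The same pitfall can be reproduced with a plain Python generator — a standalone sketch, no LlamaIndex required, where `token_stream` stands in for the streaming response:

```python
# Standalone sketch of the lazy-generator pitfall (token_stream is a
# stand-in for a streaming response generator, not a LlamaIndex API).
def token_stream():
    for token in ["The ", "answer ", "is ", "42."]:
        yield token

stream = token_stream()

# Printing the object shows a repr like <generator object ...>, not the text
print(stream)

# Only full iteration produces the complete answer
full_text = "".join(stream)
print(full_text)
```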
Broken vs fixed pattern
| Broken | Fixed |
|---|---|
| Returns the stream object without consuming it | Iterates through the stream or uses .response after completion |
| Exits scope before stream finishes | Keeps execution alive until all tokens are read |
```python
# BROKEN
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Summarize the document")

# This prints a StreamingResponse object, not the full text
print(response)

# Or this can trigger cutoff if you stop reading too early
for token in response.response_gen:
    print(token, end="")
    break
```
```python
# FIXED
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Summarize the document")

# Consume the entire generator
for token in response.response_gen:
    print(token, end="", flush=True)

print("\n--- done ---")
```
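If you need both the live token-by-token output and the final string, you can accumulate chunks while streaming. A minimal sketch, with a stub generator standing in for `response.response_gen` (the assumption being that the real generator yields text chunks as strings):

```python
# Stub generator standing in for response.response_gen (assumption:
# the real generator yields text chunks as plain strings).
def response_gen():
    yield from ["Streaming ", "works ", "when ", "fully ", "consumed."]

chunks = []
for token in response_gen():
    print(token, end="", flush=True)  # live token-by-token UX
    chunks.append(token)              # keep each chunk for later

full_answer = "".join(chunks)         # final string once the stream ends
```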
If you want the final string instead of token-by-token output, don’t use streaming:
```python
query_engine = index.as_query_engine(streaming=False)
response = query_engine.query("Summarize the document")
print(response)
```
Other Possible Causes
1) Your app exits before the stream completes
This is common in scripts, FastAPI handlers, notebooks with interrupted cells, and background jobs.
```python
# BAD: process ends too early
response = query_engine.query("What is in this file?")
for token in response.response_gen:
    print(token, end="")
# no flush or lifecycle control
```
Use explicit flushing and keep the process alive until iteration finishes.
```python
# GOOD
for token in response.response_gen:
    print(token, end="", flush=True)
```
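If your handler can bail out of the loop early (an exception, a client disconnect), you can still drain the remaining tokens in a `finally` block so the generator runs to completion. A sketch with a stand-in generator, using `collections.deque` with `maxlen=0` as a fast exhauster:

```python
from collections import deque

# Stand-in for a streaming response generator
def token_stream():
    yield from ["partial ", "output ", "then ", "the ", "rest"]

stream = token_stream()
printed = []
try:
    for i, token in enumerate(stream):
        printed.append(token)
        if i == 1:          # simulate the handler stopping early
            break
finally:
    # Exhaust whatever is left so the stream still completes
    deque(stream, maxlen=0)
```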
2) Timeout from your LLM client or proxy
A reverse proxy, gateway timeout, or SDK timeout can cut off long generations.
```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",
    timeout=120.0,  # seconds; give long generations room to finish
)
```
If you’re behind Nginx, load balancers, or API gateways, check their idle timeout too. The Python code may be fine while infrastructure kills the connection mid-stream.
3) Token limits are too low
A short max_tokens can look like a cutoff when it’s actually just an early stop.
```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",
    max_tokens=128,
)
```
Increase output budget if your prompt expects long answers.
```python
llm = OpenAI(
    model="gpt-4o-mini",
    max_tokens=1024,
)
```
Also check whether your prompt plus retrieved context is crowding out output tokens. In LlamaIndex retrieval pipelines, large context windows can leave very little room for completion.
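As a rough sanity check, the room left for the completion is the context window minus everything you send. The numbers below are illustrative (a real check would count tokens with the model's tokenizer, not estimates):

```python
# Rough output-budget check with illustrative numbers (assumptions:
# an 8192-token context window and hand-estimated prompt sizes).
context_window = 8192            # model's total token limit
system_and_prompt_tokens = 900   # instructions + user query
retrieved_context_tokens = 6800  # large retrieval payload

available_for_output = (
    context_window - system_and_prompt_tokens - retrieved_context_tokens
)
print(available_for_output)  # only 492 tokens left for the answer
```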
4) Async code is being used like sync code
If you call an async streaming API without awaiting it properly, you can get incomplete output or weird shutdown behavior.
```python
# BAD: aquery() returns a coroutine if you forget to await it
response = query_engine.aquery("Explain this policy")
for token in response.response_gen:  # AttributeError: coroutine has no response_gen
    print(token, end="")
```
Make sure your event loop stays open and that you use await/async for consistently.
```python
# GOOD
response = await query_engine.aquery("Explain this policy")
async for token in response.response_gen:
    print(token, end="", flush=True)
```
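The sync-vs-async distinction can be reproduced without LlamaIndex. In this sketch an async generator stands in for the streaming response, and `asyncio.run` keeps the event loop open until the whole stream is consumed:

```python
import asyncio

# Stand-in for an async streaming generator
async def async_token_stream():
    for token in ["async ", "streams ", "need ", "async for"]:
        await asyncio.sleep(0)  # simulate waiting on the next chunk
        yield token

async def main():
    chunks = []
    async for token in async_token_stream():  # requires async for, not for
        chunks.append(token)
    return "".join(chunks)

# asyncio.run keeps the event loop alive until main() finishes
result = asyncio.run(main())
print(result)
```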
How to Debug It
- Confirm whether you're actually streaming
  - Check whether you set `streaming=True` on `as_query_engine()` or `as_chat_engine()`.
  - If not needed, disable streaming and see if the problem disappears.
- Inspect what type you got back
  - In LlamaIndex you may receive a `StreamingResponse`, not a plain string.
  - Print `type(response)` and verify whether you should read from `.response_gen`.
- Test with a minimal prompt
  - Use a tiny query like `"Say hello"` to rule out prompt length or retrieval issues.
  - If small prompts work but long ones fail, look at token limits and context size.
- Check infrastructure timeouts
  - Compare local script behavior with Docker, FastAPI, Celery, Kubernetes, or serverless.
  - If the cutoff only happens in production, suspect a proxy timeout or request lifecycle termination.
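The type check above can also be written defensively, so one helper handles both a streaming and a plain response. The attribute name `response_gen` matches LlamaIndex's `StreamingResponse`; the stub class here is only for illustration:

```python
class FakeStreamingResponse:
    """Stub standing in for LlamaIndex's StreamingResponse (illustration only)."""
    def __init__(self, tokens):
        self.response_gen = iter(tokens)

def read_full_text(response):
    # Streaming responses expose a generator; plain responses are string-like
    if hasattr(response, "response_gen"):
        return "".join(response.response_gen)
    return str(response)

print(read_full_text(FakeStreamingResponse(["a ", "b ", "c"])))  # a b c
print(read_full_text("plain answer"))                            # plain answer
```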
Prevention
- Use streaming only when you need token-by-token UX; otherwise keep `streaming=False` and return full text.
- Always consume `StreamingResponse.response_gen` fully before ending the request.
- Set explicit timeouts and token budgets in your LLM config so truncation is predictable instead of accidental.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.