# How to Fix 'streaming response cutoff during development' in LlamaIndex (Python)
When you see streaming response cutoff during development, it usually means your LlamaIndex app started streaming tokens, then the stream got interrupted before the full response was consumed. In practice, this shows up during local development when you call a streaming query, print only part of the generator, or let the process exit before the stream is drained.
The key thing: this is usually not a model failure. It’s an application flow problem around StreamingResponse, response_gen, or how your event loop / web server handles the stream.
## The Most Common Cause
The #1 cause is treating a streaming response like a normal string response.
In LlamaIndex, a query engine created with `index.as_query_engine(streaming=True)` (and the streaming chat APIs) returns a streaming wrapper object instead of a completed text blob. If you don’t iterate through it fully, the response looks cut off.
### Broken vs. fixed
| Broken pattern | Fixed pattern |
|---|---|
| Calls streaming query but never consumes the generator | Iterates through response.response_gen or uses .print_response_stream() |
| Returns early from a FastAPI route | Streams the full body to the client |
| Prints the object directly | Reads tokens until completion |
```python
# BROKEN
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Summarize this document")
print(response)  # Often prints a StreamingResponse wrapper, not the full answer
# The process may exit before the stream is consumed
```
```python
# FIXED
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Summarize this document")
for token in response.response_gen:
    print(token, end="", flush=True)
print()
```
If you want LlamaIndex to handle printing for you:
```python
response = query_engine.query("Summarize this document")
response.print_response_stream()
```
That pattern matters because the stream stays alive only while something is actively consuming it.
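The behavior is easy to reproduce without LlamaIndex at all, because `response_gen` behaves like a plain Python generator. This toy sketch (ordinary generators, no LlamaIndex imports) shows the difference between printing the wrapper and draining it:

```python
# Toy stand-in for a LlamaIndex token stream: a plain Python generator
# reproduces the same symptom.
def fake_token_stream():
    for token in ["The ", "policy ", "covers ", "water ", "damage."]:
        yield token

# Printing the generator object (like print(response) on a streaming
# response) shows a wrapper repr, not the answer text:
print(fake_token_stream())  # prints something like <generator object ...>

# Draining it fully recovers the complete answer:
full_text = "".join(fake_token_stream())
print(full_text)
```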
## Other Possible Causes
### 1) Your script exits before the stream finishes
This happens a lot in notebooks, CLI scripts, and quick local tests.
```python
response = query_engine.query("Explain the policy")
for token in response.response_gen:
    print(token, end="")
# If this runs inside a short-lived script that exits immediately after,
# background cleanup can cut off unfinished output.
```
Fix it by keeping the process alive until iteration completes:
```python
tokens = []
for token in response.response_gen:
    tokens.append(token)
full_text = "".join(tokens)
print(full_text)
```
### 2) You’re mixing async and sync incorrectly
LlamaIndex has async paths like aquery(). If you call async code without awaiting it properly, streams can terminate early or never start.
```python
# BROKEN
response = await query_engine.aquery("What does this contract mean?")
print(response.response_gen)  # Not enough; you still need to consume it properly
```
Use an async consumer:
```python
response = await query_engine.aquery("What does this contract mean?")
async for token in response.async_response_gen():
    print(token, end="", flush=True)
```
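If you are in a plain script (no event loop already running), wrap the async consumption in `asyncio.run(...)`. The sketch below is self-contained so the pattern can actually run: `FakeEngine` and `FakeStreamingResponse` are hypothetical stand-ins for a real streaming `query_engine` and its response, not LlamaIndex classes.

```python
import asyncio

# Hypothetical stand-ins for a LlamaIndex query engine built with
# streaming=True; swap FakeEngine for your real query_engine.
class FakeStreamingResponse:
    async def async_response_gen(self):
        for token in ["Streaming ", "works."]:
            yield token

class FakeEngine:
    async def aquery(self, question):
        return FakeStreamingResponse()

async def main():
    engine = FakeEngine()
    response = await engine.aquery("What does this contract mean?")
    parts = []
    async for token in response.async_response_gen():
        parts.append(token)  # consume every token before returning
    return "".join(parts)

print(asyncio.run(main()))  # Streaming works.
```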
### 3) Your web framework returns before streaming completes
FastAPI and Starlette need explicit streaming responses. Returning a partially built string from inside a route will cut off output.
```python
# BROKEN
@app.get("/chat")
def chat():
    query_engine = index.as_query_engine(streaming=True)
    response = query_engine.query("Answer this question")
    return {"answer": str(response)}  # Serializes the wrapper, not the full answer
```
Use FastAPI's StreamingResponse:
```python
from fastapi.responses import StreamingResponse

@app.get("/chat")
def chat():
    query_engine = index.as_query_engine(streaming=True)
    response = query_engine.query("Answer this question")

    def token_stream():
        for token in response.response_gen:
            yield token

    return StreamingResponse(token_stream(), media_type="text/plain")
```
### 4) Timeout or proxy limits are killing long streams
If you’re behind Gunicorn, nginx, ALB, or a local dev proxy, idle timeouts can truncate streamed output.
For nginx:
```nginx
proxy_read_timeout 300;
proxy_send_timeout 300;
proxy_buffering off;   # response buffering can also make streams look stalled or truncated
chunked_transfer_encoding on;
```
For Gunicorn:
```bash
gunicorn app:app --timeout 300 --worker-class uvicorn.workers.UvicornWorker
```
If your model takes time before sending the first token, low timeout values will look exactly like a “cutoff.”
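A quick way to tell a timeout apart from a real cutoff is to measure time-to-first-token separately from total time. A minimal sketch, using a fake stream in place of `response.response_gen` (swap in the real generator):

```python
import time

def fake_stream():
    # Stand-in for response.response_gen; the sleep simulates model
    # warm-up latency before the first token arrives.
    time.sleep(0.2)
    yield "first "
    yield "token"

start = time.monotonic()
first_token_latency = None
for token in fake_stream():
    if first_token_latency is None:
        first_token_latency = time.monotonic() - start
total_time = time.monotonic() - start

# If first_token_latency exceeds your proxy's read timeout, the "cutoff"
# is really an idle-timeout kill, not a broken stream.
print(f"first token after {first_token_latency:.2f}s, done after {total_time:.2f}s")
```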
## How to Debug It
- **Check whether you’re actually using streaming mode.** Look for `streaming=True`, `response.response_gen`, `print_response_stream()`, or `StreamingResponse`. If you expected plain text but got a `StreamingAgentChatResponse` or `StreamingResponse`, you need to consume it correctly.
- **Log when generation starts and ends.** Add logs before query execution and after iteration finishes. If you never hit the “finished” log, your stream is being interrupted.
- **Test outside your framework.** Run the same LlamaIndex call in a plain Python script. If it works there but fails in FastAPI/Flask/Streamlit, the bug is in request handling or lifecycle management.
- **Inspect timeout and cancellation behavior.** Check the browser console, reverse proxy logs, server logs, and notebook kernel output. Look for `CancelledError`, connection resets, or worker restarts.
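The start/end logging can be packaged as a small wrapper around any token generator. This is a sketch (the `logged` helper is not a LlamaIndex API); in real code, pass it `response.response_gen`:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("stream-debug")

def logged(token_gen):
    """Yield tokens unchanged, logging when iteration starts and finishes."""
    log.info("stream started")
    count = 0
    for token in token_gen:
        count += 1
        yield token
    # If this line never logs, the stream was cut off mid-iteration.
    log.info("stream finished after %d tokens", count)

# Demo with a stand-in generator; in real code use logged(response.response_gen).
text = "".join(logged(iter(["a", "b", "c"])))
print(text)  # abc
```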
## Prevention
- Always treat LlamaIndex streaming objects as iterators, not strings.
- In web apps, use framework-native streaming responses instead of returning completed JSON too early.
- Add timeout budgets at every layer: model call, app server, reverse proxy, and client.
If you want one rule to remember: don’t let a streaming LlamaIndex response go unconsumed. That’s what turns a valid partial generation into a cutoff during development.
## Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit