How to Fix 'streaming response cutoff during development' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: streaming-response-cutoff-during-development, llamaindex, python

When you see streaming response cutoff during development, it usually means your LlamaIndex app started streaming tokens, then the stream got interrupted before the full response was consumed. In practice, this shows up during local development when you call a streaming query, print only part of the generator, or let the process exit before the stream is drained.

The key thing: this is usually not a model failure. It’s an application flow problem around StreamingResponse, response_gen, or how your event loop / web server handles the stream.

The Most Common Cause

The #1 cause is treating a streaming response like a normal string response.

In LlamaIndex, a query engine built with as_query_engine(streaming=True) (and the streaming chat APIs) returns a streaming object instead of a completed text blob. If you don’t iterate through it fully, the response looks cut off.

Broken vs fixed

Broken pattern → Fixed pattern

  • Calls streaming query but never consumes the generator → Iterates through response.response_gen or uses .print_response_stream()
  • Returns early from a FastAPI route → Streams the full body to the client
  • Prints the object directly → Reads tokens until completion

# BROKEN
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # any document source works here
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(streaming=True)

response = query_engine.query("Summarize this document")

print(response)  # Often prints a StreamingResponse wrapper, not the full answer
# Process may exit before the stream is consumed

# FIXED
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # any document source works here
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(streaming=True)

response = query_engine.query("Summarize this document")

# Drain the generator; the answer is only complete once iteration ends
for token in response.response_gen:
    print(token, end="", flush=True)

print()

If you want LlamaIndex to handle printing for you:

response = query_engine.query("Summarize this document")
response.print_response_stream()

That pattern matters because the stream stays alive only while something is actively consuming it.
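
If you want the completed answer back as a single string rather than printed tokens, the streaming wrapper can also be materialized; a minimal sketch, reusing the query_engine from above (recent LlamaIndex versions expose get_response() for this):

response = query_engine.query("Summarize this document")
final = response.get_response()  # drains the token generator and returns a regular Response
print(final.response)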

Other Possible Causes

1) Your script exits before the stream finishes

This happens a lot in notebooks, CLI scripts, and quick local tests.

response = query_engine.query("Explain the policy")
for token in response.response_gen:
    print(token, end="")

# If this runs inside a short-lived script and exits immediately after,
# background cleanup can cut off unfinished output.

Fix it by collecting the full stream before the script moves on, so it’s obvious when the answer is complete:

tokens = []
for token in response.response_gen:
    tokens.append(token)

full_text = "".join(tokens)
print(full_text)

2) You’re mixing async and sync incorrectly

LlamaIndex has async paths like aquery(). If you call async code without awaiting it properly, or try to consume an async stream with synchronous code, streams can terminate early or never start.

# BROKEN
response = await query_engine.aquery("What does this contract mean?")
print(response.response_gen)  # Not enough; still need to consume it properly

Use an async consumer:

response = await query_engine.aquery("What does this contract mean?")

async for token in response.async_response_gen():
    print(token, end="", flush=True)
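
In a plain script (outside a notebook or an async framework), the await calls need to live inside a coroutine. A minimal sketch, assuming index and query_engine are built with streaming=True as in the earlier examples:

import asyncio

async def main():
    response = await query_engine.aquery("What does this contract mean?")
    # Consume the async token generator until it is exhausted
    async for token in response.async_response_gen():
        print(token, end="", flush=True)
    print()

asyncio.run(main())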

3) Your web framework returns before streaming completes

FastAPI and Starlette need explicit streaming responses. Converting the streaming wrapper to a string inside a route returns whatever happens to have been consumed at that moment (often nothing) instead of streaming tokens to the client.

# BROKEN
@app.get("/chat")
def chat():
    # query_engine was built with index.as_query_engine(streaming=True)
    response = query_engine.query("Answer this question")
    return {"answer": str(response)}  # Serializes the wrapper, not the streamed answer

Use StreamingResponse:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/chat")
def chat():
    # query_engine was built with index.as_query_engine(streaming=True)
    response = query_engine.query("Answer this question")

    def token_stream():
        for token in response.response_gen:
            yield token

    return StreamingResponse(token_stream(), media_type="text/plain")
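
To confirm during development that tokens actually arrive incrementally, consume the route with a streaming HTTP client. A minimal sketch using httpx, assuming the app above is running locally on port 8000:

import httpx

# Read the body chunk by chunk instead of waiting for the full response
with httpx.stream("GET", "http://localhost:8000/chat", timeout=300) as r:
    for chunk in r.iter_text():
        print(chunk, end="", flush=True)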

4) Timeout or proxy limits are killing long streams

If you’re behind Gunicorn, nginx, ALB, or a local dev proxy, idle timeouts can truncate streamed output. For nginx, raise the read and send timeouts:

proxy_read_timeout 300;
proxy_send_timeout 300;
chunked_transfer_encoding on;
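
nginx also buffers proxied responses by default, which can make a healthy stream look frozen or truncated until the buffer flushes; disabling buffering for the streaming route is usually worth trying:

proxy_buffering off;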

For Gunicorn:

gunicorn app:app --timeout 300 --worker-class uvicorn.workers.UvicornWorker

If your model takes time before sending the first token, low timeout values will look exactly like a “cutoff.”

How to Debug It

  1. Check whether you’re actually using streaming mode

    • Look for streaming=True, response.response_gen, print_response_stream(), or StreamingResponse.
    • If you expected normal text but got StreamingAgentChatResponse or StreamingResponse, you need to consume it correctly.
  2. Log when generation starts and ends

    • Add logs before query execution and after iteration finishes (see the sketch after this list).
    • If you never hit the “finished” log, your stream is being interrupted.
  3. Test outside your framework

    • Run the same LlamaIndex call in a plain Python script.
    • If it works there but fails in FastAPI/Flask/Streamlit, the bug is in request handling or lifecycle management.
  4. Inspect timeout and cancellation behavior

    • Check browser console, reverse proxy logs, server logs, and notebook kernel output.
    • Look for cancellation messages around CancelledError, connection resets, or worker restarts.
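
For step 2, a minimal logging sketch, assuming the same streaming query_engine as in the earlier examples:

import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("stream-debug")

response = query_engine.query("Summarize this document")

log.info("stream started")
start = time.time()
token_count = 0
for token in response.response_gen:
    print(token, end="", flush=True)
    token_count += 1
log.info("stream finished: %d tokens in %.1fs", token_count, time.time() - start)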

Prevention

  • Always treat LlamaIndex streaming objects as iterators, not strings.
  • In web apps, use framework-native streaming responses instead of returning completed JSON too early.
  • Add timeout budgets at every layer: model call, app server, reverse proxy, and client (a model-call sketch follows below).
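
For the model-call layer, most LlamaIndex LLM integrations accept a request timeout; a minimal sketch assuming an OpenAI-backed setup (the exact parameter name can vary by integration, so check the LLM class you actually use):

from llama_index.llms.openai import OpenAI

# timeout (seconds) is passed through to the underlying OpenAI client
llm = OpenAI(model="gpt-4o-mini", timeout=120)
query_engine = index.as_query_engine(streaming=True, llm=llm)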

If you want one rule to remember: don’t let a streaming LlamaIndex response go unconsumed. That’s what turns a valid partial generation into a cutoff during development.


By Cyprian Aarons, AI Consultant at Topiax.