How to Fix 'streaming response cutoff in production' in LlamaIndex (Python)
When you see streaming response cutoff in production, it usually means your LlamaIndex stream started correctly, then got terminated before the full response could be delivered to the client. In practice, this shows up in FastAPI, Flask, Streamlit, or any reverse-proxy setup where the HTTP connection closes early, the generator is garbage-collected, or your app buffers the stream instead of forwarding chunks.
The important part: this is usually not a model problem. It’s an app lifecycle, transport, or streaming-iterator problem.
The Most Common Cause
The #1 cause is returning a streaming generator from a request handler without keeping the response open long enough for all tokens to flush.
With LlamaIndex, people often use StreamingResponse and response.response_gen, but they accidentally consume the generator too early, return the wrong object, or let the request context end before streaming completes.
| Broken pattern | Fixed pattern |
|---|---|
| Returns a plain string after starting a stream | Returns StreamingResponse directly from the generator |
| Iterates over response_gen inside the route | Passes response_gen through unchanged |
| Lets FastAPI close the request before chunks flush | Keeps the coroutine alive until stream completion |
# BROKEN
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_index.core import VectorStoreIndex

app = FastAPI()
# index is assumed to be a VectorStoreIndex built elsewhere in your app.

@app.get("/chat")
async def chat(q: str):
    query_engine = index.as_query_engine(streaming=True)
    response = query_engine.query(q)
    # This consumes the stream too early.
    chunks = []
    for token in response.response_gen:
        chunks.append(token)
    # By now you've buffered everything and may still cut off under load.
    return {"answer": "".join(chunks)}
# FIXED
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_index.core import VectorStoreIndex

app = FastAPI()
# index is assumed to be a VectorStoreIndex built elsewhere in your app.

@app.get("/chat")
async def chat(q: str):
    query_engine = index.as_query_engine(streaming=True)
    response = query_engine.query(q)

    def token_stream():
        for token in response.response_gen:
            yield token

    return StreamingResponse(token_stream(), media_type="text/plain")
If you’re using a ChatEngine, the same rule applies:
# FIXED WITH CHATENGINE (inside the async route handler)
chat_engine = index.as_chat_engine(streaming=True)
response = chat_engine.stream_chat("What is the policy on refunds?")
return StreamingResponse(response.response_gen, media_type="text/plain")
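If your endpoint is fully async, recent LlamaIndex releases also expose an async streaming path. Method names have drifted across versions, so treat this as a sketch to verify against your installed release rather than the one true API:
# SKETCH: async variant (verify these method names against your LlamaIndex version)
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
# index is assumed to be a VectorStoreIndex built elsewhere in your app.

@app.get("/chat-async")
async def chat_async(q: str):
    chat_engine = index.as_chat_engine(streaming=True)
    # astream_chat / async_response_gen exist in recent llama_index.core releases;
    # older versions only provide the synchronous stream_chat + response_gen pair.
    response = await chat_engine.astream_chat(q)
    return StreamingResponse(response.async_response_gen(), media_type="text/plain")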
Other Possible Causes
1) Your proxy is buffering or timing out
If you run behind Nginx, ALB, Cloudflare, or API Gateway, they may buffer the upstream response or kill idle connections.
location /chat {
    proxy_buffering off;
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
}
If you’re on an AWS ALB, check the idle timeout. A 60-second idle timeout will cut off long generations even if your Python code is fine.
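If you manage the load balancer yourself, you can raise that idle timeout programmatically. A minimal boto3 sketch, assuming you have the load balancer ARN and permission to modify its attributes (the ARN and region below are placeholders):
# SKETCH: raise the ALB idle timeout with boto3 (ARN and region are placeholders)
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123",
    Attributes=[{"Key": "idle_timeout.timeout_seconds", "Value": "300"}],
)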
2) You’re using sync code inside an async endpoint
LlamaIndex streaming can stall if you block the event loop with heavy CPU work or synchronous I/O.
# BAD
@app.get("/chat")
async def chat(q: str):
    result = expensive_local_rerank()  # blocks event loop
    response = query_engine.query(q)
    return StreamingResponse(response.response_gen)
Move blocking work out of the request path or use run_in_executor.
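A minimal sketch of that fix, reusing the hypothetical expensive_local_rerank call from the BAD example above; asyncio.to_thread (Python 3.9+) is a convenient shorthand for loop.run_in_executor:
# SKETCH: push blocking work onto a thread so the event loop keeps streaming
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/chat")
async def chat(q: str):
    # expensive_local_rerank is the hypothetical blocking call from the example above.
    result = await asyncio.to_thread(expensive_local_rerank)
    # query() is synchronous here, so push it to a thread too if it is slow.
    response = await asyncio.to_thread(query_engine.query, q)
    return StreamingResponse(response.response_gen, media_type="text/plain")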
3) The client disconnects before completion
Browsers, mobile apps, and frontend frameworks often cancel fetches when components unmount. On the server side this looks like a cutoff.
from starlette.requests import Request

@app.get("/chat")
async def chat(request: Request, q: str):
    response = query_engine.query(q)

    async def stream():
        for token in response.response_gen:
            if await request.is_disconnected():
                break
            yield token

    return StreamingResponse(stream(), media_type="text/plain")
If you don’t check disconnects, you’ll see partial output and misleading logs like:
- ClientDisconnect
- CancelledError
- an incomplete StreamingResponse
4) You’re mixing incompatible LlamaIndex versions
Older examples used different response objects and streaming APIs. If your code assumes one version but your environment has another, you’ll get strange partial-stream behavior.
Check these first:
pip show llama-index
pip freeze | grep llama-index
Make sure your code matches the installed API:
- query_engine.query(...) with .response_gen
- chat_engine.stream_chat(...) with .response_gen
- the newer package split, with modules under llama_index.core
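If you would rather assert the version at runtime than eyeball pip output, a small standard-library sketch works; note that package names differ between the legacy llama-index distribution and the newer llama-index-core split, so adjust the tuple to whatever you actually install:
# SKETCH: log the installed LlamaIndex version(s) at startup
from importlib.metadata import PackageNotFoundError, version

for pkg in ("llama-index", "llama-index-core"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} not installed")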
How to Debug It
- Confirm where the cutoff happens:
  - Log before streaming starts.
  - Log each yielded token count.
  - If logs stop mid-stream, it’s transport or disconnect related.
# Inside your streaming generator:
for i, token in enumerate(response.response_gen):
    print(f"token={i}")
    yield token
- Test without your proxy:
  - Hit Uvicorn directly on localhost (see the client sketch after this list).
  - If it works locally but fails behind Nginx/ALB, fix buffering/timeouts first.
- Check whether you are buffering accidentally:
  - Search for "".join(...), list(response.response_gen), or returning JSON.
  - Those patterns destroy true streaming.
- Inspect server logs for cancellation:
  - Look for:
    - asyncio.CancelledError
    - ClientDisconnect
    - worker restart messages from Gunicorn/Uvicorn
  - If present, increase worker timeout and check client behavior.
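For the "test without your proxy" step, you can hit the Uvicorn port directly with a streaming HTTP client and watch whether chunks arrive incrementally. A minimal sketch using httpx; the URL and query parameter match the example route above, so adjust them to your app:
# SKETCH: verify chunks arrive one by one when talking to Uvicorn directly
import httpx

with httpx.stream("GET", "http://127.0.0.1:8000/chat", params={"q": "test"}, timeout=None) as r:
    for chunk in r.iter_text():
        print(repr(chunk))  # prints should appear as tokens arrive, not all at once
If this streams cleanly on localhost but cuts off through your proxy, the problem is buffering or timeouts, not your LlamaIndex code.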
Prevention
- Use StreamingResponse end-to-end; do not materialize response.response_gen into a list unless you explicitly want non-streaming output.
- Set sane infrastructure timeouts (see the Gunicorn sketch below):
  - Uvicorn/Gunicorn worker timeout
  - Nginx proxy read timeout
  - ALB idle timeout
- Add disconnect handling in long-lived streams so your app exits cleanly when clients drop.
- Pin LlamaIndex versions and test streaming after every upgrade; API drift is common across releases.
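For the worker-timeout bullet specifically, here is a minimal sketch of a gunicorn.conf.py for a FastAPI app running Uvicorn workers; the numbers are illustrative, not recommendations, so tune them to your longest expected generation:
# SKETCH: gunicorn.conf.py (Gunicorn config files are plain Python)
worker_class = "uvicorn.workers.UvicornWorker"
workers = 2
timeout = 300            # workers silent longer than this are killed and restarted
graceful_timeout = 30    # time allowed for in-flight streams during shutdown
keepalive = 75           # keep idle HTTP connections open between requests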
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.