How to Fix 'streaming response cutoff in production' in LlamaIndex (Python)
When you see streaming response cutoff in production, it usually means your LlamaIndex stream started correctly, then got terminated before the full response could be delivered to the client. In practice, this shows up in FastAPI, Flask, Streamlit, or any reverse-proxy setup where the HTTP connection closes early, the generator is garbage-collected, or your app buffers the stream instead of forwarding chunks.
The important part: this is usually not a model problem. It’s an app lifecycle, transport, or streaming-iterator problem.
The Most Common Cause
The #1 cause is returning a streaming generator from a request handler without keeping the response open long enough for all tokens to flush.
With LlamaIndex, people often use StreamingResponse and response.response_gen, but they accidentally consume the generator too early, return the wrong object, or let the request context end before streaming completes.
| Broken pattern | Fixed pattern |
|---|---|
| Returns a plain string after starting a stream | Returns StreamingResponse directly from the generator |
| Iterates over response_gen inside the route | Passes response_gen through unchanged |
| Lets FastAPI close the request before chunks flush | Keeps the coroutine alive until stream completion |
# BROKEN
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_index.core import VectorStoreIndex

app = FastAPI()
# index is assumed to be a VectorStoreIndex built elsewhere in your app.

@app.get("/chat")
async def chat(q: str):
    query_engine = index.as_query_engine(streaming=True)
    response = query_engine.query(q)
    # This consumes the stream too early.
    chunks = []
    for token in response.response_gen:
        chunks.append(token)
    # By now you've buffered everything and may still cut off under load.
    return {"answer": "".join(chunks)}
# FIXED
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_index.core import VectorStoreIndex

app = FastAPI()
# index is assumed to be a VectorStoreIndex built elsewhere in your app.

@app.get("/chat")
async def chat(q: str):
    query_engine = index.as_query_engine(streaming=True)
    response = query_engine.query(q)

    def token_stream():
        for token in response.response_gen:
            yield token

    return StreamingResponse(token_stream(), media_type="text/plain")
If you’re using a ChatEngine, the same rule applies:
# FIXED WITH CHATENGINE (inside the async route handler)
chat_engine = index.as_chat_engine(streaming=True)
response = chat_engine.stream_chat("What is the policy on refunds?")
return StreamingResponse(response.response_gen, media_type="text/plain")
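If your endpoint is fully async, recent LlamaIndex releases also expose an async streaming path. Method names have drifted across versions, so treat this as a sketch to verify against your installed release rather than the one true API:
# SKETCH: async variant (verify these method names against your LlamaIndex version)
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
# index is assumed to be a VectorStoreIndex built elsewhere in your app.

@app.get("/chat-async")
async def chat_async(q: str):
    chat_engine = index.as_chat_engine(streaming=True)
    # astream_chat / async_response_gen exist in recent llama_index.core releases;
    # older versions only provide the synchronous stream_chat + response_gen pair.
    response = await chat_engine.astream_chat(q)
    return StreamingResponse(response.async_response_gen(), media_type="text/plain")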
Other Possible Causes
1) Your proxy is buffering or timing out
If you run behind Nginx, ALB, Cloudflare, or API Gateway, they may buffer the upstream response or kill idle connections.
location /chat {
    proxy_buffering off;
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
}
If you’re on an AWS ALB, check the idle timeout. A 60-second idle timeout will cut off long generations even if your Python code is fine.
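If you manage the load balancer yourself, you can raise that idle timeout programmatically. A minimal boto3 sketch, assuming you have the load balancer ARN and permission to modify its attributes (the ARN and region below are placeholders):
# SKETCH: raise the ALB idle timeout with boto3 (ARN and region are placeholders)
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123",
    Attributes=[{"Key": "idle_timeout.timeout_seconds", "Value": "300"}],
)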
2) You’re using sync code inside an async endpoint
LlamaIndex streaming can stall if you block the event loop with heavy CPU work or synchronous I/O.
# BAD
@app.get("/chat")
async def chat(q: str):
    result = expensive_local_rerank()  # blocks event loop
    response = query_engine.query(q)
    return StreamingResponse(response.response_gen)
Move blocking work out of the request path or use run_in_executor.
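A minimal sketch of that fix, reusing the hypothetical expensive_local_rerank call from the BAD example above; asyncio.to_thread (Python 3.9+) is a convenient shorthand for loop.run_in_executor:
# SKETCH: push blocking work onto a thread so the event loop keeps streaming
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/chat")
async def chat(q: str):
    # expensive_local_rerank is the hypothetical blocking call from the example above.
    result = await asyncio.to_thread(expensive_local_rerank)
    # query() is synchronous here, so push it to a thread too if it is slow.
    response = await asyncio.to_thread(query_engine.query, q)
    return StreamingResponse(response.response_gen, media_type="text/plain")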
3) The client disconnects before completion
Browsers, mobile apps, and frontend frameworks often cancel fetches when components unmount. On the server side this looks like a cutoff.
from starlette.requests import Request

@app.get("/chat")
async def chat(request: Request, q: str):
    response = query_engine.query(q)

    async def stream():
        for token in response.response_gen:
            if await request.is_disconnected():
                break
            yield token

    return StreamingResponse(stream(), media_type="text/plain")
If you don’t check disconnects, you’ll see partial output and misleading logs like:
- ClientDisconnect
- CancelledError
- an incomplete StreamingResponse
4) You’re mixing incompatible LlamaIndex versions
Older examples used different response objects and streaming APIs. If your code assumes one version but your environment has another, you’ll get strange partial-stream behavior.
Check these first:
pip show llama-index
pip freeze | grep llama-index
Make sure your code matches the installed API:
- query_engine.query(...) with .response_gen
- chat_engine.stream_chat(...) with .response_gen
- the newer package split, with modules under llama_index.core
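If you would rather assert the version at runtime than eyeball pip output, a small standard-library sketch works; note that package names differ between the legacy llama-index distribution and the newer llama-index-core split, so adjust the tuple to whatever you actually install:
# SKETCH: log the installed LlamaIndex version(s) at startup
from importlib.metadata import PackageNotFoundError, version

for pkg in ("llama-index", "llama-index-core"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} not installed")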
How to Debug It
- Confirm where the cutoff happens:
  - Log before streaming starts.
  - Log each yielded token count.
  - If logs stop mid-stream, it’s transport or disconnect related.
# Inside your streaming generator:
for i, token in enumerate(response.response_gen):
    print(f"token={i}")
    yield token
- Test without your proxy:
  - Hit Uvicorn directly on localhost (see the client sketch after this list).
  - If it works locally but fails behind Nginx/ALB, fix buffering/timeouts first.
- Check whether you are buffering accidentally:
  - Search for "".join(...), list(response.response_gen), or returning JSON.
  - Those patterns destroy true streaming.
- Inspect server logs for cancellation:
  - Look for:
    - asyncio.CancelledError
    - ClientDisconnect
    - worker restart messages from Gunicorn/Uvicorn
  - If present, increase worker timeout and check client behavior.
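For the "test without your proxy" step, you can hit the Uvicorn port directly with a streaming HTTP client and watch whether chunks arrive incrementally. A minimal sketch using httpx; the URL and query parameter match the example route above, so adjust them to your app:
# SKETCH: verify chunks arrive one by one when talking to Uvicorn directly
import httpx

with httpx.stream("GET", "http://127.0.0.1:8000/chat", params={"q": "test"}, timeout=None) as r:
    for chunk in r.iter_text():
        print(repr(chunk))  # prints should appear as tokens arrive, not all at once
If this streams cleanly on localhost but cuts off through your proxy, the problem is buffering or timeouts, not your LlamaIndex code.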
Prevention
- Use StreamingResponse end-to-end; do not materialize response.response_gen into a list unless you explicitly want non-streaming output.
- Set sane infrastructure timeouts (see the Gunicorn sketch below):
  - Uvicorn/Gunicorn worker timeout
  - Nginx proxy read timeout
  - ALB idle timeout
- Add disconnect handling in long-lived streams so your app exits cleanly when clients drop.
- Pin LlamaIndex versions and test streaming after every upgrade; API drift is common across releases.
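For the worker-timeout bullet specifically, here is a minimal sketch of a gunicorn.conf.py for a FastAPI app running Uvicorn workers; the numbers are illustrative, not recommendations, so tune them to your longest expected generation:
# SKETCH: gunicorn.conf.py (Gunicorn config files are plain Python)
worker_class = "uvicorn.workers.UvicornWorker"
workers = 2
timeout = 300            # workers silent longer than this are killed and restarted
graceful_timeout = 30    # time allowed for in-flight streams during shutdown
keepalive = 75           # keep idle HTTP connections open between requests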
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.