How to Fix 'streaming response cutoff when scaling' in LlamaIndex (Python)
A streaming response cutoff when scaling error in LlamaIndex usually means your streamed output is being interrupted before the final tokens are flushed to the client. In practice, it shows up when you move from local testing to multiple workers, async handlers, reverse proxies, or a serverless runtime that does not keep the connection open long enough.
The key detail: this is usually not a “LlamaIndex bug” in the core retrieval pipeline. It is almost always a transport, lifecycle, or buffering problem around StreamingResponse, ResponseSynthesizer, or your web server.
The Most Common Cause
The #1 cause is returning a stream from a request handler while the underlying generator depends on objects that get garbage-collected, closed, or outlived by the request scope.
This happens a lot when people build a query engine with streaming=True, then wrap it in FastAPI or another ASGI app and return the iterator directly without keeping the response source alive.
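The failure mode is easiest to see without any web framework. Below is a minimal stdlib sketch; `TokenSource` is a hypothetical stand-in for LlamaIndex's streaming response object, and the two handlers mirror the broken and fixed FastAPI patterns shown later:

```python
class TokenSource:
    """Stand-in for a streaming LLM response object (hypothetical)."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.closed = False

    def close(self):
        self.closed = True

    def stream(self):
        for token in self.tokens:
            if self.closed:   # the resource died underneath the generator
                return        # the stream silently ends early: a "cutoff"
            yield token


def broken_handler():
    source = TokenSource(["Hello", " ", "world", "!"])
    gen = source.stream()
    source.close()  # simulates the request scope tearing the object down
    return gen      # the caller receives a stream that will truncate


def fixed_handler():
    source = TokenSource(["Hello", " ", "world", "!"])

    def token_stream():
        try:
            yield from source.stream()
        finally:
            source.close()  # clean up only after every token is delivered

    return token_stream()


received = list(broken_handler())        # truncates immediately: []
received_fixed = list(fixed_handler())   # all four tokens arrive
```

The broken handler returns a generator whose backing resource has already been closed, so iteration ends with no exception at all, which is exactly why these cutoffs are so quiet in production.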
Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Creates the stream inside a short-lived scope and returns an iterator tied to it | Keeps the generator and query engine alive for the full request |
| Often fails under load balancing or multiple workers | Works reliably with ASGI streaming |
| Can surface as truncated output or StreamingResponse cutoff | Flushes tokens until completion |
```python
# BROKEN
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_index.core import VectorStoreIndex

app = FastAPI()

@app.get("/chat")
def chat():
    # `docs` is assumed to be loaded elsewhere
    index = VectorStoreIndex.from_documents(docs)
    query_engine = index.as_query_engine(streaming=True)
    # This stream is tied to local objects that may not survive cleanly
    response = query_engine.query("Summarize the policy changes.")
    return StreamingResponse(response.response_gen, media_type="text/plain")
```
```python
# FIXED
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_index.core import VectorStoreIndex

app = FastAPI()

# Build the index and engine once at startup, not per request
# (`docs` is assumed to be loaded elsewhere)
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(streaming=True)

@app.get("/chat")
async def chat():
    response = query_engine.query("Summarize the policy changes.")

    def token_stream():
        for token in response.response_gen:
            yield token

    return StreamingResponse(token_stream(), media_type="text/plain")
```
If you are using StreamingResponse, keep the stream generator simple and ensure the response_gen remains valid for the full duration of the request. In real deployments, I also recommend building the index at startup, not per-request.
Other Possible Causes
1. Reverse proxy buffering
Nginx, ALB, Cloudflare, or an API gateway can buffer chunks and make streaming look like it cut off.
```nginx
location /chat {
    proxy_buffering off;
    proxy_cache off;
    chunked_transfer_encoding on;
}
```
If buffering is enabled, tokens may arrive late or stop appearing after a certain payload size.
2. Worker timeout killing long streams
Gunicorn/Uvicorn worker timeouts will terminate long-running streams mid-response.
```shell
gunicorn app:app \
  -k uvicorn.workers.UvicornWorker \
  --workers 4 \
  --timeout 120
```
If your RAG chain takes longer than the default timeout, you will see incomplete output even though LlamaIndex was still generating.
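Raising the timeout is the direct fix, but long silent gaps between tokens can still trip idle-timeout layers. One mitigation (a sketch, not a LlamaIndex API) is to wrap the token generator so it emits a heartbeat whenever the model goes quiet for too long:

```python
import asyncio

HEARTBEAT = ": keep-alive\n\n"  # an SSE-style comment line clients ignore

async def with_heartbeat(token_gen, interval=0.05):
    """Yield tokens from token_gen, emitting a heartbeat whenever no
    token arrives within `interval` seconds, so idle-timeout layers
    keep the connection open while the model is still thinking."""
    queue: asyncio.Queue = asyncio.Queue()
    done = object()

    async def pump():
        async for token in token_gen:
            await queue.put(token)
        await queue.put(done)

    task = asyncio.create_task(pump())
    try:
        while True:
            try:
                item = await asyncio.wait_for(queue.get(), timeout=interval)
            except asyncio.TimeoutError:
                yield HEARTBEAT   # nothing yet: keep the pipe warm
                continue
            if item is done:
                break
            yield item
    finally:
        task.cancel()

async def slow_tokens():
    # stand-in for a slow response_gen with long gaps between tokens
    for token in ["partial", " answer"]:
        await asyncio.sleep(0.12)
        yield token

async def main():
    return [chunk async for chunk in with_heartbeat(slow_tokens())]

chunks = asyncio.run(main())
```

In production you would use a much longer interval (seconds, not milliseconds) and a heartbeat format your client knows to discard.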
3. Async/sync mismatch in your handler
Calling sync LlamaIndex code from an async endpoint can block event loop progress and interrupt streaming behavior under concurrency.
```python
# BAD: blocking sync call inside async route
@app.get("/answer")
async def answer():
    response = query_engine.query("What changed?")
    return StreamingResponse(response.response_gen)
```
Use an async-friendly path if your stack supports it, or move blocking work into a thread pool.
```python
# BETTER: isolate blocking work
from starlette.concurrency import run_in_threadpool

@app.get("/answer")
async def answer():
    response = await run_in_threadpool(query_engine.query, "What changed?")
    return StreamingResponse(response.response_gen)
```
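The same isolation works with the standard library alone: `asyncio.to_thread` (Python 3.9+) is the stdlib analogue of starlette's `run_in_threadpool`. This self-contained sketch uses a fake blocking query and a background counter to show the event loop staying responsive:

```python
import asyncio
import time

def blocking_query(prompt: str) -> str:
    # stand-in for a synchronous query_engine.query(...) call
    time.sleep(0.1)
    return f"answer to: {prompt}"

async def handler():
    ticks = 0

    async def other_work():
        nonlocal ticks
        while True:
            await asyncio.sleep(0.02)
            ticks += 1  # only advances if the event loop is not blocked

    background = asyncio.create_task(other_work())
    # the blocking call runs in a worker thread; the loop stays free
    # to serve other requests and flush other streams
    result = await asyncio.to_thread(blocking_query, "What changed?")
    background.cancel()
    return result, ticks

result, ticks = asyncio.run(handler())
```

If you inline the `time.sleep` call instead of using `to_thread`, the tick count drops to zero, which is exactly the starvation that interrupts concurrent streams.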
4. Memory pressure during scaling
When replicas scale up and memory gets tight, Python processes can get killed or throttled before streaming completes.
```yaml
# Example symptom: OOMKilled in Kubernetes logs
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "512Mi"
```
This often appears as a partial stream with no clean Python exception because the container died underneath you.
How to Debug It
1. Check whether LlamaIndex finished generating
   - Log before and after `query_engine.query(...)`.
   - If you never see the "finished" log line, the process is being interrupted before completion.
   - Look for class names like `StreamingResponse`, `ResponseSynthesizer`, and `QueryEngine`.
2. Remove all network layers
   - Call the same code locally in a plain Python script.
   - If it works there but fails behind FastAPI/Nginx/ALB, this is not an index issue.
   - You are debugging transport behavior, not retrieval logic.
3. Disable streaming once
   - Switch to non-streaming output: `query_engine = index.as_query_engine(streaming=False)`
   - If full responses now work consistently, your problem is in stream handling or proxy buffering.
   - If even non-streaming truncates, inspect worker timeouts and process restarts.
4. Inspect infra logs
   - Check Uvicorn/Gunicorn logs for worker restarts.
   - Check Kubernetes events for `OOMKilled`.
   - Check Nginx access/error logs for upstream disconnects and timeout messages.
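For step 1, a small wrapper makes "did generation actually finish?" unambiguous. This is a hypothetical helper, not a LlamaIndex API: it wraps any token generator and logs completion from a `finally` block, so the absence of the finish line proves the stream was torn down early:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("stream")

def logged_stream(token_gen):
    """Wrap a token generator with start/finish logging.
    If "stream finished" never appears in the logs, the stream was
    torn down before the generator was exhausted."""
    count = 0
    logger.info("stream started")
    try:
        for token in token_gen:
            count += 1
            yield token
    finally:
        logger.info("stream finished after %d tokens", count)

# in production: StreamingResponse(logged_stream(response.response_gen), ...)
tokens = list(logged_stream(iter(["a", "b", "c"])))
```

The token count in the finish line also tells you whether the cutoff happened mid-generation or after the last token but before the final flush.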
Prevention
- Build indexes and retrievers at app startup, not inside each request.
- Keep streaming endpoints behind infrastructure configured for long-lived chunked responses.
- Set explicit timeouts across app server, proxy, load balancer, and client so one layer does not kill streams early.
- Test both `streaming=True` and `streaming=False` paths before shipping to production.
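One way to make the timeout rule concrete: each layer that fronts another should wait at least as long as the layer inside it, so the innermost stream is never outlived by accident. A hypothetical numbers-only sketch (the values are illustrative, not recommendations):

```python
# timeouts in seconds, innermost layer first; values are illustrative
layers = [
    ("app_server", 120),      # e.g. gunicorn --timeout
    ("proxy", 180),           # e.g. nginx proxy_read_timeout
    ("load_balancer", 240),   # e.g. ALB idle timeout
    ("client", 300),          # e.g. HTTP client read timeout
]

def check_timeouts(layers):
    """Return (outer, inner) pairs where an outer layer would cut a
    stream before the layer inside it has given up."""
    violations = []
    for (inner, t_in), (outer, t_out) in zip(layers, layers[1:]):
        if t_out < t_in:
            violations.append((outer, inner))
    return violations

violations = check_timeouts(layers)  # [] for the values above
```

Encoding this as a startup assertion catches the classic mistake of raising the gunicorn timeout while the load balancer in front still idles out at its default.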
If you want a stable production setup with LlamaIndex streaming, treat it like any other long-lived HTTP connection. The model output is only half of the problem; your server stack has to stay alive long enough to deliver every token.
By Cyprian Aarons, AI Consultant at Topiax.