How to Fix 'streaming response cutoff when scaling' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21

A "streaming response cutoff when scaling" error in LlamaIndex usually means your streamed output is being interrupted before the final tokens are flushed to the client. In practice, it shows up when you move from local testing to multiple workers, async handlers, reverse proxies, or a serverless runtime that does not keep the connection open long enough.

The key detail: this is usually not a “LlamaIndex bug” in the core retrieval pipeline. It is almost always a transport, lifecycle, or buffering problem around StreamingResponse, ResponseSynthesizer, or your web server.

The Most Common Cause

The #1 cause is returning a stream from a request handler while the underlying generator depends on objects that get garbage-collected, closed, or outlived by the request scope.

This happens a lot when people build a query engine with streaming=True, then wrap it in FastAPI or another ASGI app and return the iterator directly without keeping the response source alive.

Broken vs fixed pattern

| Broken pattern | Fixed pattern |
| --- | --- |
| Creates the stream inside a short-lived scope and returns an iterator tied to it | Keeps the generator and query engine alive for the full request |
| Often fails under load balancing or multiple workers | Works reliably with ASGI streaming |
| Can surface as truncated output or a StreamingResponse cutoff | Flushes tokens until completion |
# BROKEN
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_index.core import VectorStoreIndex

app = FastAPI()

@app.get("/chat")
def chat():
    # docs is assumed to be a list of Document objects loaded elsewhere
    index = VectorStoreIndex.from_documents(docs)
    query_engine = index.as_query_engine(streaming=True)

    # The stream is tied to local objects that may not survive cleanly
    # once this function returns
    response = query_engine.query("Summarize the policy changes.")
    return StreamingResponse(response.response_gen, media_type="text/plain")

# FIXED
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_index.core import VectorStoreIndex

app = FastAPI()

# Build the index and query engine once, at startup, so they
# outlive any individual request
index = VectorStoreIndex.from_documents(docs)  # docs loaded elsewhere
query_engine = index.as_query_engine(streaming=True)

@app.get("/chat")
def chat():
    response = query_engine.query("Summarize the policy changes.")

    def token_stream():
        # Closing over `response` keeps the token source alive for the
        # full duration of the streamed request
        for token in response.response_gen:
            yield token

    return StreamingResponse(token_stream(), media_type="text/plain")

If you are using StreamingResponse, keep the stream generator simple and ensure the response_gen remains valid for the full duration of the request. In real deployments, I also recommend building the index at startup, not per-request.

Other Possible Causes

1. Reverse proxy buffering

Nginx, an ALB, Cloudflare, or an API gateway can buffer chunks and make a healthy stream look like it was cut off.

location /chat {
    proxy_buffering off;
    proxy_cache off;
    chunked_transfer_encoding on;
}

If buffering is enabled, tokens may arrive late or stop appearing after a certain payload size.
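
If you cannot change the Nginx config itself, Nginx also honors a per-response X-Accel-Buffering header, so a single endpoint can opt out of buffering from the application side. A minimal sketch, reusing the app and query engine from the fixed pattern above:

@app.get("/chat")
def chat():
    response = query_engine.query("Summarize the policy changes.")
    return StreamingResponse(
        response.response_gen,
        media_type="text/plain",
        # Nginx disables proxy buffering for responses carrying this header
        headers={"X-Accel-Buffering": "no"},
    )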

2. Worker timeout killing long streams

Gunicorn/Uvicorn worker timeouts will terminate long-running streams mid-response.

gunicorn app:app \
  -k uvicorn.workers.UvicornWorker \
  --workers 4 \
  --timeout 120

If your RAG pipeline takes longer than the default timeout (30 seconds for Gunicorn), you will see incomplete output even though LlamaIndex was still generating.
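
These flags can also live in a gunicorn.conf.py, which Gunicorn loads automatically from the working directory and which is plain Python; a sketch with values you should tune to your own generation latency:

# gunicorn.conf.py -- picked up automatically from the working directory
workers = 4
worker_class = "uvicorn.workers.UvicornWorker"
timeout = 120          # hard limit before an unresponsive worker is killed
graceful_timeout = 30  # grace period for in-flight requests during restarts
keepalive = 75         # seconds to hold idle keep-alive connections open

With this file in place, the launch command shrinks to gunicorn app:app.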

3. Async/sync mismatch in your handler

Calling sync LlamaIndex code from an async endpoint blocks the event loop, which can stall every other in-flight stream under concurrency.

# BAD: blocking sync call inside async route
@app.get("/answer")
async def answer():
    response = query_engine.query("What changed?")
    return StreamingResponse(response.response_gen)

Use an async-friendly path if your stack supports it, or move blocking work into a thread pool.

# BETTER: isolate blocking work
from starlette.concurrency import run_in_threadpool

@app.get("/answer")
async def answer():
    # Run the blocking query in a worker thread so the event loop stays free
    response = await run_in_threadpool(query_engine.query, "What changed?")
    # Starlette iterates the sync generator in a thread pool internally
    return StreamingResponse(response.response_gen, media_type="text/plain")
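
If your stack is fully async, LlamaIndex query engines also expose an async aquery entry point that keeps the event loop free without a thread pool. A sketch, with the caveat that the streamed object differs across LlamaIndex versions:

# ALTERNATIVE: native async path (verify against your LlamaIndex version)
@app.get("/answer")
async def answer():
    # aquery awaits retrieval and synthesis without blocking the event loop
    response = await query_engine.aquery("What changed?")
    # Depending on the version, the stream is response.response_gen or an
    # async generator such as response.async_response_gen()
    return StreamingResponse(response.response_gen)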

4. Memory pressure during scaling

When replicas scale up and memory gets tight, Python processes can get killed or throttled before streaming completes.

# A tight memory limit like this often ends in OOMKilled events under load
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "512Mi"

This often appears as a partial stream with no clean Python exception because the container died underneath you.

How to Debug It

  1. Check whether LlamaIndex finished generating

    • Log before and after query_engine.query(...), and again after the last token is yielded (see the logging sketch after this list).
    • If you never see the "finished" log line, the process is being interrupted before completion.
    • In stack traces and logs, look for class names like StreamingResponse, ResponseSynthesizer, and QueryEngine to pin down which layer failed.
  2. Remove all network layers

    • Call the same code locally in a plain Python script.
    • If it works there but fails behind FastAPI/Nginx/ALB, this is not an index issue.
    • You are debugging transport behavior, not retrieval logic.
  3. Disable streaming once

    • Switch from streaming to non-streaming output:
      query_engine = index.as_query_engine(streaming=False)
      
    • If full responses now work consistently, your problem is in stream handling or proxy buffering.
    • If even non-streaming truncates, inspect worker timeouts and process restarts.
  4. Inspect infra logs

    • Check Uvicorn/Gunicorn logs for worker restarts.
    • Check Kubernetes events for OOMKilled.
    • Check Nginx access/error logs for upstream disconnects and timeout messages.
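
For step 1, a minimal logging sketch that shows whether generation actually finished or the transport died first; the logger name and route are illustrative:

import logging

logger = logging.getLogger("rag-stream")

@app.get("/chat")
def chat():
    logger.info("query start")
    response = query_engine.query("Summarize the policy changes.")
    logger.info("query returned, streaming tokens")

    def token_stream():
        for token in response.response_gen:
            yield token
        # If this line never logs, the stream was cut off mid-response
        logger.info("stream finished")

    return StreamingResponse(token_stream(), media_type="text/plain")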

Prevention

  • Build indexes and retrievers at app startup, not inside each request (see the startup sketch after this list).
  • Keep streaming endpoints behind infrastructure configured for long-lived chunked responses.
  • Set explicit timeouts across app server, proxy, load balancer, and client so one layer does not kill streams early.
  • Test both streaming=True and streaming=False paths before shipping to production.
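
For the first point, a sketch of startup-time construction using FastAPI's lifespan hook; the data path and loader are placeholders:

from contextlib import asynccontextmanager

from fastapi import FastAPI
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

state = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Build heavy objects once per worker process, before traffic arrives
    docs = SimpleDirectoryReader("data").load_data()  # placeholder path
    index = VectorStoreIndex.from_documents(docs)
    state["query_engine"] = index.as_query_engine(streaming=True)
    yield
    state.clear()

app = FastAPI(lifespan=lifespan)

Request handlers then read state["query_engine"] instead of rebuilding the index, so every worker pays the construction cost exactly once.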

If you want a stable production setup with LlamaIndex streaming, treat it like any other long-lived HTTP connection. The model output is only half of the problem; your server stack has to stay alive long enough to deliver every token.


By Cyprian Aarons, AI Consultant at Topiax.