How to Fix 'streaming response cutoff in production' in LangGraph (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: streaming-response-cutoff-in-production · langgraph · python

What this error actually means

A “streaming response cutoff in production” usually means your LangGraph app started streaming tokens or events, then the stream was terminated before the full response finished. In practice, this shows up when you deploy a graph behind an HTTP server, proxy, or worker setup that does not keep the connection open long enough.

You’ll typically see it when using graph.stream() or graph.astream() in a FastAPI/ASGI app, especially under Gunicorn, Uvicorn workers, reverse proxies, or serverless runtimes.

The Most Common Cause

The #1 cause is returning a streaming generator from a request handler without keeping the ASGI response alive correctly. In production, something upstream closes the socket early, and LangGraph’s stream gets cut off mid-flight.

Here’s the broken pattern:

# broken.py
from fastapi import FastAPI
from langgraph.graph import StateGraph
from langchain_openai import ChatOpenAI

app = FastAPI()
llm = ChatOpenAI(model="gpt-4o-mini")

def build_graph():
    # ... graph construction omitted ...
    return graph

graph = build_graph()

@app.get("/chat")
async def chat():
    # BAD: returning a raw async generator without proper streaming response handling
    async def event_stream():
        async for chunk in graph.astream({"messages": []}):
            yield f"data: {chunk}\n\n"

    return event_stream()

And here’s the fixed pattern:

# fixed.py
from fastapi import FastAPI
from starlette.responses import StreamingResponse
from langgraph.graph import StateGraph
from langchain_openai import ChatOpenAI

app = FastAPI()
llm = ChatOpenAI(model="gpt-4o-mini")

def build_graph():
    # ... graph construction omitted ...
    return graph

graph = build_graph()

@app.get("/chat")
async def chat():
    async def event_stream():
        async for chunk in graph.astream({"messages": []}):
            yield f"data: {chunk}\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")

The difference is simple:

  • return event_stream() gives FastAPI an async generator object, not a proper streaming response.
  • StreamingResponse(...) keeps the connection open and handles the ASGI lifecycle correctly.

If you’re using LangGraph’s event APIs, also make sure you’re consuming them correctly:

  • graph.stream(...) for sync iteration
  • graph.astream(...) for async iteration
  • graph.invoke(...) / graph.ainvoke(...) if you do not need streaming
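Recent langgraph versions also accept a stream_mode argument, which controls what each chunk contains. A minimal sketch (the available modes depend on your version):

# "updates" yields per-node state deltas instead of full state snapshots
async for chunk in graph.astream({"messages": []}, stream_mode="updates"):
    # each chunk describes what a single node changed
    print(chunk)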

Other Possible Causes

1) Your reverse proxy buffers or times out the stream

Nginx and similar proxies often buffer responses by default. That breaks token streaming even if your Python code is correct.

location /chat {
    proxy_pass http://app;
    proxy_buffering off;
    proxy_read_timeout 3600;
    proxy_send_timeout 3600;
}

If buffering stays on, the client may only get part of the stream or nothing until the connection closes.
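If you cannot edit the proxy config directly, Nginx also honors a per-response header. A minimal sketch on the FastAPI side (X-Accel-Buffering is an Nginx-specific header, ignored by other proxies):

return StreamingResponse(
    event_stream(),
    media_type="text/event-stream",
    headers={
        "X-Accel-Buffering": "no",  # ask Nginx not to buffer this response
        "Cache-Control": "no-cache",
    },
)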

2) Gunicorn worker timeout kills long streams

If you run Uvicorn under Gunicorn with default timeouts, long LLM calls can get killed mid-stream.

gunicorn app:app \
  -k uvicorn.workers.UvicornWorker \
  --workers 2 \
  --timeout 120

For streaming endpoints that may take longer than usual, increase --timeout. If you use Kubernetes or another orchestrator, check pod-level request timeouts too.
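The same settings can live in a config file, which is easier to keep in version control. A sketch as gunicorn.conf.py (the values are examples; tune them for your workload):

# gunicorn.conf.py
worker_class = "uvicorn.workers.UvicornWorker"
workers = 2
timeout = 300          # allow long-running streams
graceful_timeout = 30  # time for in-flight responses to finish on reload
keepalive = 75         # hold idle keep-alive connections a bit longer

Run it with gunicorn app:app -c gunicorn.conf.py.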

3) You are mixing sync and async graph calls

A common mistake is calling stream() inside an async route or calling astream() without awaiting/iterating it properly.

# broken
@app.get("/bad")
async def bad():
    # graph.stream() is synchronous: each step blocks the event loop,
    # starving every other request handled by this worker
    for chunk in graph.stream({"messages": []}):
        print(chunk)

Use one mode consistently:

# fixed
@app.get("/good")
async def good():
    async for chunk in graph.astream({"messages": []}):
        print(chunk)

If your code path blocks the event loop with CPU work or sync I/O, the stream can stall and get cut off by upstream infrastructure.
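If your graph only exposes sync execution, you do not have to rewrite it: Starlette's StreamingResponse accepts a plain generator and iterates it in a threadpool, so the event loop stays free. A minimal sketch:

@app.get("/sync-chat")
async def sync_chat():
    def event_stream():
        # sync generator; StreamingResponse runs it in a threadpool
        for chunk in graph.stream({"messages": []}):
            yield f"data: {chunk}\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")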

4) The model provider stops sending tokens early

Sometimes the issue is not LangGraph at all. The underlying LLM call may fail with rate limits, context length issues, or provider-side disconnects.

Watch for errors like:

  • openai.APIConnectionError
  • openai.RateLimitError
  • httpx.ReadTimeout
  • langchain_core.exceptions.OutputParserException

Example:

llm = ChatOpenAI(
    model="gpt-4o-mini",
    timeout=60,
    max_retries=2,
)

If retries are too low or the timeout is too aggressive, a partial stream can look exactly like a LangGraph cutoff.
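To make provider failures visible instead of letting them look like a silent cutoff, catch them inside the generator and emit a final event. A sketch (the error event name here is our own convention, not part of SSE or LangGraph):

async def event_stream():
    try:
        async for chunk in graph.astream({"messages": []}):
            yield f"data: {chunk}\n\n"
    except Exception as exc:  # rate limit, timeout, provider disconnect, ...
        # a terminal SSE event lets the client tell an error apart
        # from a normal end-of-stream
        yield f"event: error\ndata: {exc}\n\n"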

How to Debug It

  1. Reproduce locally without proxies

    • Run the app directly with Uvicorn.
    • Hit the endpoint from curl:
      curl -N http://localhost:8000/chat
      
    • If it works locally but fails in prod, suspect proxy or timeout settings.
  2. Check whether you are using the right LangGraph API

    • Streaming:
      • graph.stream(...)
      • graph.astream(...)
    • Non-streaming:
      • graph.invoke(...)
      • graph.ainvoke(...)
    • If your endpoint does not need incremental tokens, stop streaming entirely.
  3. Inspect server and proxy logs

    • Look for:
      • 499 Client Closed Request
      • 504 Gateway Timeout
      • worker restarts
      • connection reset errors
    • These usually point to infrastructure cutting off the response rather than LangGraph itself.
  4. Add timing around each step

    import time
    
    start = time.time()
    async for chunk in graph.astream({"messages": []}):
        print("chunk after", time.time() - start)
        print(chunk)
    

    If chunks stop arriving after a specific node or tool call, that node is likely blocking or failing.
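To pin down which node, stream_mode="updates" helps: in recent langgraph versions each update chunk is keyed by the node that produced it. A sketch:

import time

start = time.monotonic()
async for update in graph.astream({"messages": []}, stream_mode="updates"):
    for node_name in update:  # chunk keys are node names in "updates" mode
        print(f"{time.monotonic() - start:6.2f}s  node={node_name}")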

Prevention

  • Use StreamingResponse or SSE/WebSocket plumbing that matches your framework.
  • Set explicit timeouts at every layer:
    • LLM client timeout
    • ASGI server timeout
    • reverse proxy timeout
    • load balancer idle timeout (see the heartbeat sketch after this list)
  • Prefer non-streaming execution unless you actually need token-by-token output.
  • Test behind production-like infrastructure early:
    • Nginx
    • Gunicorn/Uvicorn combo
    • cloud load balancer
    • container orchestration timeouts
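Idle timeouts punish long silent gaps between tokens, for example while a slow tool call runs. A heartbeat keeps the connection warm; here is a sketch (the 15-second interval and the SSE comment format are our choices, not a LangGraph feature; anext() requires Python 3.10+):

import asyncio

async def event_stream():
    agen = graph.astream({"messages": []})
    pending = asyncio.ensure_future(anext(agen))
    while True:
        # wait for the next chunk, but never longer than 15s at a time
        done, _ = await asyncio.wait({pending}, timeout=15)
        if not done:
            yield ": keep-alive\n\n"  # SSE comment, ignored by clients
            continue
        try:
            chunk = pending.result()
        except StopAsyncIteration:
            break
        yield f"data: {chunk}\n\n"
        pending = asyncio.ensure_future(anext(agen))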
