How to Fix 'streaming response cutoff in production' in LangChain (Python)
A "streaming response cutoff" in production usually means your LangChain app started streaming tokens, then the stream stopped before the model finished. In practice, this shows up when you stream through an API gateway, reverse proxy, serverless runtime, or async handler that closes the connection too early.
The failure is rarely in LangChain itself. It’s usually a transport issue, an event-loop issue, or a timeout somewhere between your Python process and the client.
The Most Common Cause
The #1 cause is returning a streaming response from a framework that buffers or closes the connection before the generator finishes. With LangChain, this often happens when you use ChatOpenAI(streaming=True) or astream_events() inside FastAPI/Flask/Django without wiring the response as a true streaming body.
Here’s the broken pattern:
| Broken | Fixed |
|---|---|
| Returns a normal JSON response after starting a stream | Returns a real streaming response object |
| Starts ChatOpenAI(streaming=True) but never yields tokens to the client | Pipes chunks directly to StreamingResponse |
| Often ends with errors like RuntimeError: Response content shorter than Content-Length, or with truncated output | Stream stays open until completion |
```python
# BROKEN: starts streaming but doesn't actually stream to the HTTP client
from fastapi import FastAPI
from langchain_openai import ChatOpenAI

app = FastAPI()
llm = ChatOpenAI(model="gpt-4o-mini", streaming=True)

@app.get("/chat")
async def chat():
    result = await llm.ainvoke("Write a short summary of PCI DSS.")
    return {"text": result.content}
```
```python
# FIXED: use a real streaming response
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

app = FastAPI()
llm = ChatOpenAI(model="gpt-4o-mini", streaming=True)

@app.get("/chat")
async def chat():
    async def token_stream():
        async for chunk in llm.astream([HumanMessage(content="Write a short summary of PCI DSS.")]):
            if chunk.content:
                yield chunk.content
    return StreamingResponse(token_stream(), media_type="text/plain")
```
If you’re using LangChain callbacks directly, the same rule applies: don’t start streaming in memory and then return a buffered object. The client must consume the same stream you’re generating.
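Many teams go one step further and frame the token stream as Server-Sent Events, which proxies and browsers tend to handle better than a bare text stream. A minimal sketch; the to_sse helper is my own illustration, not LangChain or FastAPI API, and the X-Accel-Buffering header is an nginx-specific hint that disables proxy buffering for the route:

```python
async def to_sse(tokens):
    """Wrap each token from an async iterator in SSE framing: one 'data: ...' event per token."""
    async for token in tokens:
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"  # sentinel so clients know the stream finished cleanly

# Wiring it into the fixed endpoint above (app and llm assumed from that example):
# @app.get("/chat-sse")
# async def chat_sse():
#     tokens = (c.content async for c in llm.astream([...]) if c.content)
#     return StreamingResponse(to_sse(tokens), media_type="text/event-stream",
#                              headers={"X-Accel-Buffering": "no"})
```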
Other Possible Causes
1) Proxy timeout or buffering
Nginx, Cloudflare, ALB, API Gateway, and some ingress controllers buffer responses by default. If they don’t see enough data quickly, they cut the connection.
```nginx
location /chat {
    proxy_buffering off;
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
}
```
If you’re behind AWS ALB or API Gateway, check idle timeout settings too. A model that pauses between chunks can still trigger a cutoff.
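One mitigation for idle timeouts is to interleave keepalive chunks whenever the model pauses for too long. A sketch under assumptions: with_keepalive and the default interval are my own illustration, not LangChain API, and the keepalive payload is an SSE comment line that clients ignore; tune the interval to sit below your proxy's idle timeout.

```python
import asyncio

async def with_keepalive(source, interval=15.0):
    """Yield items from async iterator `source`, emitting a keepalive line during long gaps."""
    it = source.__aiter__()
    while True:
        nxt = asyncio.ensure_future(it.__anext__())
        while True:
            try:
                # shield so a timeout doesn't cancel the pending fetch
                item = await asyncio.wait_for(asyncio.shield(nxt), timeout=interval)
                yield item
                break
            except asyncio.TimeoutError:
                yield ": keepalive\n\n"  # SSE comment line; clients skip it
            except StopAsyncIteration:
                return
```

Feed it the same token generator you pass to StreamingResponse; the proxy then sees steady traffic even while the model thinks.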
2) Serverless runtime limits
Lambda, Vercel functions, and similar environments often have execution or streaming limits. You may see partial output followed by abrupt termination.
```python
# Bad fit for long-lived streams on serverless
@app.get("/stream")
async def stream():
    return StreamingResponse(token_stream())
```
For serverless, keep streams short or move the endpoint to a containerized service with stable connections. LangChain can stream fine; the platform may not.
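If you must stay on serverless, one defensive pattern is to cap the stream's total duration below the platform's hard limit and tell the client when truncation happened, instead of letting the runtime kill the connection mid-chunk. A sketch; bounded_stream and the truncation marker are my own illustration, not LangChain API:

```python
import asyncio

async def bounded_stream(source, budget=25.0):
    """Yield from async iterator `source` until `budget` seconds elapse, then emit a marker."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + budget
    it = source.__aiter__()
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            yield "\n[truncated: stream budget exceeded]"
            return
        try:
            chunk = await asyncio.wait_for(it.__anext__(), timeout=remaining)
        except asyncio.TimeoutError:
            yield "\n[truncated: stream budget exceeded]"
            return
        except StopAsyncIteration:
            return  # source finished within budget
        yield chunk
```

Set the budget comfortably below the platform limit so the marker reaches the client before the runtime cuts the connection.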
3) Async misuse inside sync code
A common LangChain mistake is mixing invoke()/stream() with asyncio.run() inside an already-running event loop. That can produce cutoffs or hard failures like:
- RuntimeError: asyncio.run() cannot be called from a running event loop
- RuntimeError: Event loop is closed
```python
import asyncio

# Wrong: asyncio.run() fails if an event loop is already running
def handler():
    return asyncio.run(llm.ainvoke("Hello"))

# Right: stay async end-to-end
async def handler():
    return await llm.ainvoke("Hello")
```
If your framework is async-native, stay async end-to-end.
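The failure is easy to reproduce without LangChain at all; the same RuntimeError fires for any coroutine:

```python
import asyncio

async def inner():
    return "hello"

def sync_handler():
    # Wrong inside an async framework: a loop is already running here
    return asyncio.run(inner())

async def demo():
    try:
        sync_handler()
    except RuntimeError as exc:
        print(f"caught: {exc}")  # asyncio.run() cannot be called from a running event loop
    print(await inner())         # right: await inside the existing loop

asyncio.run(demo())
```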
4) Callback handler exceptions
A custom callback that raises during token emission can stop the stream mid-response. This is easy to miss because the LLM call itself looks fine until callbacks fire.
```python
from langchain_core.callbacks import BaseCallbackHandler

class BrokenHandler(BaseCallbackHandler):
    def on_llm_new_token(self, token: str, **kwargs):
        raise ValueError("logging failed")
```
Wrap callback logic defensively:
```python
class SafeHandler(BaseCallbackHandler):
    def on_llm_new_token(self, token: str, **kwargs):
        try:
            print(token)
        except Exception:
            pass  # never let observability break the stream
```
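The effect is reproducible in plain Python; none of the code below is LangChain API, it just stands in for a streaming loop that invokes a per-token callback:

```python
def stream_tokens(tokens, on_token):
    """Stand-in for a streaming loop that fires a callback per token."""
    emitted = []
    for t in tokens:
        on_token(t)        # a raising callback aborts the stream right here
        emitted.append(t)
    return emitted

def broken_callback(token):
    raise ValueError("logging failed")

def safe_callback(token):
    try:
        raise ValueError("logging failed")  # stand-in for a flaky logging call
    except Exception:
        pass               # swallowed: the stream keeps flowing
```

With broken_callback the loop dies on the first token; with safe_callback every token is emitted.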
How to Debug It
1) Confirm where the cutoff happens
- Run the same chain locally in a plain script.
- If llm.invoke() works but streaming cuts off only behind HTTP, it's transport-related.
- If local streaming also fails, inspect your LangChain code path first.

2) Remove all middleware and callbacks
- Disable custom callbacks, tracing hooks, and logging handlers.
- Test with raw ChatOpenAI + astream() only.
- If it stops failing, re-add components one by one.

3) Check proxy and platform timeouts
- Inspect Nginx proxy_read_timeout, ALB idle timeout, and Cloudflare limits.
- Check container health probes and ingress buffering.
- Look for truncated responses in access logs.

4) Log chunk arrival times
- Add timestamps per token/chunk.
- If there's a long gap before the cutoff, you're hitting an idle timeout.
- If chunks stop immediately after a callback fires, that callback is breaking the stream.
```python
import time

async def token_stream():
    async for chunk in llm.astream([HumanMessage(content="Explain SRP in one paragraph.")]):
        if chunk.content:
            print(f"{time.time():.3f} chunk={chunk.content!r}")
            yield chunk.content
```
Prevention
- Use StreamingResponse or an equivalent true streaming transport whenever you call stream(), astream(), or astream_events().
- Keep long-lived streams off serverless runtimes unless you've verified their execution and idle timeout limits.
- Treat callbacks as production code: catch exceptions inside them so observability never breaks user traffic.
If you want one rule to remember: LangChain can only stream as far as your HTTP stack allows it to stream. Most “cutoff” bugs are not model bugs; they’re response lifecycle bugs.
Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist + starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.