How to Fix 'chain execution stuck in production' in LlamaIndex (Python)
When a LlamaIndex chain gets “stuck” in production, it usually means the request is not failing fast — it’s waiting on a tool call, retriever, LLM client, or event loop that never completes. In practice, you’ll see this when using QueryEngine, AgentRunner, or ChatEngine under load, especially after deploying code that worked fine in local notebooks.
The symptom is often one of these:
- the API request hangs until your gateway times out
- logs stop after `Starting query...`
- you get no exception, just an open connection
- occasionally you'll see `asyncio.exceptions.TimeoutError`, `httpx.ReadTimeout`, or `RuntimeError: This event loop is already running`
The Most Common Cause
The #1 cause is mixing sync and async execution incorrectly.
In LlamaIndex, many components expose both sync and async paths. The broken pattern is calling async methods from a sync request handler without awaiting them, or wrapping sync calls inside an already-running event loop. That creates requests that never complete or deadlock under production servers like FastAPI/Uvicorn.
Broken vs fixed
| Broken pattern | Fixed pattern |
|---|---|
| Calls async method without `await` | Uses `await` end-to-end |
| Uses `asyncio.run()` inside an active event loop | Uses native async handler |
| Blocks the event loop with sync I/O | Keeps the whole path async |
```python
# BROKEN: sync handler calling async LlamaIndex code incorrectly
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

def handle_request(user_input: str):
    # In production this may hang or return a coroutine object instead of a response
    response = query_engine.aquery(user_input)
    return {"answer": response}
```
```python
# FIXED: async handler with proper await
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

async def handle_request(user_input: str):
    response = await query_engine.aquery(user_input)
    return {"answer": str(response)}
```
If you are using FastAPI, keep the whole request path async:
```python
from fastapi import FastAPI
from llama_index.core import VectorStoreIndex

app = FastAPI()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

@app.post("/query")
async def query(payload: dict):
    result = await query_engine.aquery(payload["text"])
    return {"answer": str(result)}
```
If your stack shows `RuntimeError: This event loop is already running`, that's the same class of problem. It usually happens when someone calls `asyncio.run()` inside an app server that already owns the loop.
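A minimal sketch of that mistake and the safe alternative, assuming a helper someone bolted onto an async server (`run_query_sync` and `run_query` are illustrative names, not LlamaIndex APIs):

```python
import asyncio

# BROKEN inside FastAPI/Uvicorn: the server already owns a running event
# loop, so asyncio.run() raises a RuntimeError instead of executing the query
def run_query_sync(query_engine, text: str):
    return asyncio.run(query_engine.aquery(text))

# SAFER: stay on the server's loop and await the coroutine directly
async def run_query(query_engine, text: str):
    return await query_engine.aquery(text)
```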
Other Possible Causes
1) Missing timeout on LLM or retriever calls
A slow upstream can look like a stuck chain if there’s no timeout.
```python
# BAD: no timeout configured, so a slow upstream blocks forever
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")
query_engine = index.as_query_engine(llm=llm)
```
```python
# GOOD: set an explicit client timeout and bounded retries
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini", timeout=30.0, max_retries=3)
query_engine = index.as_query_engine(llm=llm)
```
Also check vector DB clients and HTTP retrievers. A single slow network hop can stall the whole chain.
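If a custom retriever or reranker makes its own HTTP calls, give that client an explicit timeout too. A minimal sketch assuming an httpx-based internal search service (the URL and response shape are placeholders, not a LlamaIndex API):

```python
import httpx

# Hypothetical internal search endpoint
SEARCH_URL = "https://vector-db.internal/search"

async def fetch_candidates(query: str) -> list[dict]:
    # Fail fast on both connect and total time instead of hanging the chain
    timeout = httpx.Timeout(10.0, connect=3.0)
    async with httpx.AsyncClient(timeout=timeout) as client:
        resp = await client.post(SEARCH_URL, json={"q": query})
        resp.raise_for_status()
        return resp.json()["hits"]
```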
2) Tool call recursion in agents
Agents can loop forever if a tool keeps returning something that triggers the same tool again.
```python
# BAD: tool output feeds back into the same agent loop endlessly
from llama_index.core.agent import ReActAgent

agent = ReActAgent.from_tools([search_tool, db_tool])
response = agent.chat("Keep searching until you find it")
```
If your prompts are vague and tools are broad, the agent may keep selecting the same tool. Add max iterations and tighter tool descriptions.
```python
# GOOD: constrain iteration count and narrow the tool set
from llama_index.core.agent import ReActAgent

agent = ReActAgent.from_tools([search_tool], max_iterations=5)
```

With a hard cap, a looping agent stops with an error you can catch instead of holding the connection open forever.
3) Large document ingestion blocking startup
If you build indexes at app startup with thousands of documents, your service may appear stuck before it starts serving traffic.
```python
# BAD: heavy indexing during application boot
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(huge_document_list)
query_engine = index.as_query_engine()
```
Move ingestion to a background job or precompute indexes offline.
```python
# GOOD: load a prebuilt, persisted index at runtime
from llama_index.core import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()
```
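For completeness, the offline job that produces `./storage` can be this small; run it in CI or a batch task, never at boot (the `./data` path is illustrative):

```python
# Offline/CI job: build the index once and persist it to disk
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./storage")
```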
4) Deadlock from synchronous file/network calls inside custom callbacks
Custom callbacks or postprocessors that do blocking work can freeze the chain.
```python
# BAD: blocking I/O inside the callback path freezes the event loop
import requests

def on_event(event):
    requests.post("https://internal-audit/api/log", json=event.dict())
```
Use async clients or push events to a queue.
```python
# GOOD: non-blocking event emission
import httpx

async def on_event(event):
    async with httpx.AsyncClient(timeout=5.0) as client:
        await client.post("https://internal-audit/api/log", json=event.dict())
```
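The queue option decouples emission entirely: the request path only enqueues, and a background task drains the queue, so a slow audit endpoint can never stall a query. A minimal sketch with `asyncio.Queue` (the worker wiring is an assumption about your app, not a LlamaIndex API):

```python
import asyncio
import httpx

event_queue: asyncio.Queue = asyncio.Queue(maxsize=1000)

def on_event(event):
    # Non-blocking: drop events rather than stall the request path
    try:
        event_queue.put_nowait(event.dict())
    except asyncio.QueueFull:
        pass  # or increment a "dropped events" metric

async def event_worker():
    # Start once at app startup, e.g. asyncio.create_task(event_worker())
    async with httpx.AsyncClient(timeout=5.0) as client:
        while True:
            payload = await event_queue.get()
            try:
                await client.post("https://internal-audit/api/log", json=payload)
            except httpx.HTTPError:
                pass  # log and move on; never crash the worker
```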
How to Debug It
1) Find where execution stops
- Add logs before and after each major step: input parsing, retrieval, synthesis, tool selection (see the logging sketch after this list).
- If logs stop at `query_engine.aquery(...)`, you know the hang is inside LlamaIndex or its dependencies.

2) Check for event loop misuse
- Search for: `asyncio.run(...)`, `.aquery(...)` without `await`, and sync `.chat(...)` called from async code when an async variant exists.
- If you see `RuntimeError: This event loop is already running`, fix the call boundary first.

3) Force timeouts around every external call
- Wrap queries with a timeout so “stuck” becomes visible:

```python
import asyncio

result = await asyncio.wait_for(query_engine.aquery("test"), timeout=30)
```

- If it times out consistently at one stage, that stage is your bottleneck.

4) Disable tools and simplify
- Run the same prompt against: a plain `VectorStoreIndex` query, then the retriever only, then the agent with one tool.
- If the simple query works but the agent hangs, your issue is tool recursion or callback blocking.
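A minimal sketch of the step-1 logging brackets, assuming the `query_engine` from the earlier snippets (the stage names are arbitrary):

```python
import logging
import time

logger = logging.getLogger("rag")

async def answer(user_input: str):
    logger.info("parse:start")
    query = user_input.strip()
    logger.info("parse:done")

    logger.info("query:start")
    t0 = time.monotonic()
    # If "query:start" is your last log line, the hang is inside this call
    response = await query_engine.aquery(query)
    logger.info("query:done in %.1fs", time.monotonic() - t0)
    return str(response)
```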
Prevention
- Keep sync and async paths consistent end-to-end. If your web server is async, use `await` all the way through.
- Set explicit timeouts on LLMs, retrievers, vector DB clients, and outbound HTTP calls.
- Prebuild indexes offline and load persisted storage in production instead of ingesting large corpora at startup.
- Put hard limits on agents:
  - max iterations
  - narrow tool descriptions
  - strict output schemas where possible
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.