How to Fix 'chain execution stuck in production' in LlamaIndex (Python)
When a LlamaIndex chain gets “stuck” in production, it usually means the request is not failing fast — it’s waiting on a tool call, retriever, LLM client, or event loop that never completes. In practice, you’ll see this when using QueryEngine, AgentRunner, or ChatEngine under load, especially after deploying code that worked fine in local notebooks.
The symptom is often one of these:
- the API request hangs until your gateway times out
- logs stop after `Starting query...`
- you get no exception, just an open connection
- occasionally you'll see `asyncio.exceptions.TimeoutError`, `httpx.ReadTimeout`, or `RuntimeError: This event loop is already running`
The Most Common Cause
The #1 cause is mixing sync and async execution incorrectly.
In LlamaIndex, many components expose both sync and async paths. The broken pattern is calling async methods from a sync request handler without awaiting them, or wrapping sync calls inside an already-running event loop. That creates requests that never complete or deadlock under production servers like FastAPI/Uvicorn.
Broken vs fixed
| Broken pattern | Fixed pattern |
|---|---|
| Calls async method without `await` | Uses `await` end-to-end |
| Uses `asyncio.run()` inside an active event loop | Uses native async handler |
| Blocks the event loop with sync I/O | Keeps the whole path async |
```python
# BROKEN: sync handler calling async LlamaIndex code incorrectly
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

def handle_request(user_input: str):
    # In production this may hang or return a coroutine object instead of a response
    response = query_engine.aquery(user_input)
    return {"answer": response}
```
```python
# FIXED: async handler with proper await
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

async def handle_request(user_input: str):
    response = await query_engine.aquery(user_input)
    return {"answer": str(response)}
```
If you are using FastAPI, keep the whole request path async:
```python
from fastapi import FastAPI
from llama_index.core import VectorStoreIndex

app = FastAPI()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

@app.post("/query")
async def query(payload: dict):
    result = await query_engine.aquery(payload["text"])
    return {"answer": str(result)}
```
If your stack shows `RuntimeError: This event loop is already running`, that's the same class of problem. It usually happens when someone calls `asyncio.run()` inside an app server that already owns the loop.
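A minimal sketch of that mistake and the safe alternative, assuming a helper someone bolted onto an async server (`run_query_sync` and `run_query` are illustrative names, not LlamaIndex APIs):

```python
import asyncio

# BROKEN inside FastAPI/Uvicorn: the server already owns a running event
# loop, so asyncio.run() raises a RuntimeError instead of executing the query
def run_query_sync(query_engine, text: str):
    return asyncio.run(query_engine.aquery(text))

# SAFER: stay on the server's loop and await the coroutine directly
async def run_query(query_engine, text: str):
    return await query_engine.aquery(text)
```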
Other Possible Causes
1) Missing timeout on LLM or retriever calls
A slow upstream can look like a stuck chain if there’s no timeout.
```python
# BAD: no timeout configured, so a slow upstream blocks forever
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")
query_engine = index.as_query_engine(llm=llm)
```
```python
# GOOD: set an explicit client timeout and bounded retries
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini", timeout=30.0, max_retries=3)
query_engine = index.as_query_engine(llm=llm)
```
Also check vector DB clients and HTTP retrievers. A single slow network hop can stall the whole chain.
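If a custom retriever or reranker makes its own HTTP calls, give that client an explicit timeout too. A minimal sketch assuming an httpx-based internal search service (the URL and response shape are placeholders, not a LlamaIndex API):

```python
import httpx

# Hypothetical internal search endpoint
SEARCH_URL = "https://vector-db.internal/search"

async def fetch_candidates(query: str) -> list[dict]:
    # Fail fast on both connect and total time instead of hanging the chain
    timeout = httpx.Timeout(10.0, connect=3.0)
    async with httpx.AsyncClient(timeout=timeout) as client:
        resp = await client.post(SEARCH_URL, json={"q": query})
        resp.raise_for_status()
        return resp.json()["hits"]
```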
2) Tool call recursion in agents
Agents can loop forever if a tool keeps returning something that triggers the same tool again.
```python
# BAD: tool output feeds back into the same agent loop endlessly
from llama_index.core.agent import ReActAgent

agent = ReActAgent.from_tools([search_tool, db_tool])
response = agent.chat("Keep searching until you find it")
```
If your prompts are vague and tools are broad, the agent may keep selecting the same tool. Add max iterations and tighter tool descriptions.
```python
# GOOD: constrain iteration count and narrow the tool set
from llama_index.core.agent import ReActAgent

agent = ReActAgent.from_tools([search_tool], max_iterations=5)
```

With a hard cap, a looping agent stops with an error you can catch instead of holding the connection open forever.
3) Large document ingestion blocking startup
If you build indexes at app startup with thousands of documents, your service may appear stuck before it starts serving traffic.
```python
# BAD: heavy indexing during application boot
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(huge_document_list)
query_engine = index.as_query_engine()
```
Move ingestion to a background job or precompute indexes offline.
```python
# GOOD: load a prebuilt, persisted index at runtime
from llama_index.core import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()
```
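For completeness, the offline job that produces `./storage` can be this small; run it in CI or a batch task, never at boot (the `./data` path is illustrative):

```python
# Offline/CI job: build the index once and persist it to disk
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./storage")
```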
4) Deadlock from synchronous file/network calls inside custom callbacks
Custom callbacks or postprocessors that do blocking work can freeze the chain.
```python
# BAD: blocking I/O inside the callback path freezes the event loop
import requests

def on_event(event):
    requests.post("https://internal-audit/api/log", json=event.dict())
```
Use async clients or push events to a queue.
```python
# GOOD: non-blocking event emission
import httpx

async def on_event(event):
    async with httpx.AsyncClient(timeout=5.0) as client:
        await client.post("https://internal-audit/api/log", json=event.dict())
```
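The queue option decouples emission entirely: the request path only enqueues, and a background task drains the queue, so a slow audit endpoint can never stall a query. A minimal sketch with `asyncio.Queue` (the worker wiring is an assumption about your app, not a LlamaIndex API):

```python
import asyncio
import httpx

event_queue: asyncio.Queue = asyncio.Queue(maxsize=1000)

def on_event(event):
    # Non-blocking: drop events rather than stall the request path
    try:
        event_queue.put_nowait(event.dict())
    except asyncio.QueueFull:
        pass  # or increment a "dropped events" metric

async def event_worker():
    # Start once at app startup, e.g. asyncio.create_task(event_worker())
    async with httpx.AsyncClient(timeout=5.0) as client:
        while True:
            payload = await event_queue.get()
            try:
                await client.post("https://internal-audit/api/log", json=payload)
            except httpx.HTTPError:
                pass  # log and move on; never crash the worker
```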
How to Debug It
1) Find where execution stops
- Add logs before and after each major step: input parsing, retrieval, synthesis, tool selection (see the logging sketch after this list).
- If logs stop at `query_engine.aquery(...)`, you know the hang is inside LlamaIndex or its dependencies.

2) Check for event loop misuse
- Search for: `asyncio.run(...)`, `.aquery(...)` without `await`, and sync `.chat(...)` called from async code when an async variant exists.
- If you see `RuntimeError: This event loop is already running`, fix the call boundary first.

3) Force timeouts around every external call
- Wrap queries with a timeout so “stuck” becomes visible:

```python
import asyncio

result = await asyncio.wait_for(query_engine.aquery("test"), timeout=30)
```

- If it times out consistently at one stage, that stage is your bottleneck.

4) Disable tools and simplify
- Run the same prompt against: a plain `VectorStoreIndex` query, then the retriever only, then the agent with one tool.
- If the simple query works but the agent hangs, your issue is tool recursion or callback blocking.
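A minimal sketch of the step-1 logging brackets, assuming the `query_engine` from the earlier snippets (the stage names are arbitrary):

```python
import logging
import time

logger = logging.getLogger("rag")

async def answer(user_input: str):
    logger.info("parse:start")
    query = user_input.strip()
    logger.info("parse:done")

    logger.info("query:start")
    t0 = time.monotonic()
    # If "query:start" is your last log line, the hang is inside this call
    response = await query_engine.aquery(query)
    logger.info("query:done in %.1fs", time.monotonic() - t0)
    return str(response)
```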
Prevention
- Keep sync and async paths consistent end-to-end. If your web server is async, use `await` all the way through.
- Set explicit timeouts on LLMs, retrievers, vector DB clients, and outbound HTTP calls.
- Prebuild indexes offline and load persisted storage in production instead of ingesting large corpora at startup.
- Put hard limits on agents:
  - max iterations
  - narrow tool descriptions
  - strict output schemas where possible
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.