# How to Fix "chain execution stuck when scaling" in LlamaIndex (Python)
When "chain execution stuck when scaling" shows up in LlamaIndex, it usually means your pipeline is not dead: it is blocked. In practice, this happens when you scale from a single local run to multiple requests, larger indexes, or async execution, and one part of the chain never returns.

The usual trigger is a mix of long-running retrieval, nested async calls, shared mutable state, or an LLM client waiting forever on a network call. If you are seeing `WorkflowRuntimeError`, `asyncio.TimeoutError`, or a chain that simply hangs after `RetrieverQueryEngine.query()`, this is where to look.
## The Most Common Cause

The #1 cause is mixing sync and async incorrectly inside a LlamaIndex chain.

A common broken pattern is calling blocking code inside an async workflow, or forgetting to `await` an async LlamaIndex call. Under load this looks like the chain being "stuck", but the real issue is that the event loop is blocked or a coroutine is never awaited.

### Broken vs fixed
| Broken pattern | Fixed pattern |
|---|---|
| Calls sync code inside async path | Uses proper async APIs end-to-end |
| Creates a new client per request | Reuses configured async client |
| Hides the hang behind retries | Fails fast with timeout handling |
```python
# BROKEN
import asyncio

from llama_index.core import VectorStoreIndex

async def handle_query(query_engine, question: str):
    # query() is sync; calling it inside async code blocks the event loop
    response = query_engine.query(question)
    return response

async def main():
    index = VectorStoreIndex.from_documents(docs)  # docs loaded elsewhere
    query_engine = index.as_query_engine()
    result = await handle_query(query_engine, "What does the policy say?")
    print(result)

asyncio.run(main())
```
```python
# FIXED
import asyncio

from llama_index.core import VectorStoreIndex

async def handle_query(query_engine, question: str):
    # Use the async API if you are already in async code
    response = await query_engine.aquery(question)
    return response

async def main():
    index = VectorStoreIndex.from_documents(docs)  # docs loaded elsewhere
    query_engine = index.as_query_engine()
    result = await handle_query(query_engine, "What does the policy say?")
    print(result)

asyncio.run(main())
```
If your chain uses `Workflow`, `AgentWorkflow`, or custom tools, the same rule applies: do not call blocking methods from inside an async step. That is how you end up with hangs that only appear once concurrency increases.
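When a dependency only exposes a sync API, you can still keep the loop responsive by offloading the call to a worker thread. A minimal stdlib sketch using `asyncio.to_thread` (Python 3.9+), where `blocking_retrieval` is a hypothetical stand-in for the sync-only component:

```python
import asyncio
import time

def blocking_retrieval(question: str) -> str:
    # Hypothetical stand-in for a sync-only component; time.sleep blocks
    # the calling thread just like a sync network call would.
    time.sleep(0.2)
    return f"answer to {question!r}"

async def handle_query(question: str) -> str:
    # Offload the blocking call to a worker thread so the event loop
    # stays free to serve other coroutines in the meantime.
    return await asyncio.to_thread(blocking_retrieval, question)

async def main():
    start = time.monotonic()
    # Both queries overlap instead of serializing on the event loop.
    results = await asyncio.gather(handle_query("a"), handle_query("b"))
    elapsed = time.monotonic() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
```

If the two calls serialized on the loop they would take about 0.4s; offloaded, they overlap and finish in roughly 0.2s.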
## Other Possible Causes

### 1. No timeout on the LLM or embedding client

If the upstream model stalls, LlamaIndex will wait indefinitely unless you set explicit timeouts.

```python
# Example kwargs for OpenAI-style clients used by LlamaIndex integrations
llm_kwargs = {
    "timeout": 30,     # seconds before the request is abandoned
    "max_retries": 2,  # bounded retries instead of waiting forever
}
```
If you are using an HTTP-based embedding backend, set a similar timeout there too. A missing timeout often shows up as a request that never completes under peak traffic.
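The same fail-fast idea can also be enforced at the call site. A sketch of a generic helper (the names are illustrative, not LlamaIndex APIs) that combines a per-attempt deadline with bounded retries:

```python
import asyncio

async def call_with_deadline(coro_factory, timeout: float, max_retries: int):
    # Illustrative helper (not a LlamaIndex API): retry a stage a bounded
    # number of times, failing fast instead of hanging on a stalled upstream.
    last_exc = None
    for _attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(coro_factory(), timeout=timeout)
        except asyncio.TimeoutError as exc:
            last_exc = exc
    raise last_exc

async def stalled_llm_call() -> str:
    await asyncio.sleep(10)  # simulates a provider that never answers
    return "never reached"

try:
    asyncio.run(call_with_deadline(stalled_llm_call, timeout=0.05, max_retries=1))
    outcome = "completed"
except asyncio.TimeoutError:
    outcome = "timed out"
```

The stalled call now surfaces as a `TimeoutError` after two short attempts instead of blocking the whole chain.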
### 2. Shared mutable state in callbacks or memory

If multiple requests mutate the same memory object, chat store, or callback handler, one request can block or corrupt another.

```python
from llama_index.core.memory import ChatMemoryBuffer

# BROKEN: shared global state; every session writes into one buffer
memory = ChatMemoryBuffer.from_defaults()

def build_chat_engine(index):
    return index.as_chat_engine(memory=memory)
```

```python
from llama_index.core.memory import ChatMemoryBuffer

# FIXED: isolate per-session state
def build_chat_engine(index):
    memory = ChatMemoryBuffer.from_defaults()
    return index.as_chat_engine(memory=memory)
```
This matters in multi-user systems. A single global ChatMemoryBuffer, retriever cache, or custom callback handler can become a hidden bottleneck.
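One way to enforce that isolation in a multi-user app is a per-session factory. A framework-agnostic sketch, with `SessionMemory` as a hypothetical stand-in for whatever memory object you actually use:

```python
from dataclasses import dataclass, field

@dataclass
class SessionMemory:
    # Hypothetical stand-in for a per-session chat memory buffer.
    messages: list = field(default_factory=list)

_sessions: dict[str, SessionMemory] = {}

def memory_for(session_id: str) -> SessionMemory:
    # Each session id maps to its own buffer; nothing is shared across users.
    return _sessions.setdefault(session_id, SessionMemory())

memory_for("alice").messages.append("hi")
memory_for("bob").messages.append("hello")
```

Each caller gets a stable, private object; two sessions never contend on the same buffer.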
### 3. Over-aggressive concurrency during retrieval

If you fan out too many retrieval calls at once, your vector DB or embedding service can throttle and stall.

```python
# Too much fan-out: every query pulls 50 chunks at once
retriever = index.as_retriever(similarity_top_k=50)
```

Tune down the fan-out and batch sizes:

```python
retriever = index.as_retriever(similarity_top_k=5)
```

If you are using a local SentenceTransformer model or remote embeddings, also reduce the embedding batch size. Large batches look efficient until they saturate CPU or network I/O and freeze the pipeline.
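If you do need the fan-out, you can still bound it. A stdlib-only sketch that caps concurrent upstream calls with `asyncio.Semaphore`; the `stats` dict exists only to demonstrate that the cap holds:

```python
import asyncio

async def embed_chunk(text: str, sem: asyncio.Semaphore, stats: dict) -> int:
    # The semaphore caps how many simulated embedding calls run at once.
    async with sem:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # simulated remote embedding call
        stats["in_flight"] -= 1
        return len(text)

async def main():
    sem = asyncio.Semaphore(3)  # at most 3 concurrent upstream calls
    stats = {"in_flight": 0, "peak": 0}
    texts = [f"chunk-{i}" for i in range(10)]
    results = await asyncio.gather(*(embed_chunk(t, sem, stats) for t in texts))
    return stats["peak"], results

peak, results = asyncio.run(main())
```

All ten chunks are processed, but never more than three at a time, so the upstream service sees a steady, bounded load.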
### 4. Recursive tool calls in agents

An agent can keep calling tools without reaching a stop condition. In LlamaIndex this often surfaces as repeated tool execution before the final answer ever arrives.

```python
from llama_index.core.agent import ReActAgent

# Check your agent settings carefully
agent = ReActAgent.from_tools(
    tools=tools,
    max_iterations=5,  # hard cap on the reason/act loop
)
```

If `max_iterations` is too high and your tool descriptions are vague, the agent may loop instead of finishing. That looks like "stuck when scaling" because longer context and more tools increase the chance of bad routing.
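The underlying safeguard is a hard cap on the loop, independent of any framework. A minimal sketch where `step` is a hypothetical stand-in for one reason/act cycle:

```python
def run_agent(step, max_iterations: int = 5):
    # `step` stands in for one reason/act cycle and returns (answer, done).
    # A hard iteration cap turns a silent infinite loop into a visible error.
    for i in range(max_iterations):
        answer, done = step(i)
        if done:
            return answer
    raise RuntimeError(f"agent did not finish within {max_iterations} iterations")

# A tool-calling policy that never terminates now raises instead of hanging.
try:
    run_agent(lambda i: ("still thinking", False), max_iterations=3)
    status = "finished"
except RuntimeError:
    status = "capped"
```

A raised error is debuggable; a hang under load is not, which is why the cap belongs in the loop rather than in monitoring.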
## How to Debug It

**Turn on verbose logging**

- Enable LlamaIndex debug logs and inspect where execution stops.
- Look for classes like `RetrieverQueryEngine`, `ResponseSynthesizer`, `ReActAgent`, or `Workflow`.

**Add hard timeouts around each stage**

- Wrap retrieval and generation separately.
- If retrieval times out but generation does not start, your bottleneck is upstream.

```python
import asyncio

response = await asyncio.wait_for(query_engine.aquery("test"), timeout=20)
```

**Reduce to one request and one document**

- Test with a single document and `similarity_top_k=1`.
- If it works locally but fails under load, the issue is concurrency or shared state.

**Swap components one by one**

- Replace your vector store with an in-memory index.
- Replace your custom LLM wrapper with a known-good provider.
- The failing component usually becomes obvious fast.
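LlamaIndex emits its internal logs through Python's standard `logging` module, so a root-level DEBUG configuration is usually enough to see the last component that ran before a hang:

```python
import logging
import sys

# Route all library logs to stdout at DEBUG level; force=True (Python 3.8+)
# replaces any handlers configured earlier so the setting takes effect.
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, force=True)

log = logging.getLogger("pipeline")
log.debug("retrieval starting")  # the last message before a hang points at the culprit
```

When the chain stalls, the final DEBUG line names the stage that never returned.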
## Prevention

- Use async consistently: if your app entrypoint is async, use `aquery()`, `achat()`, and async tool functions all the way through.
- Set explicit timeouts on LLMs, embeddings, HTTP clients, and workflow steps.
- Avoid global mutable objects like shared memory buffers, retrievers, caches, and callback handlers in web apps.
If you want this class of bug to stay out of production, treat every LlamaIndex chain like distributed systems code. Once you scale beyond one request at a time, hangs are usually coordination bugs — not model bugs.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.