# How to Fix "chain execution stuck when scaling" in LlamaIndex (Python)
When "chain execution stuck when scaling" shows up in LlamaIndex, it usually means your pipeline is not dead: it is blocked. In practice, this happens when you scale from a single local run to multiple requests, larger indexes, or async execution, and one part of the chain never returns.

The usual trigger is a mix of long-running retrieval, nested async calls, shared mutable state, or an LLM client waiting forever on a network call. If you are seeing `WorkflowRuntimeError`, `asyncio.TimeoutError`, or a chain that simply hangs after `RetrieverQueryEngine.query()`, this is where to look.
## The Most Common Cause

The #1 cause is mixing sync and async incorrectly inside a LlamaIndex chain.

A common broken pattern is calling blocking code inside an async workflow, or forgetting to `await` an async LlamaIndex call. Under load this looks like the chain being "stuck", but the real issue is that the event loop is blocked or a coroutine is never awaited.

### Broken vs fixed
| Broken pattern | Fixed pattern |
|---|---|
| Calls sync code inside async path | Uses proper async APIs end-to-end |
| Creates a new client per request | Reuses configured async client |
| Hides the hang behind retries | Fails fast with timeout handling |
```python
# BROKEN
import asyncio

from llama_index.core import VectorStoreIndex

async def handle_query(query_engine, question: str):
    # query() is sync; calling it inside async code blocks the event loop
    response = query_engine.query(question)
    return response

async def main():
    index = VectorStoreIndex.from_documents(docs)  # docs loaded elsewhere
    query_engine = index.as_query_engine()
    result = await handle_query(query_engine, "What does the policy say?")
    print(result)

asyncio.run(main())
```
```python
# FIXED
import asyncio

from llama_index.core import VectorStoreIndex

async def handle_query(query_engine, question: str):
    # Use the async API if you are already in async code
    response = await query_engine.aquery(question)
    return response

async def main():
    index = VectorStoreIndex.from_documents(docs)  # docs loaded elsewhere
    query_engine = index.as_query_engine()
    result = await handle_query(query_engine, "What does the policy say?")
    print(result)

asyncio.run(main())
```
If your chain uses `Workflow`, `AgentWorkflow`, or custom tools, the same rule applies: do not call blocking methods from inside an async step. That is how you end up with hangs that only appear once concurrency increases.
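When a dependency only exposes a sync API, you can still keep the loop responsive by offloading the call to a worker thread. A minimal stdlib sketch using `asyncio.to_thread` (Python 3.9+), where `blocking_retrieval` is a hypothetical stand-in for the sync-only component:

```python
import asyncio
import time

def blocking_retrieval(question: str) -> str:
    # Hypothetical stand-in for a sync-only component; time.sleep blocks
    # the calling thread just like a sync network call would.
    time.sleep(0.2)
    return f"answer to {question!r}"

async def handle_query(question: str) -> str:
    # Offload the blocking call to a worker thread so the event loop
    # stays free to serve other coroutines in the meantime.
    return await asyncio.to_thread(blocking_retrieval, question)

async def main():
    start = time.monotonic()
    # Both queries overlap instead of serializing on the event loop.
    results = await asyncio.gather(handle_query("a"), handle_query("b"))
    elapsed = time.monotonic() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
```

If the two calls serialized on the loop they would take about 0.4s; offloaded, they overlap and finish in roughly 0.2s.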
## Other Possible Causes

### 1. No timeout on the LLM or embedding client

If the upstream model stalls, LlamaIndex will wait indefinitely unless you set explicit timeouts.

```python
# Example kwargs for OpenAI-style clients used by LlamaIndex integrations
llm_kwargs = {
    "timeout": 30,     # seconds before the request is abandoned
    "max_retries": 2,  # bounded retries instead of waiting forever
}
```
If you are using an HTTP-based embedding backend, set a similar timeout there too. A missing timeout often shows up as a request that never completes under peak traffic.
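The same fail-fast idea can also be enforced at the call site. A sketch of a generic helper (the names are illustrative, not LlamaIndex APIs) that combines a per-attempt deadline with bounded retries:

```python
import asyncio

async def call_with_deadline(coro_factory, timeout: float, max_retries: int):
    # Illustrative helper (not a LlamaIndex API): retry a stage a bounded
    # number of times, failing fast instead of hanging on a stalled upstream.
    last_exc = None
    for _attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(coro_factory(), timeout=timeout)
        except asyncio.TimeoutError as exc:
            last_exc = exc
    raise last_exc

async def stalled_llm_call() -> str:
    await asyncio.sleep(10)  # simulates a provider that never answers
    return "never reached"

try:
    asyncio.run(call_with_deadline(stalled_llm_call, timeout=0.05, max_retries=1))
    outcome = "completed"
except asyncio.TimeoutError:
    outcome = "timed out"
```

The stalled call now surfaces as a `TimeoutError` after two short attempts instead of blocking the whole chain.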
### 2. Shared mutable state in callbacks or memory

If multiple requests mutate the same memory object, chat store, or callback handler, one request can block or corrupt another.

```python
from llama_index.core.memory import ChatMemoryBuffer

# BROKEN: shared global state; every session writes into one buffer
memory = ChatMemoryBuffer.from_defaults()

def build_chat_engine(index):
    return index.as_chat_engine(memory=memory)
```

```python
from llama_index.core.memory import ChatMemoryBuffer

# FIXED: isolate per-session state
def build_chat_engine(index):
    memory = ChatMemoryBuffer.from_defaults()
    return index.as_chat_engine(memory=memory)
```
This matters in multi-user systems. A single global ChatMemoryBuffer, retriever cache, or custom callback handler can become a hidden bottleneck.
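One way to enforce that isolation in a multi-user app is a per-session factory. A framework-agnostic sketch, with `SessionMemory` as a hypothetical stand-in for whatever memory object you actually use:

```python
from dataclasses import dataclass, field

@dataclass
class SessionMemory:
    # Hypothetical stand-in for a per-session chat memory buffer.
    messages: list = field(default_factory=list)

_sessions: dict[str, SessionMemory] = {}

def memory_for(session_id: str) -> SessionMemory:
    # Each session id maps to its own buffer; nothing is shared across users.
    return _sessions.setdefault(session_id, SessionMemory())

memory_for("alice").messages.append("hi")
memory_for("bob").messages.append("hello")
```

Each caller gets a stable, private object; two sessions never contend on the same buffer.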
### 3. Over-aggressive concurrency during retrieval

If you fan out too many retrieval calls at once, your vector DB or embedding service can throttle and stall.

```python
# Too much fan-out: every query pulls 50 chunks at once
retriever = index.as_retriever(similarity_top_k=50)
```

Tune down the fan-out and batch sizes:

```python
retriever = index.as_retriever(similarity_top_k=5)
```

If you are using a local SentenceTransformer model or remote embeddings, also reduce the embedding batch size. Large batches look efficient until they saturate CPU or network I/O and freeze the pipeline.
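If you do need the fan-out, you can still bound it. A stdlib-only sketch that caps concurrent upstream calls with `asyncio.Semaphore`; the `stats` dict exists only to demonstrate that the cap holds:

```python
import asyncio

async def embed_chunk(text: str, sem: asyncio.Semaphore, stats: dict) -> int:
    # The semaphore caps how many simulated embedding calls run at once.
    async with sem:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # simulated remote embedding call
        stats["in_flight"] -= 1
        return len(text)

async def main():
    sem = asyncio.Semaphore(3)  # at most 3 concurrent upstream calls
    stats = {"in_flight": 0, "peak": 0}
    texts = [f"chunk-{i}" for i in range(10)]
    results = await asyncio.gather(*(embed_chunk(t, sem, stats) for t in texts))
    return stats["peak"], results

peak, results = asyncio.run(main())
```

All ten chunks are processed, but never more than three at a time, so the upstream service sees a steady, bounded load.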
### 4. Recursive tool calls in agents

An agent can keep calling tools without reaching a stop condition. In LlamaIndex this often surfaces as repeated tool execution before the final answer ever arrives.

```python
from llama_index.core.agent import ReActAgent

# Check your agent settings carefully
agent = ReActAgent.from_tools(
    tools=tools,
    max_iterations=5,  # hard cap on the reason/act loop
)
```

If `max_iterations` is too high and your tool descriptions are vague, the agent may loop instead of finishing. That looks like "stuck when scaling" because longer context and more tools increase the chance of bad routing.
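The underlying safeguard is a hard cap on the loop, independent of any framework. A minimal sketch where `step` is a hypothetical stand-in for one reason/act cycle:

```python
def run_agent(step, max_iterations: int = 5):
    # `step` stands in for one reason/act cycle and returns (answer, done).
    # A hard iteration cap turns a silent infinite loop into a visible error.
    for i in range(max_iterations):
        answer, done = step(i)
        if done:
            return answer
    raise RuntimeError(f"agent did not finish within {max_iterations} iterations")

# A tool-calling policy that never terminates now raises instead of hanging.
try:
    run_agent(lambda i: ("still thinking", False), max_iterations=3)
    status = "finished"
except RuntimeError:
    status = "capped"
```

A raised error is debuggable; a hang under load is not, which is why the cap belongs in the loop rather than in monitoring.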
## How to Debug It

**Turn on verbose logging**

- Enable LlamaIndex debug logs and inspect where execution stops.
- Look for classes like `RetrieverQueryEngine`, `ResponseSynthesizer`, `ReActAgent`, or `Workflow`.

**Add hard timeouts around each stage**

- Wrap retrieval and generation separately.
- If retrieval times out but generation does not start, your bottleneck is upstream.

```python
import asyncio

response = await asyncio.wait_for(query_engine.aquery("test"), timeout=20)
```

**Reduce to one request and one document**

- Test with a single document and `similarity_top_k=1`.
- If it works locally but fails under load, the issue is concurrency or shared state.

**Swap components one by one**

- Replace your vector store with an in-memory index.
- Replace your custom LLM wrapper with a known-good provider.
- The failing component usually becomes obvious fast.
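LlamaIndex emits its internal logs through Python's standard `logging` module, so a root-level DEBUG configuration is usually enough to see the last component that ran before a hang:

```python
import logging
import sys

# Route all library logs to stdout at DEBUG level; force=True (Python 3.8+)
# replaces any handlers configured earlier so the setting takes effect.
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, force=True)

log = logging.getLogger("pipeline")
log.debug("retrieval starting")  # the last message before a hang points at the culprit
```

When the chain stalls, the final DEBUG line names the stage that never returned.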
## Prevention

- Use async consistently: if your app entrypoint is async, use `aquery()`, `achat()`, and async tool functions all the way through.
- Set explicit timeouts on LLMs, embeddings, HTTP clients, and workflow steps.
- Avoid global mutable objects like shared memory buffers, retrievers, caches, and callback handlers in web apps.
If you want this class of bug to stay out of production, treat every LlamaIndex chain like distributed systems code. Once you scale beyond one request at a time, hangs are usually coordination bugs — not model bugs.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.