# How to Fix 'callback not firing when scaling' in LlamaIndex (Python)
When "callback not firing when scaling" shows up in a LlamaIndex Python app, it usually means your callback handler is attached in one execution path but not the one actually doing the work. In practice, this happens when you move from a single local query to async, multi-worker, or distributed execution and assume the same callback chain is still intact.
The symptom is usually one of these:

- `CallbackManager` events never reach your handler
- `on_event_start`/`on_event_end` never fire
- traces work locally, then disappear once you add concurrency or scale out
## The Most Common Cause
The #1 cause is creating the CallbackManager or handler in one place, but running the index/query in another process, thread, or instance that never received it.
This shows up a lot with:

- `asyncio.gather(...)`
- FastAPI background tasks
- Celery / RQ workers
- multiple Uvicorn/Gunicorn workers
- object serialization across process boundaries
### Broken pattern vs fixed pattern
| Broken | Fixed |
|---|---|
| callback handler created in the main process only | callback handler passed into every worker/task |
| global state assumed to survive scaling | explicit dependency injection |
| index built with one manager, queried with another | same manager instance used consistently |
```python
# BROKEN: callback exists only in the parent process
from llama_index.core import VectorStoreIndex, Settings
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler

debug_handler = LlamaDebugHandler()
Settings.callback_manager = CallbackManager([debug_handler])

def build_index(docs):
    # In a worker/process boundary, this may not inherit Settings as expected
    return VectorStoreIndex.from_documents(docs)

def run_query(index):
    # You expect callbacks here, but they may never hit debug_handler
    return index.as_query_engine().query("What is in the docs?")
```
```python
# FIXED: pass callback manager explicitly through the code path
from llama_index.core import VectorStoreIndex
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler

debug_handler = LlamaDebugHandler()
callback_manager = CallbackManager([debug_handler])

def build_index(docs):
    return VectorStoreIndex.from_documents(
        docs,
        callback_manager=callback_manager,
    )

def run_query(index):
    query_engine = index.as_query_engine(callback_manager=callback_manager)
    return query_engine.query("What is in the docs?")
```
If you are using workers, do not rely on module-level globals. Instantiate the handler inside the worker startup path or pass config into each task.
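One way to do that is a small factory that builds the handler lazily inside whichever process calls it. This is a framework-agnostic sketch with a stand-in handler class (the `get_handler` and `run_task` names are illustrative; in a real app you would build a `CallbackManager([LlamaDebugHandler()])` here and call the factory from your worker's startup hook or task body):

```python
import os

class StubHandler:
    """Stand-in for a real handler such as LlamaDebugHandler."""
    def __init__(self):
        self.created_in_pid = os.getpid()
        self.events = []

_handler = None  # per-process cache: each worker process builds its own

def get_handler() -> StubHandler:
    """Build the handler lazily in the calling process, then reuse it."""
    global _handler
    if _handler is None:
        _handler = StubHandler()
    return _handler

def run_task(question: str) -> int:
    handler = get_handler()  # always the instance owned by *this* process
    handler.events.append(("query", question))
    return len(handler.events)
```

Because the cache is module-level and lazily filled, a forked or spawned worker that never ran the parent's setup code still gets a working handler the first time it executes a task.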
## Other Possible Causes

### 1) You attached the handler to the wrong object
Some LlamaIndex components accept callbacks at construction time, but others need them on the query engine or retriever.
```python
# BROKEN
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
query_engine.query("foo")  # no callback manager attached here
```

```python
# FIXED
query_engine = index.as_query_engine(callback_manager=callback_manager)
query_engine.query("foo")
```
### 2) Async code is swallowing exceptions before callbacks complete
If an async task fails early, you may see partial events or none at all.
```python
# BROKEN
tasks = [engine.aquery(q) for q in queries]
results = await asyncio.gather(*tasks)  # one exception can short-circuit visibility
```

```python
# FIXED
import logging

logger = logging.getLogger(__name__)

results = await asyncio.gather(*tasks, return_exceptions=True)
for result in results:
    if isinstance(result, Exception):
        logger.error("Query failed", exc_info=result)
```
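This failure mode can be reproduced without LlamaIndex at all. A minimal sketch with plain `asyncio` and a hypothetical `fake_query` coroutine standing in for `engine.aquery`:

```python
import asyncio

async def fake_query(q: str) -> str:
    # Stand-in for engine.aquery: one input fails to simulate a bad query
    if q == "bad":
        raise ValueError("boom")
    return f"answer:{q}"

async def main() -> list:
    tasks = [fake_query(q) for q in ["a", "bad", "b"]]
    # return_exceptions=True keeps every task's outcome visible instead of
    # letting the first exception hide the rest
    return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(main())
print([type(r).__name__ for r in results])  # -> ['str', 'ValueError', 'str']
```

With `return_exceptions=True` the failing task shows up as a `ValueError` in the results list, in order, while the other two queries still complete and emit their events.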
### 3) Your custom callback handler does not implement the right methods

In LlamaIndex, custom handlers must subclass `BaseCallbackHandler` and implement its abstract hooks: `on_event_start`, `on_event_end`, `start_trace`, and `end_trace`. If your class skips those hooks (or the base initializer), nothing useful gets recorded.
```python
# BROKEN
from llama_index.core.callbacks.base import BaseCallbackHandler

class MyHandler(BaseCallbackHandler):
    def __init__(self):
        pass  # missing event methods / required init behavior
```

```python
# FIXED
from llama_index.core.callbacks import CBEventType
from llama_index.core.callbacks.base import BaseCallbackHandler

class MyHandler(BaseCallbackHandler):
    def __init__(self):
        # The base class requires these ignore lists
        super().__init__(event_starts_to_ignore=[], event_ends_to_ignore=[])

    def on_event_start(self, event_type: CBEventType, payload=None, event_id="", parent_id="", **kwargs):
        print(f"start: {event_type}")
        return event_id

    def on_event_end(self, event_type: CBEventType, payload=None, event_id="", **kwargs):
        print(f"end: {event_type}")

    def start_trace(self, trace_id=None):
        pass

    def end_trace(self, trace_id=None, trace_map=None):
        pass
```
### 4) Worker processes don’t share memory with your main app

This is common with Gunicorn/Uvicorn workers. Each worker has its own Python process and its own Settings.

```shell
# Example: multiple workers mean multiple independent callback contexts
gunicorn app:app --workers 4
```

If your debug handler lives only in worker 1 during startup, workers 2/3/4 will never use it unless they initialize their own copy.
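With Gunicorn specifically, one option is a `post_fork` server hook in `gunicorn.conf.py`, which runs once inside each freshly forked worker. A config sketch, using the same LlamaIndex imports as the examples earlier in this article:

```python
# gunicorn.conf.py (sketch): wire callbacks inside every worker
def post_fork(server, worker):
    # Runs in the child process right after fork, so each worker builds
    # its own handler instead of relying on state from the parent
    from llama_index.core import Settings
    from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler

    Settings.callback_manager = CallbackManager([LlamaDebugHandler()])
    server.log.info("callback manager initialized in worker %s", worker.pid)
```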
## How to Debug It

1. Confirm where the callback manager is attached.
   - Print it right before query execution.
   - Verify both index creation and query execution use the same instance.

   ```python
   from llama_index.core import Settings

   print("settings cb:", Settings.callback_manager)
   print("engine cb:", getattr(query_engine, "callback_manager", None))
   ```

2. Use `LlamaDebugHandler` first.
   - Strip out custom handlers.
   - If `LlamaDebugHandler` works but yours doesn’t, your implementation is wrong.
   - If neither works, your problem is attachment or process isolation.

3. Check for process boundaries.
   - Look for Celery tasks, background jobs, async queues, or multiple web workers.
   - If code runs outside the request process that set `Settings.callback_manager`, global state will not follow.

4. Force a minimal synchronous repro.
   - One document.
   - One query.
   - One process.
   - No background tasks.
If callbacks fire there but fail under load, you have a scaling/context propagation issue rather than a LlamaIndex bug.
## Prevention

- Attach callbacks explicitly to each component that emits events:
  - index creation
  - retriever construction
  - query engine creation
- Avoid relying on global `Settings.callback_manager` across worker boundaries.
- Add a startup check that logs whether your active query engine has the expected callback manager before serving traffic.
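That startup check can be a few lines. A minimal sketch (the `assert_callbacks_wired` helper is hypothetical; it assumes the engine exposes a `callback_manager` attribute, as LlamaIndex query engines do):

```python
import logging

logger = logging.getLogger(__name__)

def assert_callbacks_wired(query_engine, expected_manager) -> None:
    """Fail fast if the engine is not using the callback manager we expect."""
    actual = getattr(query_engine, "callback_manager", None)
    if actual is not expected_manager:
        raise RuntimeError(
            f"callback manager mismatch: expected {expected_manager!r}, got {actual!r}"
        )
    logger.info("callback manager wired: %r", actual)
```

Call it once at startup, before serving the first request, against the engine instance you actually serve.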
If you want this to survive production load, treat callback wiring like database config: explicit per execution path, not ambient state. That’s the difference between “works on my laptop” and observability that still works at scale.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies

By Cyprian Aarons, AI Consultant at Topiax.