LlamaIndex Tutorial (Python): adding observability for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows how to add observability to a LlamaIndex Python app so you can trace retrieval, LLM calls, and response quality end-to-end. You need this when a query returns the wrong answer, latency spikes, or you need auditability for production debugging.

What You'll Need

  • Python 3.10+
  • llama-index
  • llama-index-core
  • llama-index-llms-openai
  • llama-index-embeddings-openai
  • An OpenAI API key
  • A LlamaCloud/LlamaIndex observability account if you want hosted tracing
  • Optional: trulens, phoenix, or your own OpenTelemetry stack if you want to export traces elsewhere

Step-by-Step

  1. Install the packages and set your API key. Keep the dependencies explicit so you can pin versions in production and avoid surprise breakage.
pip install llama-index llama-index-core llama-index-llms-openai llama-index-embeddings-openai
export OPENAI_API_KEY="your-openai-api-key"
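Once the app works, one common way to make those pins explicit is to freeze the resolved versions into a requirements file and install from it later; the file name here is just an example.
pip freeze > requirements.txt        # capture the exact resolved versions
pip install -r requirements.txt      # reproduce the same environment later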
  2. Build a minimal index first, then add instrumentation around it. The important part is not the index itself; it’s making sure every retrieval and synthesis step is traceable.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the main risk factors.")
print(response)
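Even before you add callbacks, the response object itself carries trace-worthy data. A minimal sketch (attribute names follow recent llama-index-core releases) that prints which chunks were retrieved and their similarity scores:
# Inspect the retrieved chunks that were fed into response synthesis.
for source in response.source_nodes:
    score = source.score if source.score is not None else float("nan")
    print(f"score={score:.3f} file={source.node.metadata.get('file_name')}")
    print(source.node.get_content()[:200])  # first 200 characters of the chunk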
  3. Turn on LlamaIndex callback tracing. This gives you structured spans for indexing, retrieval, and LLM calls without changing your application logic.
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

debug_handler = LlamaDebugHandler(print_trace_on_end=True)
Settings.callback_manager = CallbackManager([debug_handler])
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("What are the policy exclusions?"))
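Beyond the trace printed at the end of each run, the debug handler keeps the recorded events in memory, so you can pull them out programmatically. A small sketch, assuming the get_event_pairs helper available on LlamaDebugHandler in current llama-index-core versions:
from llama_index.core.callbacks.schema import CBEventType

# Each pair is (start event, end event) for one span, e.g. one LLM call.
llm_pairs = debug_handler.get_event_pairs(CBEventType.LLM)
retrieve_pairs = debug_handler.get_event_pairs(CBEventType.RETRIEVE)
print(f"LLM calls: {len(llm_pairs)}, retrievals: {len(retrieve_pairs)}")

for start_event, end_event in llm_pairs:
    # Timestamps are strings; payloads hold the prompt/response captured at each end.
    print(start_event.time, "->", end_event.time)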
  4. Add a custom callback handler if you want application-specific observability. This is where you capture latency, token usage, tenant IDs, request IDs, or anything your ops team needs.
import time
from typing import Any, Dict, List, Optional
from llama_index.core.callbacks.base_handler import BaseCallbackHandler
from llama_index.core.callbacks.schema import CBEventType

class MetricsHandler(BaseCallbackHandler):
    def __init__(self) -> None:
        # The base class requires the lists of event types to ignore; pass empty lists to record everything.
        super().__init__(event_starts_to_ignore=[], event_ends_to_ignore=[])
        self.starts: Dict[str, float] = {}

    def on_event_start(self, event_type: CBEventType, payload: Optional[Dict[str, Any]] = None, event_id: str = "", parent_id: str = "", **kwargs: Any) -> str:
        # Record the wall-clock start time per event so the end hook can compute latency.
        self.starts[event_id] = time.time()
        return event_id

    def on_event_end(self, event_type: CBEventType, payload: Optional[Dict[str, Any]] = None, event_id: str = "", **kwargs: Any) -> None:
        start = self.starts.pop(event_id, None)
        if start is None:
            return
        elapsed_ms = (time.time() - start) * 1000
        print(f"{event_type.value}: {elapsed_ms:.1f} ms")

    def start_trace(self, trace_id: Optional[str] = None) -> None:
        pass

    def end_trace(self, trace_id: Optional[str] = None, trace_map: Optional[Dict[str, List[str]]] = None) -> None:
        pass

handler = MetricsHandler()
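If token usage is the main thing you care about, you may not need a fully custom handler: llama-index-core ships a TokenCountingHandler that plugs into the same callback manager. A rough sketch, with attribute names as in current releases:
from llama_index.core.callbacks import TokenCountingHandler

# Counts prompt, completion, and embedding tokens from LLM and embedding callback events.
token_counter = TokenCountingHandler()  # pass tokenizer=... if you want counts matched to your model

# Add token_counter to the CallbackManager list in the next step, run some queries,
# then read the accumulated totals:
print("prompt tokens:", token_counter.prompt_llm_token_count)
print("completion tokens:", token_counter.completion_llm_token_count)
print("embedding tokens:", token_counter.total_embedding_token_count)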
  5. Wire the custom handler into the same callback manager and run a query under a trace boundary. In production you would replace print() with structured logs or an export to your metrics backend.
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.callbacks import CallbackManager
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.callback_manager = CallbackManager([handler])
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

with Settings.callback_manager.as_trace("policy-qna"):
    answer = query_engine.query("What does the deductible cover?")
    print(answer)
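As a hedged sketch of what that swap could look like, the variant below subclasses the MetricsHandler from step 4, uses the standard logging module, and attaches the trace name captured in start_trace to every record. The logger name and extra field names are illustrative, and a structured (e.g. JSON) log formatter downstream is assumed.
import logging
import time
from typing import Any, Dict, Optional

logger = logging.getLogger("rag.observability")  # illustrative logger name

class LoggingMetricsHandler(MetricsHandler):
    """Same timing logic as MetricsHandler, but emits structured log records."""

    current_trace: Optional[str] = None

    def start_trace(self, trace_id: Optional[str] = None) -> None:
        # Remember the trace name passed to as_trace() so every event can carry it.
        self.current_trace = trace_id

    def on_event_end(self, event_type, payload: Optional[Dict[str, Any]] = None,
                     event_id: str = "", **kwargs: Any) -> None:
        start = self.starts.pop(event_id, None)
        if start is None:
            return
        logger.info(
            "rag_event",
            extra={
                "event_type": event_type.value,
                "elapsed_ms": round((time.time() - start) * 1000, 1),
                "trace_id": self.current_trace,  # illustrative field names
            },
        )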

Testing It

Run the script against a small local document set first so you can see clean traces without noise from unrelated content. You should see callback output for indexing and query execution, plus the final answer printed at the end.

If the debug handler is wired correctly, each run will emit trace information that lets you spot whether latency comes from embedding creation, retrieval fan-out, or response synthesis. If your custom handler is working, you’ll also see timing lines for each event type.

For production validation, send one known query with a stable expected answer and confirm that your logs contain the same trace ID across all related events. That makes it much easier to correlate user reports with backend behavior.

Next Steps

  • Export callback events to OpenTelemetry so traces show up in Datadog, Grafana Tempo, or Honeycomb (a minimal bridge sketch follows this list).
  • Add per-request metadata like tenant ID and document corpus version to your callback payloads.
  • Compare this with hosted observability tools like Phoenix or TruLens if you need evaluation dashboards on top of traces.
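To give a feel for the first bullet, here is a minimal sketch of a bridge handler that opens one OpenTelemetry span per callback event. It assumes the opentelemetry-sdk package is installed and an exporter is already configured; spans are kept flat here, so a real implementation would also propagate parent context to get proper nesting.
from typing import Any, Dict, List, Optional

from opentelemetry import trace
from llama_index.core.callbacks.base_handler import BaseCallbackHandler
from llama_index.core.callbacks.schema import CBEventType

tracer = trace.get_tracer("llamaindex.observability")  # illustrative tracer name

class OtelBridgeHandler(BaseCallbackHandler):
    """Opens an OpenTelemetry span for every LlamaIndex callback event."""

    def __init__(self) -> None:
        super().__init__(event_starts_to_ignore=[], event_ends_to_ignore=[])
        self._spans: Dict[str, trace.Span] = {}

    def on_event_start(self, event_type: CBEventType, payload: Optional[Dict[str, Any]] = None,
                       event_id: str = "", parent_id: str = "", **kwargs: Any) -> str:
        # One span per event; the event type (llm, retrieve, synthesize, ...) becomes the span name.
        self._spans[event_id] = tracer.start_span(event_type.value)
        return event_id

    def on_event_end(self, event_type: CBEventType, payload: Optional[Dict[str, Any]] = None,
                     event_id: str = "", **kwargs: Any) -> None:
        span = self._spans.pop(event_id, None)
        if span is not None:
            span.end()

    def start_trace(self, trace_id: Optional[str] = None) -> None:
        pass

    def end_trace(self, trace_id: Optional[str] = None,
                  trace_map: Optional[Dict[str, List[str]]] = None) -> None:
        pass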

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
