LlamaIndex Tutorial (Python): adding cost tracking for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to add token-based cost tracking to a LlamaIndex Python workflow so every LLM call is measured, logged, and attributed to the right part of your application. You need this when you’re building agentic or retrieval-heavy systems and want hard numbers for model spend instead of guessing from vendor dashboards.

What You'll Need

  • Python 3.10+
  • llama-index
  • An OpenAI API key in OPENAI_API_KEY
  • A shell environment for setting environment variables
  • Basic familiarity with VectorStoreIndex, QueryEngine, and LlamaIndex callbacks

Install the package:

pip install llama-index

Step-by-Step

  1. Start by wiring up a callback handler that tracks token usage and estimated cost per event. LlamaIndex exposes callback plumbing through its global settings, which makes this easy to apply across your app without changing every query call.
import os
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.llms.openai import OpenAI

if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY in your environment before running this script.")

token_handler = TokenCountingHandler()
Settings.callback_manager = CallbackManager([token_handler])
Settings.llm = OpenAI(model="gpt-4o-mini")
  2. Load some data and build an index normally. The point here is that cost tracking should be invisible to your retrieval logic; it should sit underneath the index and capture every completion call.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("Summarize the main risks mentioned in these documents.")
print(response)
  3. Read the token counts after the query finishes. For advanced use cases, this is where you can attach the numbers to request IDs, tenant IDs, or workflow spans before shipping them to your observability stack.
print("Prompt tokens:", token_handler.prompt_llm_token_count)
print("Completion tokens:", token_handler.completion_llm_token_count)
print("Total tokens:", token_handler.total_llm_token_count)
print("Embedding tokens:", token_handler.total_embedding_token_count)
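As a sketch of that attribution step, you might bundle the counters into a plain record keyed by request and tenant IDs before emitting it. The field names and `build_usage_record` helper below are illustrative, not a LlamaIndex API:

```python
import time

def build_usage_record(request_id: str, tenant_id: str,
                       prompt_tokens: int, completion_tokens: int,
                       embedding_tokens: int = 0) -> dict:
    """Bundle token counters with request metadata for downstream logging."""
    return {
        "request_id": request_id,
        "tenant_id": tenant_id,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "embedding_tokens": embedding_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "timestamp": time.time(),
    }

# In your app, the token arguments would come from token_handler after a query:
record = build_usage_record("req-123", "tenant-a",
                            prompt_tokens=842, completion_tokens=117)
print(record["total_tokens"])  # 959
```

From here the record can go to structured logs, a queue, or a metrics pipeline without the rest of your query code knowing about cost tracking at all.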
  4. If you want actual dollar estimates, compute them from model pricing yourself. LlamaIndex gives you the usage data; your app should own the pricing table because vendor rates change and different models have different input/output costs.
def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    input_rate = 0.15 / 1_000_000   # example: $0.15 per 1M input tokens
    output_rate = 0.60 / 1_000_000  # example: $0.60 per 1M output tokens
    return (prompt_tokens * input_rate) + (completion_tokens * output_rate)

estimated = estimate_cost(
    token_handler.prompt_llm_token_count,
    token_handler.completion_llm_token_count,
)
print(f"Estimated cost: ${estimated:.6f}")
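Extending that idea, the pricing table can live in one place keyed by model name. The rates below are illustrative placeholders, not live OpenAI pricing; in a real app, keep them in config you can update when vendors change their rates:

```python
# Per-model (input_rate, output_rate) in dollars per token.
# Example rates only -- verify against your provider's current pricing.
PRICING = {
    "gpt-4o-mini": (0.15 / 1_000_000, 0.60 / 1_000_000),
    "gpt-4o":      (2.50 / 1_000_000, 10.00 / 1_000_000),
}

def estimate_cost_for(model: str, prompt_tokens: int,
                      completion_tokens: int) -> float:
    """Look up the model's rates and apply them to the token counts."""
    input_rate, output_rate = PRICING[model]
    return prompt_tokens * input_rate + completion_tokens * output_rate

print(f"${estimate_cost_for('gpt-4o-mini', 1_000, 500):.6f}")  # $0.000450
```

Raising a clear error (or falling back to a conservative default rate) for unknown model names is worth adding before this goes anywhere near billing.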
  5. Reset counters between requests if you’re serving traffic in a web app or agent loop. Without this, one user’s usage bleeds into the next request and your accounting becomes useless.
token_handler.reset_counts()

response = query_engine.query("List any compliance issues mentioned.")
print(response)

print("Prompt tokens after reset:", token_handler.prompt_llm_token_count)
print("Completion tokens after reset:", token_handler.completion_llm_token_count)
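One way to make that reset-then-read pattern hard to forget is a small context manager. The `measure_usage` helper below is a hypothetical sketch, and `FakeHandler` is a stand-in for the real `TokenCountingHandler` so the snippet runs on its own; it works with any object that has `reset_counts()` and the same token-count attributes:

```python
from contextlib import contextmanager

@contextmanager
def measure_usage(handler):
    """Reset the handler, run the request body, then capture the totals."""
    handler.reset_counts()
    usage = {}
    try:
        yield usage
    finally:
        usage["prompt_tokens"] = handler.prompt_llm_token_count
        usage["completion_tokens"] = handler.completion_llm_token_count

# Stand-in for TokenCountingHandler so this sketch is self-contained:
class FakeHandler:
    def __init__(self):
        self.prompt_llm_token_count = 0
        self.completion_llm_token_count = 0
    def reset_counts(self):
        self.prompt_llm_token_count = 0
        self.completion_llm_token_count = 0

handler = FakeHandler()
with measure_usage(handler) as usage:
    # A real query_engine.query(...) call would update these counters.
    handler.prompt_llm_token_count = 312
    handler.completion_llm_token_count = 54
print(usage)
```

In the real app, pass `token_handler` instead of `FakeHandler` and run the query inside the `with` block; each request then gets a clean, isolated usage snapshot.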

Testing It

Run two different queries back-to-back and compare the counters before and after reset_counts(). You should see totals increase after each query, then return to zero once reset.

If you’re using real documents in data/, make sure the index builds successfully and that the query produces a non-empty response. Then check that prompt and completion token counts are both non-zero for at least one query.

For production validation, wrap each request in a request-scoped object and log:

  • user ID or tenant ID
  • query text hash
  • prompt tokens
  • completion tokens
  • estimated cost

That gives you an audit trail you can reconcile against provider billing later.
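A request-scoped record covering those fields can be as simple as a dataclass. This is a sketch under the assumption that you hash the query text rather than log it raw (to keep user content out of logs); the names here are illustrative:

```python
import hashlib
from dataclasses import dataclass, asdict

@dataclass
class UsageRecord:
    user_id: str
    query_hash: str        # hash of the query text, never the raw text
    prompt_tokens: int
    completion_tokens: int
    estimated_cost: float

def make_record(user_id: str, query: str, prompt_tokens: int,
                completion_tokens: int, estimated_cost: float) -> UsageRecord:
    """Build an audit record with a truncated SHA-256 hash of the query."""
    query_hash = hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]
    return UsageRecord(user_id, query_hash, prompt_tokens,
                       completion_tokens, estimated_cost)

rec = make_record("tenant-a", "Summarize the main risks.", 842, 117, 0.00019)
print(asdict(rec))
```

Serializing with `asdict` makes the record easy to ship as JSON to whatever log or billing store you reconcile against later.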

Next Steps

  • Add a custom callback handler to emit usage metrics into Prometheus, Datadog, or OpenTelemetry.
  • Track costs per tool call inside agent workflows, not just per top-level query.
  • Store per-request usage records in Postgres so finance and engineering can both inspect spend by tenant, feature, or route.
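For the metrics-emission idea above, one lightweight option is rendering the counters yourself in the Prometheus text exposition format. The metric names below are made up for illustration; a real deployment would typically use the prometheus_client library or an OpenTelemetry exporter instead:

```python
def render_prometheus(usage: dict, labels: dict) -> str:
    """Render token counters as Prometheus exposition-format lines."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return "\n".join(
        f"llm_{name}{{{label_str}}} {value}"
        for name, value in usage.items()
    )

print(render_prometheus(
    {"prompt_tokens": 842, "completion_tokens": 117},
    {"tenant": "tenant-a", "model": "gpt-4o-mini"},
))
```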


By Cyprian Aarons, AI Consultant at Topiax.
