# LlamaIndex Tutorial (Python): adding cost tracking for advanced developers
This tutorial shows you how to add token-based cost tracking to a LlamaIndex Python workflow so every LLM call is measured, logged, and attributed to the right part of your application. You need this when you’re building agentic or retrieval-heavy systems and want hard numbers for model spend instead of guessing from vendor dashboards.
## What You'll Need
- Python 3.10+
- The `llama-index` package
- An OpenAI API key in `OPENAI_API_KEY`
- A shell environment for setting environment variables
- Basic familiarity with `VectorStoreIndex`, `QueryEngine`, and LlamaIndex callbacks
Install the package:
```bash
pip install llama-index
```
## Step-by-Step
- Start by wiring up a callback handler that counts tokens per LLM event; you'll turn those counts into cost estimates in a later step. LlamaIndex exposes callback plumbing through its global settings, which makes this easy to apply across your app without changing every query call.
```python
import os

from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.llms.openai import OpenAI

# OPENAI_API_KEY must already be set in your shell; fail fast if it is missing.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY before running this script.")

# Register the token counter globally so every LLM and embedding call is counted.
token_handler = TokenCountingHandler()
Settings.callback_manager = CallbackManager([token_handler])
Settings.llm = OpenAI(model="gpt-4o-mini")
```
- Load some data and build an index normally. The point here is that cost tracking should be invisible to your retrieval logic; it should sit underneath the index and capture every completion call.
```python
# Build an index over the files in ./data and query it as usual; the callback
# handler records token usage underneath without touching this code.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("Summarize the main risks mentioned in these documents.")
print(response)
```
- Read the token counts after the query finishes. For advanced use cases, this is where you can attach the numbers to request IDs, tenant IDs, or workflow spans before shipping them to your observability stack; a sketch of that follows the snippet below.
print("Prompt tokens:", token_handler.prompt_llm_token_count)
print("Completion tokens:", token_handler.completion_llm_token_count)
print("Total tokens:", token_handler.total_llm_token_count)
print("Embedding tokens:", token_handler.total_embedding_token_count)
- If you want actual dollar estimates, compute them from model pricing yourself. LlamaIndex gives you the usage data; your app should own the pricing table because vendor rates change and different models have different input/output costs. A per-model pricing-table sketch follows the snippet below.
```python
def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    input_rate = 0.15 / 1_000_000   # example: $0.15 per 1M input tokens
    output_rate = 0.60 / 1_000_000  # example: $0.60 per 1M output tokens
    return (prompt_tokens * input_rate) + (completion_tokens * output_rate)

estimated = estimate_cost(
    token_handler.prompt_llm_token_count,
    token_handler.completion_llm_token_count,
)
print(f"Estimated cost: ${estimated:.6f}")
```
- Reset counters between requests if you're serving traffic in a web app or agent loop. Without this, one user's usage bleeds into the next request and your accounting becomes useless. A request-scoped wrapper sketch follows the snippet below.
```python
# Zero the counters so the next request starts from a clean slate.
token_handler.reset_counts()

response = query_engine.query("List any compliance issues mentioned.")
print(response)

# These now reflect only the second query, not the cumulative session.
print("Prompt tokens after reset:", token_handler.prompt_llm_token_count)
print("Completion tokens after reset:", token_handler.completion_llm_token_count)
```
## Testing It
Run two different queries back-to-back and compare the counters before and after `reset_counts()`. You should see totals increase after each query, then return to zero once reset.
If you’re using real documents in data/, make sure the index builds successfully and that the query produces a non-empty response. Then check that prompt and completion token counts are both non-zero for at least one query.
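If you want that check automated, a small assertion-style script works. This assumes the `data/` directory and the setup from the steps above, and it does make real API calls.

```python
# Smoke test: counters rise after a query and drop to zero after reset.
token_handler.reset_counts()
assert token_handler.total_llm_token_count == 0

response = query_engine.query("What topics do these documents cover?")
assert str(response).strip(), "expected a non-empty response"
assert token_handler.prompt_llm_token_count > 0
assert token_handler.completion_llm_token_count > 0

token_handler.reset_counts()
assert token_handler.total_llm_token_count == 0
print("token accounting smoke test passed")
```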
For production validation, wrap each request in a request-scoped object and log (see the sketch after this list):

- user ID or tenant ID
- query text hash
- prompt tokens
- completion tokens
- estimated cost
That gives you an audit trail you can reconcile against provider billing later.
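One possible shape for that record, with a SHA-256 hash standing in for the raw query text. The field names and `build_audit_record` helper are illustrative, and `estimate_cost` is the function defined earlier.

```python
import hashlib
import json
import time

def build_audit_record(user_id: str, query_text: str, handler) -> dict:
    # Hash the query so the audit log never stores raw user text.
    return {
        "ts": time.time(),
        "user_id": user_id,
        "query_hash": hashlib.sha256(query_text.encode()).hexdigest(),
        "prompt_tokens": handler.prompt_llm_token_count,
        "completion_tokens": handler.completion_llm_token_count,
        "estimated_cost_usd": estimate_cost(
            handler.prompt_llm_token_count,
            handler.completion_llm_token_count,
        ),
    }

print(json.dumps(build_audit_record("user-42", "List any compliance issues mentioned.", token_handler)))
```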
## Next Steps
- Add a custom callback handler to emit usage metrics into Prometheus, Datadog, or OpenTelemetry (a minimal sketch follows this list).
- Track costs per tool call inside agent workflows, not just per top-level query.
- Store per-request usage records in Postgres so finance and engineering can both inspect spend by tenant, feature, or route.
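For the first bullet, a custom handler can run on the same `CallbackManager` as the token counter. A minimal sketch, assuming the import paths below match your installed llama-index version; `emit_metric` is a hypothetical stand-in for your metrics client.

```python
from llama_index.core.callbacks.base_handler import BaseCallbackHandler
from llama_index.core.callbacks.schema import CBEventType

def emit_metric(name: str, value: float) -> None:
    # Hypothetical sink; replace with your Prometheus/Datadog/OTel client.
    print(f"metric {name}={value}")

class LLMCallMetricsHandler(BaseCallbackHandler):
    def __init__(self) -> None:
        super().__init__(event_starts_to_ignore=[], event_ends_to_ignore=[])

    def on_event_start(self, event_type, payload=None, event_id="", parent_id="", **kwargs):
        return event_id

    def on_event_end(self, event_type, payload=None, event_id="", **kwargs):
        # Count completed LLM calls; token totals still come from TokenCountingHandler.
        if event_type == CBEventType.LLM:
            emit_metric("llm.calls", 1)

    def start_trace(self, trace_id=None):
        pass

    def end_trace(self, trace_id=None, trace_map=None):
        pass

# Run it alongside the token counter on the same manager.
Settings.callback_manager = CallbackManager([token_handler, LLMCallMetricsHandler()])
```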
## Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.