LlamaIndex Tutorial (Python): adding cost tracking for beginners
This tutorial shows you how to add cost tracking to a LlamaIndex Python app so you can estimate token spend per request and per run. You need this because once you move beyond local prototypes, LLM usage costs become a real operational metric, especially when multiple agents, retries, and tool calls are involved.
What You'll Need
- Python 3.10+
- A working OpenAI API key set as OPENAI_API_KEY
- llama-index installed
- tiktoken installed for token counting
- Basic familiarity with VectorStoreIndex, QueryEngine, and Settings in LlamaIndex
Install the packages (the extra llama-index-embeddings-huggingface package is needed for the local embedding model used below):
pip install llama-index tiktoken llama-index-embeddings-huggingface
Step-by-Step
- Start with a minimal LlamaIndex app that can answer questions from local text. The important part here is not the data source; it is having a baseline query flow where we can attach cost tracking.
from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI

# Use a small OpenAI chat model and a local embedding model.
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = "local:BAAI/bge-small-en-v1.5"

docs = [
    Document(text="LlamaIndex helps connect data sources to LLMs."),
    Document(text="Cost tracking is useful for monitoring token usage."),
]

index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()

response = query_engine.query("Why track costs in an AI app?")
print(response)
- Add a small cost calculator that converts token counts into dollars. For beginners, a simple model-specific rate table is enough; you do not need a full billing system on day one.
from dataclasses import dataclass

@dataclass
class CostRates:
    input_per_1k: float   # USD per 1,000 prompt tokens
    output_per_1k: float  # USD per 1,000 completion tokens

# Published per-1K rates for the models you use; update when pricing changes.
RATES = {
    "gpt-4o-mini": CostRates(input_per_1k=0.00015, output_per_1k=0.00060),
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    rates = RATES[model]
    return (prompt_tokens / 1000 * rates.input_per_1k) + (
        completion_tokens / 1000 * rates.output_per_1k
    )
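Before wiring this into LlamaIndex, it is worth checking the arithmetic by hand with round numbers:

```python
# Manual check: 1,000 prompt tokens and 500 completion tokens on gpt-4o-mini.
prompt_cost = 1000 / 1000 * 0.00015      # $0.00015
completion_cost = 500 / 1000 * 0.00060   # $0.00030
total = prompt_cost + completion_cost
print(f"${total:.5f}")  # $0.00045
```

If estimate_cost("gpt-4o-mini", 1000, 500) does not return the same number, the rate table is misconfigured.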
- Wrap your query call with LlamaIndex’s callback system so you can capture token usage from the actual run. This is the part that makes the estimate grounded in real usage instead of guessed prompt length.
import tiktoken
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

# Count tokens with the same encoding the model actually uses.
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4o-mini").encode
)
Settings.callback_manager = CallbackManager([token_counter])

# Rebuild the query engine so it picks up the new callback manager.
query_engine = index.as_query_engine()

response = query_engine.query("Why track costs in an AI app?")
prompt_tokens = token_counter.prompt_llm_token_count
completion_tokens = token_counter.completion_llm_token_count

print("Answer:", response)
print("Prompt tokens:", prompt_tokens)
print("Completion tokens:", completion_tokens)
- Combine the token counts with your rate table and print the estimated cost for each request. In production, this is where you would send metrics to logs, Prometheus, Datadog, or your database.
model_name = "gpt-4o-mini"
cost = estimate_cost(
    model=model_name,
    prompt_tokens=prompt_tokens,
    completion_tokens=completion_tokens,
)
print(f"Estimated cost for this query: ${cost:.6f}")
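As a stepping stone toward a real metrics pipeline, here is a minimal sketch that emits one JSON log line per request, which most log shippers can parse directly. The log_cost helper and its field names are my own illustration, not part of LlamaIndex:

```python
import json
import logging
import time

logger = logging.getLogger("cost_tracking")
logging.basicConfig(level=logging.INFO)

def log_cost(model: str, prompt_tokens: int, completion_tokens: int, cost: float) -> str:
    # One JSON record per request keeps the data queryable later.
    record = {
        "ts": time.time(),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "estimated_cost_usd": round(cost, 6),
    }
    line = json.dumps(record)
    logger.info(line)
    return line

log_cost("gpt-4o-mini", 1200, 350, 0.00039)
```

Swapping logger.info for a Prometheus counter or a Datadog metric call later does not change the shape of the data you collect.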
- If you want per-request tracking across multiple queries, reset the counter before each call and store the result in a list or log record. That gives you a clean audit trail for individual user actions instead of one cumulative number.
queries = [
    "What does LlamaIndex do?",
    "How do I track token costs?",
]

results = []
for q in queries:
    token_counter.reset_counts()  # start each request from zero
    response = query_engine.query(q)
    prompt_tokens = token_counter.prompt_llm_token_count
    completion_tokens = token_counter.completion_llm_token_count
    cost = estimate_cost("gpt-4o-mini", prompt_tokens, completion_tokens)
    results.append({
        "query": q,
        "answer": str(response),
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "estimated_cost_usd": cost,
    })

for item in results:
    print(item)
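Once you have per-request records like these, rolling them up into a per-run total is a one-line aggregation. The records below are hard-coded with illustrative numbers so the snippet runs standalone:

```python
# Roll per-request cost records up into run-level totals.
results = [
    {"prompt_tokens": 180, "completion_tokens": 60, "estimated_cost_usd": 0.000063},
    {"prompt_tokens": 240, "completion_tokens": 90, "estimated_cost_usd": 0.000090},
]
total_cost = sum(r["estimated_cost_usd"] for r in results)
total_tokens = sum(r["prompt_tokens"] + r["completion_tokens"] for r in results)
print(f"Run total: {total_tokens} tokens, ${total_cost:.6f}")
# Run total: 570 tokens, $0.000153
```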
Testing It
Run the script and confirm that each query returns an answer plus non-zero token counts. If your counts stay at zero, make sure Settings.callback_manager is set before the index and query engine are built (components capture the callback manager when they are created), and that you are using an LLM-backed query engine rather than only embeddings.
You should also verify that the estimated dollar amount changes when you ask longer questions or request more detailed answers. That is usually the fastest sanity check that your tracking is wired correctly.
If you want to test this more rigorously, run the same query several times and compare the counts across runs. Small variations are normal if your model settings allow non-deterministic output.
Next Steps
- Store these metrics alongside user/session IDs so you can see which workflows are expensive.
- Add support for multiple models by keeping a per-model pricing map in config.
- Export token and cost data to your observability stack instead of printing it to stdout.
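The per-model pricing map mentioned above can be sketched like this. The JSON is embedded inline and the rates for gpt-4o are illustrative; in practice you would load the map from a real config file or environment so rates can change without a redeploy:

```python
import json

# Illustrative pricing config; in production, read this from a file or env var.
PRICING_JSON = """
{
  "gpt-4o-mini": {"input_per_1k": 0.00015, "output_per_1k": 0.00060},
  "gpt-4o": {"input_per_1k": 0.00250, "output_per_1k": 0.01000}
}
"""
RATES = json.loads(PRICING_JSON)

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    try:
        rates = RATES[model]
    except KeyError:
        raise ValueError(f"No pricing configured for model: {model}")
    return (prompt_tokens / 1000 * rates["input_per_1k"]) + (
        completion_tokens / 1000 * rates["output_per_1k"]
    )

print(f"{estimate_cost('gpt-4o', 1000, 1000):.4f}")  # 0.0125
```

Failing loudly on an unknown model is a deliberate choice: silently defaulting to a wrong rate is worse than an error.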
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit