LlamaIndex Tutorial (Python): adding cost tracking for beginners
This tutorial shows you how to add cost tracking to a LlamaIndex Python app so you can estimate token spend per request and per run. You need this because once you move beyond local prototypes, LLM usage costs become a real operational metric, especially when multiple agents, retries, and tool calls are involved.
What You'll Need
- Python 3.10+
- A working OpenAI API key set as OPENAI_API_KEY
- llama-index installed
- tiktoken installed for token counting
- Basic familiarity with VectorStoreIndex, QueryEngine, and Settings in LlamaIndex
Install the packages (the extra llama-index-embeddings-huggingface package is needed for the local embedding model used below):
pip install llama-index tiktoken llama-index-embeddings-huggingface
Step-by-Step
- Start with a minimal LlamaIndex app that can answer questions from local text. The important part here is not the data source; it is having a baseline query flow where we can attach cost tracking.
from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI

# Use a small OpenAI chat model and a local embedding model.
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = "local:BAAI/bge-small-en-v1.5"

docs = [
    Document(text="LlamaIndex helps connect data sources to LLMs."),
    Document(text="Cost tracking is useful for monitoring token usage."),
]

index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()

response = query_engine.query("Why track costs in an AI app?")
print(response)
- Add a small cost calculator that converts token counts into dollars. For beginners, a simple model-specific rate table is enough; you do not need a full billing system on day one.
from dataclasses import dataclass

@dataclass
class CostRates:
    input_per_1k: float   # USD per 1,000 prompt tokens
    output_per_1k: float  # USD per 1,000 completion tokens

# Published per-1K rates for the models you use; update when pricing changes.
RATES = {
    "gpt-4o-mini": CostRates(input_per_1k=0.00015, output_per_1k=0.00060),
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    rates = RATES[model]
    return (prompt_tokens / 1000 * rates.input_per_1k) + (
        completion_tokens / 1000 * rates.output_per_1k
    )
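Before wiring this into LlamaIndex, it is worth checking the arithmetic by hand with round numbers:

```python
# Manual check: 1,000 prompt tokens and 500 completion tokens on gpt-4o-mini.
prompt_cost = 1000 / 1000 * 0.00015      # $0.00015
completion_cost = 500 / 1000 * 0.00060   # $0.00030
total = prompt_cost + completion_cost
print(f"${total:.5f}")  # $0.00045
```

If estimate_cost("gpt-4o-mini", 1000, 500) does not return the same number, the rate table is misconfigured.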
- Wrap your query call with LlamaIndex’s callback system so you can capture token usage from the actual run. This is the part that makes the estimate grounded in real usage instead of guessed prompt length.
import tiktoken
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

# Count tokens with the same encoding the model actually uses.
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4o-mini").encode
)
Settings.callback_manager = CallbackManager([token_counter])

# Rebuild the query engine so it picks up the new callback manager.
query_engine = index.as_query_engine()

response = query_engine.query("Why track costs in an AI app?")
prompt_tokens = token_counter.prompt_llm_token_count
completion_tokens = token_counter.completion_llm_token_count

print("Answer:", response)
print("Prompt tokens:", prompt_tokens)
print("Completion tokens:", completion_tokens)
- Combine the token counts with your rate table and print the estimated cost for each request. In production, this is where you would send metrics to logs, Prometheus, Datadog, or your database.
model_name = "gpt-4o-mini"
cost = estimate_cost(
    model=model_name,
    prompt_tokens=prompt_tokens,
    completion_tokens=completion_tokens,
)
print(f"Estimated cost for this query: ${cost:.6f}")
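As a stepping stone toward a real metrics pipeline, here is a minimal sketch that emits one JSON log line per request, which most log shippers can parse directly. The log_cost helper and its field names are my own illustration, not part of LlamaIndex:

```python
import json
import logging
import time

logger = logging.getLogger("cost_tracking")
logging.basicConfig(level=logging.INFO)

def log_cost(model: str, prompt_tokens: int, completion_tokens: int, cost: float) -> str:
    # One JSON record per request keeps the data queryable later.
    record = {
        "ts": time.time(),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "estimated_cost_usd": round(cost, 6),
    }
    line = json.dumps(record)
    logger.info(line)
    return line

log_cost("gpt-4o-mini", 1200, 350, 0.00039)
```

Swapping logger.info for a Prometheus counter or a Datadog metric call later does not change the shape of the data you collect.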
- If you want per-request tracking across multiple queries, reset the counter before each call and store the result in a list or log record. That gives you a clean audit trail for individual user actions instead of one cumulative number.
queries = [
    "What does LlamaIndex do?",
    "How do I track token costs?",
]

results = []
for q in queries:
    token_counter.reset_counts()  # start each request from zero
    response = query_engine.query(q)
    prompt_tokens = token_counter.prompt_llm_token_count
    completion_tokens = token_counter.completion_llm_token_count
    cost = estimate_cost("gpt-4o-mini", prompt_tokens, completion_tokens)
    results.append({
        "query": q,
        "answer": str(response),
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "estimated_cost_usd": cost,
    })

for item in results:
    print(item)
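Once you have per-request records like these, rolling them up into a per-run total is a one-line aggregation. The records below are hard-coded with illustrative numbers so the snippet runs standalone:

```python
# Roll per-request cost records up into run-level totals.
results = [
    {"prompt_tokens": 180, "completion_tokens": 60, "estimated_cost_usd": 0.000063},
    {"prompt_tokens": 240, "completion_tokens": 90, "estimated_cost_usd": 0.000090},
]
total_cost = sum(r["estimated_cost_usd"] for r in results)
total_tokens = sum(r["prompt_tokens"] + r["completion_tokens"] for r in results)
print(f"Run total: {total_tokens} tokens, ${total_cost:.6f}")
# Run total: 570 tokens, $0.000153
```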
Testing It
Run the script and confirm that each query returns an answer plus non-zero token counts. If your counts stay at zero, make sure Settings.callback_manager is set before the index and query engine are built (components capture the callback manager when they are created), and that you are using an LLM-backed query engine rather than only embeddings.
You should also verify that the estimated dollar amount changes when you ask longer questions or request more detailed answers. That is usually the fastest sanity check that your tracking is wired correctly.
If you want to test this more rigorously, run the same query several times and compare the counts across runs. Small variations are normal if your model settings allow non-deterministic output.
Next Steps
- Store these metrics alongside user/session IDs so you can see which workflows are expensive.
- Add support for multiple models by keeping a per-model pricing map in config.
- Export token and cost data to your observability stack instead of printing it to stdout.
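The per-model pricing map mentioned above can be sketched like this. The JSON is embedded inline and the rates for gpt-4o are illustrative; in practice you would load the map from a real config file or environment so rates can change without a redeploy:

```python
import json

# Illustrative pricing config; in production, read this from a file or env var.
PRICING_JSON = """
{
  "gpt-4o-mini": {"input_per_1k": 0.00015, "output_per_1k": 0.00060},
  "gpt-4o": {"input_per_1k": 0.00250, "output_per_1k": 0.01000}
}
"""
RATES = json.loads(PRICING_JSON)

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    try:
        rates = RATES[model]
    except KeyError:
        raise ValueError(f"No pricing configured for model: {model}")
    return (prompt_tokens / 1000 * rates["input_per_1k"]) + (
        completion_tokens / 1000 * rates["output_per_1k"]
    )

print(f"{estimate_cost('gpt-4o', 1000, 1000):.4f}")  # 0.0125
```

Failing loudly on an unknown model is a deliberate choice: silently defaulting to a wrong rate is worse than an error.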
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit