LlamaIndex Tutorial (Python): adding cost tracking for intermediate developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to add token and cost tracking to a LlamaIndex Python app so you can see what each query is costing you. You need this when your RAG app starts moving from local experiments to real usage, where silent token burn becomes a budget problem fast.

What You'll Need

  • Python 3.10+
  • A working OpenAI API key
  • llama-index
  • llama-index-llms-openai
  • llama-index-embeddings-openai
  • A .env file or exported environment variables
  • Basic familiarity with:
    • VectorStoreIndex
    • SimpleDirectoryReader
    • QueryEngine

Install the packages:

pip install llama-index llama-index-llms-openai llama-index-embeddings-openai python-dotenv

Set your API key:

export OPENAI_API_KEY="your-key-here"

Step-by-Step

  1. Start by wiring up the LLM and embedding model explicitly. This keeps the example deterministic and avoids hidden defaults that make cost tracking harder to reason about.
import os

from dotenv import load_dotenv
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

load_dotenv()  # pull OPENAI_API_KEY from a local .env file if present
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"

Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
  2. Load your documents and build an index. In a real project, this would usually be policy docs, claims notes, product manuals, or internal knowledge base content.
documents = SimpleDirectoryReader("./data").load_data()

index = VectorStoreIndex.from_documents(
    documents,
    show_progress=True,
)
  3. Create a query engine and use LlamaIndex’s built-in callback manager to capture token usage. The TokenCountingHandler gives you per-run token totals, which is the foundation for cost attribution. The handler only counts events fired after it is registered, so set Settings.callback_manager earlier (before building the index) if you also want the embedding tokens from indexing included.
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

token_counter = TokenCountingHandler()
Settings.callback_manager = CallbackManager([token_counter])

query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What does the documentation say about refunds?")
print(response)
  4. Convert token usage into estimated dollar cost. LlamaIndex gives you tokens; your code should own the pricing math so it stays explicit and easy to update when model prices change.
def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    prompt_rate = 0.15 / 1_000_000   # example: $0.15 per 1M input tokens
    completion_rate = 0.60 / 1_000_000  # example: $0.60 per 1M output tokens
    return (prompt_tokens * prompt_rate) + (completion_tokens * completion_rate)

print(f"Prompt tokens: {token_counter.prompt_llm_token_count}")
print(f"Completion tokens: {token_counter.completion_llm_token_count}")
print(f"Embedding tokens: {token_counter.embedding_token_count}")

estimated = estimate_cost(
    token_counter.prompt_llm_token_count,
    token_counter.completion_llm_token_count,
)
print(f"Estimated LLM cost: ${estimated:.6f}")
  5. Wrap the query flow in a helper so every request gets tracked consistently. This is the pattern you want in production because it makes logging, metrics export, and request-level attribution straightforward.
def run_tracked_query(question: str):
    token_counter.reset_counts()
    response = query_engine.query(question)

    cost = estimate_cost(
        token_counter.prompt_llm_token_count,
        token_counter.completion_llm_token_count,
    )

    return {
        "answer": str(response),
        "prompt_tokens": token_counter.prompt_llm_token_count,
        "completion_tokens": token_counter.completion_llm_token_count,
        "embedding_tokens": token_counter.embedding_token_count,
        "estimated_cost_usd": round(cost, 6),
    }

result = run_tracked_query("Summarize the refund policy in one paragraph.")
print(result)
  6. If you want stronger observability, log the result structure to your application logs or ship it to your metrics stack. For most teams, that means storing request ID, user ID, question text hash, and estimated cost together; a sketch of such a record follows the snippet below.
import json

tracked = run_tracked_query("List the main onboarding steps.")
print(json.dumps(tracked, indent=2))
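
As a rough sketch of what that stored record could look like, the snippet below wraps run_tracked_query with a generated request ID, a hashed question, and the tracked numbers. The field names, the logger name, and the user_id value are illustrative placeholders, not anything LlamaIndex prescribes.

import hashlib
import logging
import uuid

logger = logging.getLogger("rag.cost")  # placeholder logger name
logging.basicConfig(level=logging.INFO)

def log_tracked_query(question: str, user_id: str) -> dict:
    # Run a tracked query and emit one structured log line per request.
    result = run_tracked_query(question)
    record = {
        "request_id": str(uuid.uuid4()),  # swap in your real request ID if you have one
        "user_id": user_id,               # however your app identifies callers
        "question_sha256": hashlib.sha256(question.encode()).hexdigest(),
        "prompt_tokens": result["prompt_tokens"],
        "completion_tokens": result["completion_tokens"],
        "embedding_tokens": result["embedding_tokens"],
        "estimated_cost_usd": result["estimated_cost_usd"],
    }
    logger.info(json.dumps(record))
    return result

log_tracked_query("List the main onboarding steps.", user_id="demo-user")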

Testing It

Run the script against a small folder of text files in ./data, then ask two different questions and compare the reported counts. You should see prompt tokens increase when retrieved context grows, and completion tokens change based on answer length.

If all counts stay at zero, check that Settings.callback_manager is set before you create or call the query engine. Also verify that your OpenAI key is loaded and that you installed the OpenAI-specific LlamaIndex integrations.

For a quick sanity check, ask one short factual question and one broad summarization question. The summarization query should usually cost more because it pulls more context and produces a longer completion.
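
A quick way to run that comparison, assuming the run_tracked_query helper from step 5 is in scope; the two questions are just examples:

short_result = run_tracked_query("What is the refund window in days?")
broad_result = run_tracked_query("Summarize the entire refund policy and its exceptions.")

for label, res in (("short", short_result), ("broad", broad_result)):
    print(
        f"{label}: prompt={res['prompt_tokens']} "
        f"completion={res['completion_tokens']} "
        f"cost=${res['estimated_cost_usd']}"
    )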

Next Steps

  • Add per-request logging with a request ID so costs can be traced back to users or workflows.
  • Export these numbers to Prometheus or OpenTelemetry instead of printing them.
  • Add model-specific pricing tables so you can track costs across multiple LLMs and embedding models without changing code paths; a minimal sketch follows below.
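
One way to structure that table is to keep per-model rates as data and look them up at cost time; the model names and dollar figures below are illustrative placeholders you should replace with the prices you actually pay:

# Per-1M-token rates keyed by model name; values are placeholders, not live pricing.
PRICING = {
    "gpt-4o-mini": {"prompt": 0.15, "completion": 0.60},
    "text-embedding-3-small": {"prompt": 0.02, "completion": 0.0},
}

def estimate_cost_for(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    rates = PRICING[model]
    return (prompt_tokens * rates["prompt"] + completion_tokens * rates["completion"]) / 1_000_000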

