LlamaIndex Tutorial (Python): optimizing token usage for intermediate developers
This tutorial shows you how to reduce token usage in a LlamaIndex Python app without breaking retrieval quality. You need this when your prompts are too large, your chunking is wasteful, or your retrieval pipeline is sending too much context to the LLM.
What You'll Need
- Python 3.10+
- A virtual environment
- The llama-index package
- An OpenAI API key
- A small text corpus in .txt files for local testing
- Basic familiarity with VectorStoreIndex, Settings, and QueryEngine
Step-by-Step
- Start by installing the core packages and setting your API key. Keep the dependency list tight so you can measure token usage without extra moving parts.
pip install llama-index openai tiktoken
export OPENAI_API_KEY="your-api-key"
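Before changing anything, it helps to confirm you can count tokens locally. This is a minimal sanity check, assuming a recent tiktoken release that maps the gpt-4o model family to its o200k_base encoding:

import tiktoken

# Resolve the encoding tiktoken uses for the gpt-4o family (requires a recent tiktoken)
enc = tiktoken.encoding_for_model("gpt-4o-mini")

# Token counts from this encoder approximate what the API bills for raw text
print(len(enc.encode("How many tokens is this sentence?")))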
- Load documents with a chunk size that matches your use case. Smaller chunks can improve precision, but too-small chunks increase retrieval overhead and prompt assembly cost. A chunk-size sketch follows the loading code below.
from llama_index.core import SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# A small, inexpensive model with temperature=0 keeps token measurements repeatable
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Load every file under ./data into Document objects
documents = SimpleDirectoryReader("./data").load_data()
print(f"Loaded {len(documents)} documents")
- Build an index with conservative defaults, then tune retrieval to send fewer nodes into the prompt. The main idea is to retrieve fewer chunks and keep the context window focused on only what matters.
from llama_index.core import VectorStoreIndex

# Embed and index the documents (a one-time cost, using the embedding model above)
index = VectorStoreIndex.from_documents(
    documents,
    show_progress=True,
)

query_engine = index.as_query_engine(
    similarity_top_k=3,       # retrieve only the 3 closest chunks
    response_mode="compact",  # pack retrieved text into as few LLM calls as possible
)

response = query_engine.query("What are the main policy exclusions?")
print(response)
- Add a postprocessor that trims low-value nodes before they reach the LLM. This is one of the simplest ways to cut tokens because it removes weak matches that still consume prompt space. A quick check after this step's code shows how many nodes survive the cutoff.
from llama_index.core.postprocessor import SimilarityPostprocessor

query_engine = index.as_query_engine(
    similarity_top_k=5,  # over-retrieve slightly...
    node_postprocessors=[
        # ...then drop anything scoring below the cutoff before prompt assembly
        SimilarityPostprocessor(similarity_cutoff=0.75),
    ],
    response_mode="compact",
)

response = query_engine.query("Summarize the cancellation terms.")
print(response)
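To see the cutoff's effect directly, you can retrieve scored nodes without synthesis and count how many clear the bar. A minimal sketch, assuming the index from above and the same 0.75 cutoff:

# Retrieve scored nodes only; no LLM call happens here
retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve("Summarize the cancellation terms.")

kept = [n for n in nodes if n.score is not None and n.score >= 0.75]
print(f"{len(kept)} of {len(nodes)} retrieved nodes pass the 0.75 cutoff")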
- Use metadata filters when you already know part of the answer domain. If you can narrow by source, date, product line, or jurisdiction, you avoid paying tokens for irrelevant chunks. The filter below assumes your documents carry a document_type metadata field; a sketch after this step's code shows one way to attach it at load time.
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter

# Only nodes whose metadata has document_type == "policy" become retrieval candidates
filters = MetadataFilters(filters=[
    ExactMatchFilter(key="document_type", value="policy"),
])

query_engine = index.as_query_engine(
    similarity_top_k=3,
    filters=filters,
    response_mode="compact",
)

response = query_engine.query("What happens after a missed payment?")
print(response)
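Filters only work if the metadata exists. One way to attach it is at load time via SimpleDirectoryReader's file_metadata hook. The rule below, which tags everything under ./data as a "policy", is a stand-in for whatever logic fits your corpus:

# file_metadata receives each file path and returns a metadata dict for that document
documents = SimpleDirectoryReader(
    "./data",
    file_metadata=lambda path: {"document_type": "policy"},
).load_data()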
- Measure whether your changes actually reduce token usage. For production work, you want to compare prompt size before and after each optimization instead of guessing.
import tiktoken
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

# Count with the same encoding the model uses; the handler's default
# tokenizer may not match gpt-4o-mini
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4o-mini").encode
)

# Register the handler before creating the query engine so events propagate
Settings.callback_manager = CallbackManager([token_counter])

query_engine = index.as_query_engine(
    similarity_top_k=3,
    response_mode="compact",
)

_ = query_engine.query("Explain the claims filing timeline.")

print(f"Prompt tokens: {token_counter.prompt_llm_token_count}")
print(f"Completion tokens: {token_counter.completion_llm_token_count}")
print(f"Total tokens: {token_counter.total_llm_token_count}")
Testing It
Run the same query before and after each change and compare token counts from TokenCountingHandler. If your counts drop while answer quality stays stable, the optimization is doing its job.
Also check whether retrieved nodes are still relevant by printing them during debugging. If answers get shorter but less accurate, raise similarity_top_k slightly or lower the similarity cutoff until you find the balance.
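A minimal way to do that inspection, using the scored source nodes attached to every response:

# Every response carries the retrieved nodes it was synthesized from
response = query_engine.query("Explain the claims filing timeline.")
for node in response.source_nodes:
    # Score plus a snippet of the chunk text that consumed prompt tokens
    print(node.score, node.node.get_content()[:100])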
For a cleaner test, use 5 to 10 repeated queries from the same domain and record average token usage. That gives you a real baseline instead of a one-off result.
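A sketch of that baseline loop, assuming the token_counter from the measurement step and a hypothetical list of domain queries:

# Hypothetical queries from one domain; replace with your own
queries = [
    "What are the main policy exclusions?",
    "Summarize the cancellation terms.",
    "What happens after a missed payment?",
    "Explain the claims filing timeline.",
]

totals = []
for q in queries:
    token_counter.reset_counts()  # isolate each query's usage
    _ = query_engine.query(q)
    totals.append(token_counter.total_llm_token_count)

print(f"Average total tokens per query: {sum(totals) / len(totals):.0f}")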
Next Steps
- Learn SentenceWindowNodeParser for better chunk boundaries with lower context waste (see the sketch after this list)
- Add reranking with CohereRerank or a local reranker before synthesis
- Move from plain vector search to hybrid retrieval when keyword precision matters
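As a taste of the first item, here is a minimal sketch of sentence-window parsing. It assumes you pair the parser with MetadataReplacementPostProcessor at query time, so only the matched sentence's surrounding window enters the prompt:

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Index single sentences, but store a window of surrounding sentences as metadata
parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
)
nodes = parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

# At query time, swap each matched sentence for its stored window before synthesis
query_engine = index.as_query_engine(
    similarity_top_k=3,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window"),
    ],
)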
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.