LlamaIndex Tutorial (Python): optimizing token usage for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to reduce token spend in a LlamaIndex-based Python app without breaking retrieval quality. You’ll wire in tighter chunking, selective retrieval, metadata filtering, and response synthesis controls so your agent stops paying for irrelevant text.

What You'll Need

  • Python 3.10+
  • llama-index
  • An LLM API key, such as OPENAI_API_KEY
  • A small local document set to test with
  • Basic familiarity with VectorStoreIndex, QueryEngine, and embeddings
  • Optional: tiktoken if you want to inspect token counts more precisely

Step-by-Step

  1. Start by installing the core packages and setting your API key. Keep the dependency surface small; token optimization starts with controlling what gets indexed and retrieved, not adding more tooling.
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai
export OPENAI_API_KEY="your-api-key"
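Before building anything, it can be worth a quick sanity check that the key is actually visible to your process. This is a minimal check using only the standard library; the exact variable name depends on which provider you use.

import os

# Fail fast if the key is missing instead of failing at the first API call.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set in this environment")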
  2. Build your index with smaller chunks and a bounded overlap. Large chunks inflate embedding cost and make retrieval return too much irrelevant text, which then gets pushed into the prompt.
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
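If you want to see how a given chunk size plays out before committing to an index, you can run the configured splitter on its own. This is a small sketch reusing the SentenceSplitter set above; the node count and average length it prints depend entirely on your documents.

# Run the configured splitter directly to see how many nodes it produces.
nodes = Settings.node_parser.get_nodes_from_documents(documents)
avg_chars = sum(len(n.get_content()) for n in nodes) / max(len(nodes), 1)

print(f"Nodes at chunk_size=512: {len(nodes)}")
print(f"Average node length: {avg_chars:.0f} chars")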
  3. Tighten retrieval so the query engine only sends the top few nodes into synthesis. This is where most token waste happens in production: people retrieve 10–20 nodes by default when 2–4 is enough.
query_engine = index.as_query_engine(
    similarity_top_k=3,
    response_mode="compact",
    use_async=False,
)

response = query_engine.query("Summarize the policy exception process.")
print(response)
  4. Add metadata filters when you know the domain slice you need. Filtering before retrieval is cheaper than retrieving broadly and asking the model to ignore half the context.
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter

filters = MetadataFilters(
    filters=[
        ExactMatchFilter(key="department", value="claims"),
        ExactMatchFilter(key="doc_type", value="policy"),
    ]
)

filtered_engine = index.as_query_engine(
    similarity_top_k=2,
    filters=filters,
    response_mode="compact",
)

print(filtered_engine.query("What is the escalation path for exceptions?"))
  5. Use a compact response synthesizer and keep the prompt short. If your use case does not require multi-pass reasoning, avoid expensive synthesis modes such as refine or tree_summarize, which expand token usage across multiple internal LLM calls.
from llama_index.core.response_synthesizers import get_response_synthesizer

synthesizer = get_response_synthesizer(response_mode="compact")

query_engine = index.as_query_engine(
    similarity_top_k=2,
    response_synthesizer=synthesizer,
)

answer = query_engine.query("List the approval steps for claims overrides.")
print(answer)
  6. For advanced control, inspect how many tokens your retrieved context would consume before sending it to the model. This lets you enforce hard budgets in production instead of guessing after invoices arrive.
from llama_index.core.schema import QueryBundle

query_bundle = QueryBundle(query_str="What are the approval steps for claims overrides?")
retriever = index.as_retriever(similarity_top_k=2)

nodes = retriever.retrieve(query_bundle)
context_text = "\n\n".join(node.get_content() for node in nodes)

print(f"Retrieved nodes: {len(nodes)}")
print(f"Approx chars: {len(context_text)}")
print(context_text[:1200])
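If you installed the optional tiktoken dependency from the prerequisites, you can turn that character count into an actual token count and enforce a hard budget. This sketch assumes the o200k_base encoding (used by the gpt-4o model family) and a 3,000-token context budget; both are illustrative values, not requirements.

import tiktoken

# Count tokens in the retrieved context and refuse to exceed a hard budget.
encoding = tiktoken.get_encoding("o200k_base")
context_tokens = len(encoding.encode(context_text))

MAX_CONTEXT_TOKENS = 3000  # illustrative budget; tune per workload
print(f"Context tokens: {context_tokens}")
if context_tokens > MAX_CONTEXT_TOKENS:
    raise ValueError(
        f"Retrieved context ({context_tokens} tokens) exceeds the "
        f"{MAX_CONTEXT_TOKENS}-token budget; lower similarity_top_k or chunk_size."
    )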

Testing It

Run a few representative queries and compare output quality against your old configuration. You want fewer retrieved nodes, shorter prompts, and answers that still cite the right source material.

Check these signals:

  • The answer stays grounded in your documents
  • Retrieved context is visibly shorter than before
  • Queries return faster under load
  • Token usage drops when you compare request logs from your LLM provider

If you want a hard test, run one query through both versions of the pipeline and compare prompt tokens in your provider dashboard. In practice, shrinking chunk size from 1,024 to 512 and reducing similarity_top_k from 8 to 2 or 3 usually gives an immediate cost drop.
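If you prefer to measure the difference in code rather than in the dashboard, one option is to retrieve the same question at both top_k values and compare the context size that would be sent to the model. The context_chars helper below is purely illustrative, not a LlamaIndex API, and the values 8 and 3 are just the example settings from this guide.

def context_chars(idx, top_k: int, question: str) -> int:
    # Retrieve at the given top_k and measure the context that would be sent.
    retriever = idx.as_retriever(similarity_top_k=top_k)
    retrieved = retriever.retrieve(question)
    return sum(len(n.get_content()) for n in retrieved)

question = "Summarize the policy exception process."
before = context_chars(index, top_k=8, question=question)
after = context_chars(index, top_k=3, question=question)

print(f"Context before: {before} chars, after: {after} chars")
print(f"Reduction: {1 - after / before:.0%}")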

Next Steps

  • Add reranking so you can retrieve more candidates but only pass the best few into synthesis (see the sketch after this list).
  • Learn RouterQueryEngine so different question types hit different indexes or tools.
  • Put token budgets behind config flags per tenant or per workflow stage so finance-heavy flows can be stricter than internal search.
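As a starting point for the reranking idea above, here is a sketch using SentenceTransformerRerank as a node postprocessor. It assumes sentence-transformers is installed and that the default cross-encoder/ms-marco-MiniLM-L-6-v2 model is acceptable for your data; the pattern is to retrieve a wider candidate pool, then let the reranker decide what reaches synthesis.

from llama_index.core.postprocessor import SentenceTransformerRerank

# Retrieve a wide candidate pool, then keep only the best-ranked nodes.
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=3,
)

reranked_engine = index.as_query_engine(
    similarity_top_k=10,             # wide recall
    node_postprocessors=[reranker],  # narrow what reaches synthesis
    response_mode="compact",
)

print(reranked_engine.query("Summarize the policy exception process."))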

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

