LlamaIndex Tutorial (Python): optimizing token usage for beginners

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to build a basic LlamaIndex RAG pipeline in Python while reducing token usage at each step. You need this when your prompts are getting expensive, your context windows are filling up too fast, or you want a beginner-friendly setup that still behaves like production code.

What You'll Need

  • Python 3.10+
  • A virtual environment
  • llama-index
  • llama-index-llms-openai
  • llama-index-embeddings-openai
  • An OpenAI API key
  • A small local document set, or a few text files to test with

Install the packages first:

pip install llama-index llama-index-llms-openai llama-index-embeddings-openai

Set your API key in the environment:

export OPENAI_API_KEY="your-api-key"

Step-by-Step

  1. Start with a minimal index and keep your documents small. Token waste usually begins before retrieval even happens, so use short, focused source files instead of dumping huge PDFs into the pipeline.
from pathlib import Path

docs_dir = Path("docs")
docs_dir.mkdir(exist_ok=True)

(docs_dir / "claims_policy.txt").write_text(
    "Claims must be filed within 30 days. "
    "Required documents include ID, incident report, and receipts. "
    "Escalate fraud cases to compliance."
)

(docs_dir / "billing_policy.txt").write_text(
    "Billing disputes should be reviewed within 5 business days. "
    "Refunds require manager approval above $500."
)
  2. Load only what you need and use smaller chunks. Smaller chunks reduce embedding cost and stop the retriever from dragging in unnecessary text.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("docs").load_data()

splitter = SentenceSplitter(chunk_size=120, chunk_overlap=20)  # sizes are counted in tokens
nodes = splitter.get_nodes_from_documents(documents)

index = VectorStoreIndex(nodes)
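
Before moving on, it can help to check what the splitter actually produced. This is just a quick sanity check using the nodes list from above, nothing LlamaIndex-specific beyond get_content():

# Quick sanity check: how many chunks, and roughly how big is each one?
print(len(nodes), "nodes")
for node in nodes:
    print(len(node.get_content()), "chars:", node.get_content()[:60])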
  3. Configure the retriever to return fewer nodes. For beginners, similarity_top_k=2 is usually enough to prove the pattern without flooding the prompt with extra context.
# A standalone retriever with the same cap, useful for inspecting what gets
# retrieved before it reaches the LLM (used in the sketch after this block).
retriever = index.as_retriever(similarity_top_k=2)

query_engine = index.as_query_engine(
    similarity_top_k=2,
    response_mode="compact",
)

response = query_engine.query("What documents are needed for a claims filing?")
print(response)
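
If you want proof that only two small chunks are reaching the model, call the standalone retriever directly before any LLM is involved. A minimal sketch using the retriever defined above; retrieve() returns scored nodes whose text you can print:

# Inspect exactly what would be packed into the prompt, with similarity scores.
retrieved = retriever.retrieve("What documents are needed for a claims filing?")
for item in retrieved:
    print(round(item.score, 3), item.node.get_content()[:80])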
  4. Use a cheaper model for retrieval-heavy workflows and keep generation constrained. If your app mostly answers short factual questions, a smaller model is often enough and saves tokens immediately.
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Rebuild the index after switching the embedding model so that stored
# document embeddings and query embeddings come from the same model.
index = VectorStoreIndex(nodes)

query_engine = index.as_query_engine(
    similarity_top_k=2,
    response_mode="compact",
)

response = query_engine.query("When do billing disputes need review?")
print(response)
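
To keep generation constrained on the output side as well, you can cap completion length on the LLM itself. The 150 below is an arbitrary example value, not a recommendation:

# Cap completion length so verbose answers can't inflate output tokens.
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0, max_tokens=150)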
  5. Trim prompt overhead by using concise system instructions and no extra chat history unless you need it. Every extra message adds tokens to every request, so keep the interaction state lean.
from llama_index.core.prompts import PromptTemplate

custom_prompt = PromptTemplate(
    "Answer the question using only the context.\n"
    "If the answer is missing, say 'I don't know'.\n"
    "Context:\n{context_str}\n\nQuestion: {query_str}\nAnswer:"
)

query_engine = index.as_query_engine(
    similarity_top_k=2,
    response_mode="compact",
    text_qa_template=custom_prompt,
)

response = query_engine.query("What is the refund threshold?")
print(response)
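
If you later decide you do need conversation state, you can still keep it lean by capping chat memory instead of letting history grow without bound. A minimal sketch against the same index; the token_limit is an example value:

from llama_index.core.memory import ChatMemoryBuffer

# Keep roughly 1,000 tokens of history; older turns are dropped automatically.
memory = ChatMemoryBuffer.from_defaults(token_limit=1000)

chat_engine = index.as_chat_engine(chat_mode="context", memory=memory)
print(chat_engine.chat("What is the refund threshold?"))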

Testing It

Run the script and ask two or three direct questions from the source files. You should see short answers that come back quickly without a long block of retrieved text in the output.

Then change similarity_top_k from 2 to 5 and compare latency and answer length. You will usually see more context included, which is useful for recall but worse for token usage.

Finally, increase chunk_size to something like 500 and notice how retrieval becomes less precise on small policy docs. That tradeoff matters in production because bigger chunks often mean more irrelevant tokens get sent to the model.
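
Eyeballing answer length works, but you can also measure tokens directly with LlamaIndex's TokenCountingHandler. The sketch below assumes the Settings, nodes, and imports from the steps above and uses tiktoken for counting (install it separately if it is not already pulled in as a dependency):

import tiktoken
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

# Count tokens for every embedding and LLM call routed through Settings.
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4o-mini").encode
)
Settings.callback_manager = CallbackManager([token_counter])

# Rebuild so the index and its query engine pick up the callback manager.
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine(similarity_top_k=2, response_mode="compact")
query_engine.query("What documents are needed for a claims filing?")

print("embedding tokens:", token_counter.total_embedding_token_count)
print("prompt tokens:", token_counter.prompt_llm_token_count)
print("completion tokens:", token_counter.completion_llm_token_count)

# Reset the counts, switch similarity_top_k to 5, and rerun to compare.
token_counter.reset_counts()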

Next Steps

  • Add metadata filters so retrieval only searches relevant document types (see the sketch after this list).
  • Learn TokenTextSplitter and compare it against SentenceSplitter.
  • Move from basic querying to a RetrieverQueryEngine with reranking when precision matters more than raw speed.
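
For the first item, here is a rough sketch of metadata filtering. It assumes SimpleDirectoryReader attached a file_name entry to each node's metadata (the default in recent llama-index releases) and that the default vector store honors exact-match filters; verify the key name against your own nodes before relying on it:

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Only search nodes that came from the claims policy file.
claims_filters = MetadataFilters(
    filters=[ExactMatchFilter(key="file_name", value="claims_policy.txt")]
)

claims_engine = index.as_query_engine(similarity_top_k=2, filters=claims_filters)
print(claims_engine.query("How long do I have to file a claim?"))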

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
