LlamaIndex Tutorial (TypeScript): optimizing token usage for beginners

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to reduce token usage in a TypeScript LlamaIndex app without breaking retrieval quality. You need this when your prompts are too large, your context windows are getting burned on irrelevant text, or your API bill is climbing because every query ships too much data.

What You'll Need

  • Node.js 18+
  • A TypeScript project with tsconfig.json
  • npm or pnpm
  • OpenAI API key in OPENAI_API_KEY
  • Packages:
    • llamaindex
    • dotenv
    • typescript
    • tsx for running TypeScript directly during development

Install them like this:

npm install llamaindex dotenv
npm install -D typescript tsx @types/node

Step-by-Step

  1. Start with a minimal project setup that loads your API key and keeps the runtime simple. The goal here is to make sure every later optimization is measurable, not hidden behind setup noise.
import "dotenv/config";
import { Document, OpenAI, Settings } from "llamaindex";

Settings.llm = new OpenAI({
  model: "gpt-4o-mini",
});

const docs = [
  new Document({ text: "LlamaIndex helps build RAG apps with structured retrieval." }),
  new Document({ text: "Token usage grows when you send too much context to the model." }),
];

console.log(`Loaded ${docs.length} documents`);
  2. Use smaller chunk sizes when indexing so retrieval pulls back tighter context blocks. This is one of the easiest ways to lower token usage, because the query engine no longer carries huge chunks around just to answer a small question. In LlamaIndex.TS, chunk size and overlap live on the global Settings object rather than on the index itself.
import "dotenv/config";
import { Document, OpenAI, Settings, VectorStoreIndex } from "llamaindex";

Settings.llm = new OpenAI({ model: "gpt-4o-mini" });

// Chunking is configured globally on Settings in LlamaIndex.TS,
// not passed as options to VectorStoreIndex.fromDocuments.
Settings.chunkSize = 80;
Settings.chunkOverlap = 10;

async function main() {
  const docs = [
    new Document({
      text: "LlamaIndex supports retrieval augmented generation and document indexing.",
    }),
    new Document({
      text: "Smaller chunks can reduce prompt size but may lose some surrounding context.",
    }),
  ];

  const index = await VectorStoreIndex.fromDocuments(docs);

  console.log("Index built");
}

main();
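
Before committing to a chunk size, it can help to run the splitter on its own and see how many chunks each setting produces. Here is a quick sketch, assuming the SentenceSplitter export from llamaindex; exact splitting behavior may vary between versions.

import { SentenceSplitter } from "llamaindex";

const text =
  "LlamaIndex supports retrieval augmented generation and document indexing. " +
  "Smaller chunks can reduce prompt size but may lose some surrounding context.";

// Compare how many chunks each setting produces before picking one.
for (const chunkSize of [40, 80, 160]) {
  const splitter = new SentenceSplitter({ chunkSize, chunkOverlap: 10 });
  const chunks = splitter.splitText(text);
  console.log(`chunkSize=${chunkSize}: ${chunks.length} chunk(s)`);
}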
  3. Limit how many nodes you retrieve for each question. Beginners usually leave defaults alone, then wonder why every response includes too much source material; setting similarityTopK gives you a direct cap on retrieved context.
import "dotenv/config";
import { Document, OpenAI, Settings, VectorStoreIndex } from "llamaindex";

Settings.llm = new OpenAI({ model: "gpt-4o-mini" });

async function main() {
  const docs = [
    new Document({ text: "Claims workflows should minimize manual review." }),
    new Document({ text: "Policy documents often contain repeated legal boilerplate." }),
    new Document({ text: "Customer support notes can be noisy and long." }),
  ];

  const index = await VectorStoreIndex.fromDocuments(docs);
  const queryEngine = index.asQueryEngine({
    similarityTopK: 2,
  });

  const response = await queryEngine.query({
    query: "What reduces unnecessary token usage?",
  });

  console.log(response.toString());
}

main();
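
If you want to see exactly what a top-k cap lets through, you can query the retriever directly and inspect the nodes before they ever reach the LLM. This is a sketch assuming the asRetriever API; node and score field names may differ slightly across llamaindex versions.

import "dotenv/config";
import { Document, MetadataMode, VectorStoreIndex } from "llamaindex";

async function main() {
  const docs = [
    new Document({ text: "Claims workflows should minimize manual review." }),
    new Document({ text: "Policy documents often contain repeated legal boilerplate." }),
    new Document({ text: "Customer support notes can be noisy and long." }),
  ];

  const index = await VectorStoreIndex.fromDocuments(docs);
  const retriever = index.asRetriever({ similarityTopK: 2 });

  // Inspect what retrieval returns before it is stuffed into the prompt.
  const nodes = await retriever.retrieve({ query: "What reduces token usage?" });
  for (const { node, score } of nodes) {
    console.log(`score=${score?.toFixed(3)} chars=${node.getContent(MetadataMode.NONE).length}`);
  }
}

main();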
  4. Keep the prompt compact and the answer format strict. If you let the model improvise, it tends to spend tokens restating context; a short instruction embedded in the query, plus a maxTokens cap on the LLM, keeps outputs focused.
import "dotenv/config";
import {
  Document,
  VectorStoreIndex,
} from "llamaindex";

async function main() {
  const docs = [
    new Document({ text: "Use top-k retrieval to limit context size." }),
    new Document({ text: "Keep chunk sizes small and relevant." }),
    new Document({ text: "Summarize long documents before indexing when possible." }),
  ];

  const index = await VectorStoreIndex.fromDocuments(docs);
  const queryEngine = index.asQueryEngine({
    similarityTopK: 1,
    responseSynthesizer: {
      getResponseBuilderPrompt() {
        return `
Answer in one sentence.
Use only the retrieved context.
Do not repeat the question.
`.trim();
      },
    },
  });

  const response = await queryEngine.query({
    query: "How do I reduce token usage?",
  });

  console.log(response.toString());
}

main();
  5. Measure what changed by comparing responses before and after each optimization. In production, token savings matter only if answer quality stays acceptable, so test retrieval depth, chunk size, and output length together.
import "dotenv/config";
import { Document, VectorStoreIndex } from "llamaindex";

async function run(topK: number) {
  const docs = [
    new Document({ text: "Short chunks reduce prompt bloat." }),
    new Document({ text: "Retrieving fewer nodes lowers token count." }),
    new Document({ text: "Summaries can replace raw long-form content." }),
  ];

  const index = await VectorStoreIndex.fromDocuments(docs);
  const qe = index.asQueryEngine({ similarityTopK: topK });

  const result = await qe.query({
    query: "What are the simplest token-saving tactics?",
  });

  console.log(`\n--- topK=${topK} ---`);
  console.log(result.toString());
}

async function main() {
  await run(1);
  await run(3);
}

main();

Testing It

Run the script with npx tsx your-file.ts and compare output length between different similarityTopK values. If top-k goes up and your answers get longer without improving accuracy, you are paying for unnecessary context.
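
To compare runs with rough numbers instead of eyeballing output length, you can approximate token counts from character counts. The four-characters-per-token ratio below is a crude heuristic for English text, not an exact figure; swap in a real tokenizer if you need precision.

// Crude heuristic: English text averages roughly 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Example: compare answers captured from two similarityTopK settings.
const answers: Record<string, string> = {
  "topK=1": "Retrieve fewer nodes and keep chunks small.",
  "topK=3":
    "Retrieve fewer nodes, keep chunks small, and summarize long documents before indexing.",
};

for (const [label, answer] of Object.entries(answers)) {
  console.log(`${label}: ~${estimateTokens(answer)} tokens`);
}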

Also inspect whether small chunks still preserve enough meaning for your domain. In insurance or banking data, overly aggressive chunking can cut apart policy clauses or transaction notes that need nearby context.

A good sanity check is to ask three versions of the same question:

  • one broad
  • one specific
  • one intentionally vague

If the system stays concise on all three, your token controls are doing their job.

Next Steps

  • Learn metadata filtering so retrieval only searches relevant products, regions, or policy types.
  • Add a reranker to improve precision without increasing retrieved context too much.
  • Track prompt and completion tokens per request so you can enforce budgets in production; a sketch of a simple budget guard follows this list.
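
As a starting point for that last item, here is a minimal per-request budget guard you could wrap around any query engine. Note that estimateTokens and guardedQuery are illustrative helpers, not llamaindex APIs, and the character-based estimate is only approximate.

// Illustrative helpers, not llamaindex APIs: enforce a per-request
// token budget using a rough ~4-characters-per-token estimate.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

async function guardedQuery(
  queryEngine: { query: (q: { query: string }) => Promise<{ toString(): string }> },
  query: string,
  budget: number,
): Promise<string> {
  const promptTokens = estimateTokens(query);
  if (promptTokens > budget) {
    throw new Error(`Query alone exceeds the budget at ~${promptTokens} tokens`);
  }

  const response = await queryEngine.query({ query });
  const answer = response.toString();
  console.log(
    `~${promptTokens} prompt + ~${estimateTokens(answer)} completion tokens (estimated)`,
  );
  return answer;
}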

By Cyprian Aarons, AI Consultant at Topiax.