LlamaIndex Tutorial (TypeScript): optimizing token usage for intermediate developers
This tutorial shows you how to reduce token usage in a LlamaIndex TypeScript pipeline without breaking retrieval quality. You’ll build a small RAG flow that trims context, caps chunk size, and avoids sending unnecessary text to the model.
What You'll Need
- Node.js 18+ and npm
- A TypeScript project with "type": "module" or a TS build setup
- llamaindex installed
- An OpenAI API key set in OPENAI_API_KEY
- A few local text files to index, or any small corpus you want to test against
Install the package:
npm install llamaindex
Step-by-Step
- Start by loading only the documents you actually need. Token waste usually begins at ingestion, so keep your source set small and explicit instead of pointing the loader at an entire directory full of irrelevant files.
import { SimpleDirectoryReader } from "llamaindex";

async function loadDocs() {
  const reader = new SimpleDirectoryReader();
  const docs = await reader.loadData({
    directoryPath: "./data",
    requiredExts: [".txt"],
  });
  console.log(`Loaded ${docs.length} documents`);
  return docs;
}

loadDocs();
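If your corpus is just a handful of known files, you can skip the directory scan entirely and construct Document objects yourself. This is a minimal sketch (the file paths are placeholders); it keeps ingestion explicit, so you know exactly what text will later be embedded:
import { readFile } from "node:fs/promises";
import { Document } from "llamaindex";

// Placeholder file list -- replace with your own small, explicit corpus.
const sourceFiles = ["./data/refund-policy.txt", "./data/cancellation-policy.txt"];

async function loadExplicitDocs() {
  const docs: Document[] = [];
  for (const path of sourceFiles) {
    const text = await readFile(path, "utf8");
    // Keep the source path as metadata so answers can be traced back to files.
    docs.push(new Document({ text, metadata: { source: path } }));
  }
  console.log(`Loaded ${docs.length} documents`);
  return docs;
}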
- Next, split documents into smaller chunks. Smaller chunks mean fewer irrelevant tokens get pulled into retrieval, which matters when your queries are narrow and your source docs are long.
import { Document, SentenceSplitter } from "llamaindex";

const splitter = new SentenceSplitter({
  chunkSize: 256,
  chunkOverlap: 32,
});

async function chunkText(text: string) {
  // Wrap the raw string in a Document so the splitter receives a proper node type.
  const nodes = await splitter.getNodesFromDocuments([new Document({ text })]);
  console.log(`Created ${nodes.length} nodes`);
  return nodes;
}
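To see what a given chunkSize actually does to your corpus, split the same text at two sizes and compare node counts and average node length. A small diagnostic sketch; the sizes are only examples:
import { Document, SentenceSplitter } from "llamaindex";

async function compareChunkSizes(text: string) {
  for (const chunkSize of [256, 512]) {
    const sizedSplitter = new SentenceSplitter({ chunkSize, chunkOverlap: 32 });
    const nodes = await sizedSplitter.getNodesFromDocuments([new Document({ text })]);
    // Average characters per node is a quick proxy for how much each retrieved
    // chunk will contribute to the prompt.
    const avgChars =
      nodes.reduce((sum, n) => sum + n.getText().length, 0) / nodes.length;
    console.log(
      `chunkSize=${chunkSize}: ${nodes.length} nodes, ~${Math.round(avgChars)} chars each`,
    );
  }
}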
- Build the index with a compact embedding model and keep your retriever top-k low. For token optimization, the main rule is simple: retrieve fewer chunks, but make them better chunks.
import {
  VectorStoreIndex,
  Settings,
  OpenAIEmbedding,
} from "llamaindex";

Settings.embedModel = new OpenAIEmbedding({
  model: "text-embedding-3-small",
});

async function buildIndex(docs: any[]) {
  const index = await VectorStoreIndex.fromDocuments(docs);
  return index;
}
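fromDocuments splits documents using the global settings before embedding them, so if you want the index built from the 256-token chunks above, point the global node parser at your splitter before building. A minimal sketch; treat Settings.nodeParser as an assumption about your llamaindex version and confirm it exists in yours:
import { Settings, SentenceSplitter } from "llamaindex";

// Assumption: Settings exposes a nodeParser setter in your llamaindex version.
Settings.nodeParser = new SentenceSplitter({
  chunkSize: 256,
  chunkOverlap: 32,
});
// Subsequent VectorStoreIndex.fromDocuments(docs) calls now split documents
// with this parser before embedding.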
- Use a response synthesizer that does not stuff too much context into the prompt. In practice, this means setting a low similarityTopK and using a compact prompt template so the model sees only what it needs.
import { OpenAI, Settings } from "llamaindex";

async function queryIndex(index: VectorStoreIndex) {
  Settings.llm = new OpenAI({
    model: "gpt-4o-mini",
    temperature: 0,
  });
  const queryEngine = index.asQueryEngine({
    similarityTopK: 2,
    responseMode: "compact",
  });
  const response = await queryEngine.query({
    query: "Summarize the refund policy in two bullets.",
  });
  console.log(response.toString());
}
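Wired together end to end, the pipeline looks something like this (the function names come from the snippets above, and this replaces the standalone loadDocs() call from the first snippet):
async function main() {
  const docs = await loadDocs();        // small, explicit corpus
  const index = await buildIndex(docs); // compact embedding model
  await queryIndex(index);              // low top-k, compact response
}

main().catch(console.error);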
- Add a rerank or filtering layer only when retrieval quality needs it. Don’t send more chunks to the LLM just because retrieval is noisy; fix retrieval first, then expand only if the answer quality demands it.
import { FilterOperator } from "llamaindex";

async function filteredQuery(index: VectorStoreIndex) {
  const queryEngine = index.asQueryEngine({
    similarityTopK: 2,
    preFilters: {
      filters: [
        {
          key: "department",
          value: "claims",
          operator: FilterOperator.EQ,
        },
      ],
    },
    responseMode: "compact",
  });
  const response = await queryEngine.query({
    query: "What is the escalation path?",
  });
  console.log(response.toString());
}
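A metadata filter can only match metadata you actually stored, so attach it at ingestion. A minimal sketch; the department value simply mirrors the filter above:
import { Document } from "llamaindex";

function tagDocument(text: string, department: string) {
  // The filter in filteredQuery() matches against this metadata key.
  return new Document({ text, metadata: { department } });
}

// Example: tagDocument(claimsPolicyText, "claims");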
- If you need even tighter control, inspect retrieved nodes before generation. This lets you verify whether token bloat is coming from bad chunking, too many retrieved nodes, or verbose source text.
async function inspectRetrieval(index: VectorStoreIndex) {
  const retriever = index.asRetriever({ similarityTopK: 2 });
  const nodes = await retriever.retrieve("Explain the cancellation policy.");
  for (const node of nodes) {
    console.log("Score:", node.score);
    console.log("Text:", node.node.getText().slice(0, 200));
    console.log("---");
  }
}
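To put a rough number on how much retrieved text will reach the model, you can sum the retrieved node lengths and apply a characters-per-token heuristic (about 4 characters per token for English text; this is an approximation, not an exact count):
async function estimateRetrievedTokens(index: VectorStoreIndex, query: string) {
  const retriever = index.asRetriever({ similarityTopK: 2 });
  const nodes = await retriever.retrieve(query);
  const chars = nodes.reduce((sum, n) => sum + n.node.getText().length, 0);
  const approxTokens = Math.round(chars / 4); // rough heuristic, not a real tokenizer
  console.log(`${nodes.length} nodes, ~${approxTokens} tokens of retrieved context`);
  return approxTokens;
}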
Testing It
Run one query with similarityTopK set to 2, then run it again with 5 and compare output length and latency. If your answers stay accurate while latency drops and prompts stay shorter, your token usage is moving in the right direction.
A practical check is to print retrieved node text before generation and confirm you are not pulling in unrelated paragraphs. If answers get worse after lowering chunk size, increase chunkSize slightly before increasing top-k.
You can also log your model usage in your provider dashboard and compare prompt tokens across runs. The goal is not just lower token counts; it’s lower token counts with stable answer quality.
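Here is one way to run the comparison described above, reusing the index built earlier. Output length is only a proxy, so check the answers by eye too:
async function compareTopK(index: VectorStoreIndex, query: string) {
  for (const similarityTopK of [2, 5]) {
    const engine = index.asQueryEngine({ similarityTopK });
    const start = Date.now();
    const response = await engine.query({ query });
    const elapsed = Date.now() - start;
    console.log(
      `topK=${similarityTopK}: ${response.toString().length} chars, ${elapsed} ms`,
    );
  }
}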
Next Steps
- Add metadata-based routing so different document types use different chunk sizes and retriever settings
- Learn hybrid retrieval with keyword + vector search for better precision on short queries
- Add a reranker only after you’ve exhausted chunking, filtering, and top-k tuning
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.