LlamaIndex Tutorial (TypeScript): optimizing token usage for advanced developers
This tutorial shows how to reduce token spend in a TypeScript LlamaIndex pipeline without breaking retrieval quality. You’ll build a setup that trims context, controls chunking, and avoids sending unnecessary text to the model.
What You'll Need
- Node.js 18+
- A TypeScript project with ts-node or tsx
- Packages:
  - llamaindex
  - dotenv
- An OpenAI API key in OPENAI_API_KEY
- A small set of source documents in plain text or markdown
- Basic familiarity with LlamaIndex concepts like documents, nodes, retrievers, and query engines
Step-by-Step
- Start by installing the dependencies and setting up environment variables. The main token savings come from reducing what gets indexed and what gets sent into the final prompt.

npm install llamaindex dotenv
npm install -D typescript tsx @types/node

Then put your key in a .env file at the project root:

OPENAI_API_KEY=your_openai_key_here
- Load only the documents you actually need, and keep them clean before indexing. If your source files contain headers, boilerplate, or repeated legal text, strip that before it ever becomes context.
import "dotenv/config";
import { Document } from "llamaindex";
import fs from "node:fs/promises";
async function loadDocs() {
const raw = await fs.readFile("./data/policy.md", "utf8");
const cleaned = raw
.replace(/^\s*#{1,6}\s+/gm, "")
.replace(/\n{3,}/g, "\n\n")
.trim();
return [
new Document({
text: cleaned,
metadata: { source: "policy.md" },
}),
];
}
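If your corpus is more than one file, the same cleaning step extends naturally to a directory scan. The sketch below assumes your files live in ./data and share a repeated legal footer you want to drop; the footer regex is a made-up placeholder, so swap in whatever boilerplate your documents actually repeat.

async function loadDocsFromDir(dir = "./data") {
  const files = await fs.readdir(dir);
  const docs: Document[] = [];

  for (const file of files) {
    if (!file.endsWith(".md") && !file.endsWith(".txt")) continue;
    const raw = await fs.readFile(`${dir}/${file}`, "utf8");

    const cleaned = raw
      // Hypothetical repeated legal footer; replace with your own boilerplate pattern.
      .replace(/^This document is provided for informational purposes.*$/gim, "")
      .replace(/^\s*#{1,6}\s+/gm, "")
      .replace(/\n{3,}/g, "\n\n")
      .trim();

    if (cleaned.length > 0) {
      docs.push(new Document({ text: cleaned, metadata: { source: file } }));
    }
  }

  return docs;
}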
- Build an index with smaller chunks so each retrieval pulls less text. For token efficiency, you want chunks that are large enough to preserve meaning but small enough to avoid bloated prompts.
import {
  VectorStoreIndex,
  SentenceSplitter,
} from "llamaindex";

async function buildIndex() {
  const docs = await loadDocs();

  // Smaller chunks keep each retrieved node cheap; the overlap preserves continuity.
  const splitter = new SentenceSplitter({
    chunkSize: 300,
    chunkOverlap: 40,
  });

  return await VectorStoreIndex.fromDocuments(docs, {
    transformations: [splitter],
  });
}
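Before committing to a chunk size, it helps to see how many chunks it actually produces for your documents. This quick sketch assumes SentenceSplitter exposes a splitText method, as it does in recent LlamaIndex.TS releases; if your version differs, the node counts from the index build tell you the same thing.

async function inspectChunking() {
  const [doc] = await loadDocs();
  const splitter = new SentenceSplitter({ chunkSize: 300, chunkOverlap: 40 });

  // Eyeball how many nodes this chunkSize produces before indexing.
  const chunks = splitter.splitText(doc.getText());
  console.log(`chunks: ${chunks.length}`);
  console.log(`avg chars per chunk: ${Math.round(doc.getText().length / chunks.length)}`);
}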
- Add a retriever that limits how many chunks get sent downstream. This is one of the biggest levers for token control: if you retrieve eight chunks when two would do, your prompt cost will jump immediately.
import { SimilarityPostprocessor } from "llamaindex";

async function makeRetriever() {
  const index = await buildIndex();

  return index.asRetriever({
    similarityTopK: 2,
    nodePostprocessors: [
      new SimilarityPostprocessor({
        similarityCutoff: 0.78,
      }),
    ],
  });
}
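One side effect of a 0.78 cutoff is that a loosely worded question can occasionally filter out every node. A defensive retry with the cutoff removed is one way to handle that; this is a hand-rolled sketch rather than a built-in LlamaIndex mechanism.

async function retrieveWithFallback(question: string) {
  const index = await buildIndex();

  // First pass: tight topK plus the similarity cutoff.
  const strict = index.asRetriever({
    similarityTopK: 2,
    nodePostprocessors: [
      new SimilarityPostprocessor({ similarityCutoff: 0.78 }),
    ],
  });
  const nodes = await strict.retrieve(question);
  if (nodes.length > 0) return nodes;

  // Nothing survived the cutoff: retry without it rather than sending
  // the model an empty context.
  const relaxed = index.asRetriever({ similarityTopK: 2 });
  return relaxed.retrieve(question);
}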
- Create a compact query engine with a tight response mode. Use compact when you want LlamaIndex to pack retrieved context efficiently instead of dumping everything into a long synthesis prompt.
import { RetrieverQueryEngine } from "llamaindex";

async function ask(question: string) {
  const retriever = await makeRetriever();
  const engine = RetrieverQueryEngine.fromArgs({
    retriever,
    responseMode: "compact",
  });

  const response = await engine.query({ query: question });
  console.log(response.toString());
}

ask("What is the policy's refund window?");
- If you need even tighter control, inspect retrieved nodes before querying. In production systems, I often drop low-value chunks or summarize them before they hit the final prompt.
async function debugRetrieval(question: string) {
  const retriever = await makeRetriever();
  const nodes = await retriever.retrieve(question);

  for (const node of nodes) {
    console.log("SCORE:", node.score);
    console.log("TEXT:", node.node.getText().slice(0, 180));
    console.log("---");
  }
}

debugRetrieval("What is the policy's refund window?");
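If you want to act on that inspection rather than just log it, one option is to assemble the context yourself from the retrieved nodes. This is a hand-rolled sketch, not a LlamaIndex postprocessor; the score threshold and per-chunk character cap are arbitrary starting points to tune against your own documents.

async function trimmedContext(question: string): Promise<string> {
  const retriever = await makeRetriever();
  const nodes = await retriever.retrieve(question);

  return nodes
    // Drop anything the embedding model was only lukewarm about.
    .filter((n) => (n.score ?? 0) >= 0.8)
    // Cap each chunk so one long node cannot dominate the prompt.
    .map((n) => n.node.getText().slice(0, 600))
    .join("\n\n");
}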
Testing It
Run the script against a question that should match one specific section of your document. If the answer comes back correct while only retrieving one or two chunks, your token footprint is under control.
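If everything above lives in a single file, running it is just a matter of invoking tsx; the file name here is only a placeholder for wherever you put the code.

npx tsx src/token-optimized-rag.ts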
Check the debug output and confirm low-scoring nodes are being filtered out by the cutoff. If you still see too much text in the response path, lower similarityTopK, reduce chunkSize, or make your source docs cleaner.
Watch for answers that become vague after aggressive trimming. That usually means you cut too hard on chunk size or retrieval depth; increase chunkSize slightly before raising topK.
If you want real numbers, measure input tokens at the model layer and compare runs with different settings. The pattern you’re looking for is stable answer quality with fewer retrieved tokens per query.
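You don't need a tokenizer dependency to get useful comparisons; a rough characters-per-token heuristic over the retrieved text is enough to rank settings against each other. The sketch below assumes roughly four characters per token for English text, which is an approximation rather than an exact count.

// Compare retriever settings by approximate retrieved tokens per query.
function approxTokens(text: string): number {
  // ~4 characters per token is a rough heuristic, not a real tokenizer.
  return Math.ceil(text.length / 4);
}

async function compareSettings(question: string) {
  const index = await buildIndex();

  for (const topK of [2, 4, 8]) {
    const retriever = index.asRetriever({ similarityTopK: topK });
    const nodes = await retriever.retrieve(question);
    const tokens = nodes.reduce((sum, n) => sum + approxTokens(n.node.getText()), 0);
    console.log(`topK=${topK}: ~${tokens} retrieved tokens across ${nodes.length} chunks`);
  }
}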
Next Steps
- Add metadata filters so retrieval only searches the right document class or business unit
- Replace raw retrieval with a reranking step for better precision at lower topK
- Add caching for repeated queries and shared retrieval results (a minimal sketch follows below)
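Caching is the easiest of the three to sketch without new dependencies. The example below is a plain in-memory Map keyed on the normalized question, so it only helps within a single process; a shared cache such as Redis would follow the same shape.

const answerCache = new Map<string, string>();

async function askCached(question: string): Promise<string> {
  const key = question.trim().toLowerCase();
  const hit = answerCache.get(key);
  if (hit !== undefined) return hit; // no retrieval, no model call, zero new tokens

  const retriever = await makeRetriever();
  const engine = RetrieverQueryEngine.fromArgs({
    retriever,
    responseMode: "compact",
  });
  const response = await engine.query({ query: question });
  const answer = response.toString();

  answerCache.set(key, answer);
  return answer;
}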
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.