LlamaIndex Tutorial (TypeScript): optimizing token usage for advanced developers
This tutorial shows how to reduce token spend in a TypeScript LlamaIndex pipeline without breaking retrieval quality. You’ll build a setup that trims context, controls chunking, and avoids sending unnecessary text to the model.
What You'll Need
- Node.js 18+
- A TypeScript project with ts-node or tsx
- Packages:
  - llamaindex
  - dotenv
- An OpenAI API key in OPENAI_API_KEY
- A small set of source documents in plain text or markdown
- Basic familiarity with LlamaIndex concepts like documents, nodes, retrievers, and query engines
Step-by-Step
- Start by installing the dependencies and setting up environment variables. The main token savings come from reducing what gets indexed and what gets sent into the final prompt.

npm install llamaindex dotenv
npm install -D typescript tsx @types/node

Then put your key in a .env file at the project root:

OPENAI_API_KEY=your_openai_key_here
- Load only the documents you actually need, and keep them clean before indexing. If your source files contain headers, boilerplate, or repeated legal text, strip that before it ever becomes context.
import "dotenv/config";
import { Document } from "llamaindex";
import fs from "node:fs/promises";
async function loadDocs() {
const raw = await fs.readFile("./data/policy.md", "utf8");
const cleaned = raw
.replace(/^\s*#{1,6}\s+/gm, "")
.replace(/\n{3,}/g, "\n\n")
.trim();
return [
new Document({
text: cleaned,
metadata: { source: "policy.md" },
}),
];
}
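If your corpus is more than one file, the same cleaning step extends naturally to a directory scan. The sketch below assumes your files live in ./data and share a repeated legal footer you want to drop; the footer regex is a made-up placeholder, so swap in whatever boilerplate your documents actually repeat.

async function loadDocsFromDir(dir = "./data") {
  const files = await fs.readdir(dir);
  const docs: Document[] = [];

  for (const file of files) {
    if (!file.endsWith(".md") && !file.endsWith(".txt")) continue;
    const raw = await fs.readFile(`${dir}/${file}`, "utf8");

    const cleaned = raw
      // Hypothetical repeated legal footer; replace with your own boilerplate pattern.
      .replace(/^This document is provided for informational purposes.*$/gim, "")
      .replace(/^\s*#{1,6}\s+/gm, "")
      .replace(/\n{3,}/g, "\n\n")
      .trim();

    if (cleaned.length > 0) {
      docs.push(new Document({ text: cleaned, metadata: { source: file } }));
    }
  }

  return docs;
}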
- Build an index with smaller chunks so each retrieval pulls less text. For token efficiency, you want chunks that are large enough to preserve meaning but small enough to avoid bloated prompts.
import {
  VectorStoreIndex,
  SentenceSplitter,
} from "llamaindex";

async function buildIndex() {
  const docs = await loadDocs();

  // Smaller chunks keep each retrieved node cheap; the overlap preserves continuity.
  const splitter = new SentenceSplitter({
    chunkSize: 300,
    chunkOverlap: 40,
  });

  return await VectorStoreIndex.fromDocuments(docs, {
    transformations: [splitter],
  });
}
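Before committing to a chunk size, it helps to see how many chunks it actually produces for your documents. This quick sketch assumes SentenceSplitter exposes a splitText method, as it does in recent LlamaIndex.TS releases; if your version differs, the node counts from the index build tell you the same thing.

async function inspectChunking() {
  const [doc] = await loadDocs();
  const splitter = new SentenceSplitter({ chunkSize: 300, chunkOverlap: 40 });

  // Eyeball how many nodes this chunkSize produces before indexing.
  const chunks = splitter.splitText(doc.getText());
  console.log(`chunks: ${chunks.length}`);
  console.log(`avg chars per chunk: ${Math.round(doc.getText().length / chunks.length)}`);
}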
- Add a retriever that limits how many chunks get sent downstream. This is one of the biggest levers for token control: if you retrieve eight chunks when two would do, your prompt cost will jump immediately.
import { SimilarityPostprocessor } from "llamaindex";

async function makeRetriever() {
  const index = await buildIndex();

  return index.asRetriever({
    similarityTopK: 2,
    nodePostprocessors: [
      new SimilarityPostprocessor({
        similarityCutoff: 0.78,
      }),
    ],
  });
}
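One side effect of a 0.78 cutoff is that a loosely worded question can occasionally filter out every node. A defensive retry with the cutoff removed is one way to handle that; this is a hand-rolled sketch rather than a built-in LlamaIndex mechanism.

async function retrieveWithFallback(question: string) {
  const index = await buildIndex();

  // First pass: tight topK plus the similarity cutoff.
  const strict = index.asRetriever({
    similarityTopK: 2,
    nodePostprocessors: [
      new SimilarityPostprocessor({ similarityCutoff: 0.78 }),
    ],
  });
  const nodes = await strict.retrieve(question);
  if (nodes.length > 0) return nodes;

  // Nothing survived the cutoff: retry without it rather than sending
  // the model an empty context.
  const relaxed = index.asRetriever({ similarityTopK: 2 });
  return relaxed.retrieve(question);
}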
- Create a compact query engine with a tight response mode. Use compact when you want LlamaIndex to pack retrieved context efficiently instead of dumping everything into a long synthesis prompt.
import { RetrieverQueryEngine } from "llamaindex";

async function ask(question: string) {
  const retriever = await makeRetriever();
  const engine = RetrieverQueryEngine.fromArgs({
    retriever,
    responseMode: "compact",
  });

  const response = await engine.query({ query: question });
  console.log(response.toString());
}

ask("What is the policy's refund window?");
- If you need even tighter control, inspect retrieved nodes before querying. In production systems, I often drop low-value chunks or summarize them before they hit the final prompt.
async function debugRetrieval(question: string) {
  const retriever = await makeRetriever();
  const nodes = await retriever.retrieve(question);

  for (const node of nodes) {
    console.log("SCORE:", node.score);
    console.log("TEXT:", node.node.getText().slice(0, 180));
    console.log("---");
  }
}

debugRetrieval("What is the policy's refund window?");
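If you want to act on that inspection rather than just log it, one option is to assemble the context yourself from the retrieved nodes. This is a hand-rolled sketch, not a LlamaIndex postprocessor; the score threshold and per-chunk character cap are arbitrary starting points to tune against your own documents.

async function trimmedContext(question: string): Promise<string> {
  const retriever = await makeRetriever();
  const nodes = await retriever.retrieve(question);

  return nodes
    // Drop anything the embedding model was only lukewarm about.
    .filter((n) => (n.score ?? 0) >= 0.8)
    // Cap each chunk so one long node cannot dominate the prompt.
    .map((n) => n.node.getText().slice(0, 600))
    .join("\n\n");
}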
Testing It
Run the script against a question that should match one specific section of your document. If the answer comes back correct while only retrieving one or two chunks, your token footprint is under control.
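If everything above lives in a single file, running it is just a matter of invoking tsx; the file name here is only a placeholder for wherever you put the code.

npx tsx src/token-optimized-rag.ts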
Check the debug output and confirm low-scoring nodes are being filtered out by the cutoff. If you still see too much text in the response path, lower similarityTopK, reduce chunkSize, or make your source docs cleaner.
Watch for answers that become vague after aggressive trimming. That usually means you cut too hard on chunk size or retrieval depth; increase chunkSize slightly before raising topK.
If you want real numbers, measure input tokens at the model layer and compare runs with different settings. The pattern you’re looking for is stable answer quality with fewer retrieved tokens per query.
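You don't need a tokenizer dependency to get useful comparisons; a rough characters-per-token heuristic over the retrieved text is enough to rank settings against each other. The sketch below assumes roughly four characters per token for English text, which is an approximation rather than an exact count.

// Compare retriever settings by approximate retrieved tokens per query.
function approxTokens(text: string): number {
  // ~4 characters per token is a rough heuristic, not a real tokenizer.
  return Math.ceil(text.length / 4);
}

async function compareSettings(question: string) {
  const index = await buildIndex();

  for (const topK of [2, 4, 8]) {
    const retriever = index.asRetriever({ similarityTopK: topK });
    const nodes = await retriever.retrieve(question);
    const tokens = nodes.reduce((sum, n) => sum + approxTokens(n.node.getText()), 0);
    console.log(`topK=${topK}: ~${tokens} retrieved tokens across ${nodes.length} chunks`);
  }
}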
Next Steps
- Add metadata filters so retrieval only searches the right document class or business unit
- Replace raw retrieval with a reranking step for better precision at lower topK
- Add caching for repeated queries and shared retrieval results (a minimal sketch follows below)
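Caching is the easiest of the three to sketch without new dependencies. The example below is a plain in-memory Map keyed on the normalized question, so it only helps within a single process; a shared cache such as Redis would follow the same shape.

const answerCache = new Map<string, string>();

async function askCached(question: string): Promise<string> {
  const key = question.trim().toLowerCase();
  const hit = answerCache.get(key);
  if (hit !== undefined) return hit; // no retrieval, no model call, zero new tokens

  const retriever = await makeRetriever();
  const engine = RetrieverQueryEngine.fromArgs({
    retriever,
    responseMode: "compact",
  });
  const response = await engine.query({ query: question });
  const answer = response.toString();

  answerCache.set(key, answer);
  return answer;
}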
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.