LangChain Tutorial (TypeScript): optimizing token usage for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to reduce token spend in a LangChain TypeScript app without breaking output quality. You’ll build a pattern that trims input, caps retrieval, compresses context, and measures token usage so you can control cost before it hits production.

What You'll Need

  • Node.js 18+
  • A TypeScript project with ts-node or tsx
  • langchain installed
  • @langchain/openai installed
  • OpenAI API key in OPENAI_API_KEY
  • Optional but useful:
    • dotenv for local env loading
    • A basic understanding of Runnable chains and prompt templates

Step-by-Step

  1. Start with a model configuration that makes token usage visible and predictable. For optimization work, you want a model wrapper you can reuse across chains instead of scattering settings everywhere.
import "dotenv/config";
import { ChatOpenAI } from "@langchain/openai";

export const model = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
  maxTokens: 300, // hard cap on completion tokens per call, so output cost stays bounded
});

const response = await model.invoke("Summarize this in one sentence: LangChain helps compose LLM apps.");
console.log(response.content);
  2. Trim your prompt before it reaches the model. Most token waste comes from bloated system text, repeated instructions, and oversized user payloads that should have been normalized earlier.
type Input = {
  customerName: string;
  policyType: string;
  issue: string;
  notes: string[];
};

// Normalize the payload into short key=value lines and cap notes at three entries.
function buildCompactInput(input: Input) {
  return [
    `customer=${input.customerName}`,
    `policy=${input.policyType}`,
    `issue=${input.issue}`,
    `notes=${input.notes.slice(0, 3).join(" | ")}`,
  ].join("\n");
}

const compact = buildCompactInput({
  customerName: "Amina Patel",
  policyType: "Home Insurance",
  issue: "Claim status not updated",
  notes: ["Called support twice", "Uploaded documents", "Waiting for adjuster"],
});

console.log(compact);
  3. Use a tight prompt template and avoid verbose chat history when you only need the last few turns; a history-trimming sketch follows the example below. In production, the cheapest token is the one you never send.
import { ChatPromptTemplate } from "@langchain/core/prompts";

const prompt = ChatPromptTemplate.fromMessages([
  ["system", "You are a claims assistant. Answer concisely in under 80 words."],
  ["human", "{input}"],
]);

const chainInput = buildCompactInput({
  customerName: "Amina Patel",
  policyType: "Home Insurance",
  issue: "Claim status not updated",
  notes: ["Called support twice", "Uploaded documents", "Waiting for adjuster"],
});

const messages = await prompt.formatMessages({ input: chainInput });
console.log(messages.map((m) => m.content));
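If the chain also carries chat history, keep only the most recent turns instead of replaying the whole conversation. Here is a minimal sketch using a hypothetical keepLastTurns helper (LangChain's trimMessages utility is an alternative if you want token-aware trimming):

import { AIMessage, BaseMessage, HumanMessage } from "@langchain/core/messages";

// Hypothetical helper: keep only the last few messages before formatting the prompt.
function keepLastTurns(history: BaseMessage[], maxMessages = 4): BaseMessage[] {
  return history.slice(-maxMessages);
}

const history: BaseMessage[] = [
  new HumanMessage("What does my home policy cover?"),
  new AIMessage("Fire, theft, and water damage."),
  new HumanMessage("How do I file a claim?"),
  new AIMessage("Use the portal and upload supporting documents."),
  new HumanMessage("Why is my claim delayed?"),
];

console.log(keepLastTurns(history, 2).map((m) => m.content));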
  4. Cap retrieval at the source. If you use vector search or document lookup, limit both document count and chunk size so the LLM sees only the minimum evidence needed to answer well.
import { Document } from "@langchain/core/documents";

const docs = [
  new Document({ pageContent: "Claims are reviewed within 5 business days." }),
  new Document({ pageContent: "Missing documents delay claim review." }),
  new Document({ pageContent: "Customers can check status in the portal." }),
];

// Stand-in for a real retriever: in production, rank by relevance and cap the count here.
function topDocsForAnswer(_query: string, documents: Document[]) {
  return documents.slice(0, Math.min(2, documents.length)).map((d) => d.pageContent);
}

const context = topDocsForAnswer("claim status update", docs).join("\n");
console.log(context);
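In a real pipeline the cap usually lives on the retriever call itself. A rough sketch, assuming an in-memory store and OpenAI embeddings (swap in your own vector store; the point is the small k):

import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";

// Reuses the docs array defined above; k = 2 caps how many chunks reach the prompt.
const store = await MemoryVectorStore.fromDocuments(docs, new OpenAIEmbeddings());
const retrieved = await store.similaritySearch("claim status update", 2);

console.log(retrieved.map((d) => d.pageContent));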
  5. Add a compression step before final generation when retrieved context is still too large. This is where advanced teams save money: summarize raw evidence into a smaller intermediate representation, then answer from that summary.
import { StringOutputParser } from "@langchain/core/output_parsers";

const compressorPrompt = ChatPromptTemplate.fromMessages([
  ["system", "Compress the context into short factual bullets only."],
  ["human", "{context}"],
]);

// Pre-pass: shrink raw context into short bullets before the final answer call.
const compressorChain = compressorPrompt.pipe(model).pipe(new StringOutputParser());

const compressedContext = await compressorChain.invoke({
  context:
    "Claims are reviewed within 5 business days. Missing documents delay claim review. Customers can check status in the portal.",
});

console.log(compressedContext);
  6. Measure usage so you can prove the savings. If you don’t track tokens per request, you’re guessing; once you log them, you can compare prompts, models, and retrieval settings objectively.
import { RunnableLambda } from "@langchain/core/runnables";

const answerPrompt = ChatPromptTemplate.fromMessages([
  ["system", "You answer claims questions briefly and accurately."],
  ["human", "{context}\n\nQuestion: {question}"],
]);

const answerChain = answerPrompt.pipe(model);

const trackedChain = RunnableLambda.from(async (input: { context: string; question: string }) => {
  const result = await answerChain.invoke(input);
  console.log("usage_metadata:", result.response_metadata?.tokenUsage ?? result.response_metadata?.usage);
  return result.content;
});

const finalAnswer = await trackedChain.invoke({
  context:
    "Claims are reviewed within 5 business days.\nMissing documents delay claim review.\nCustomers can check status in the portal.",
  question: "Why is my claim delayed?",
});

console.log(finalAnswer);

Testing It

Run the script with a real OpenAI key and inspect both the output quality and the logged token metadata. You should see shorter prompts produce lower token counts while still answering correctly.

Test three variants back to back:

  • full raw context
  • trimmed input plus capped retrieval
  • trimmed input plus compressed context

If your answers stay stable while token counts drop, the optimization is working. If quality falls off, relax the compression slightly or preserve one more retrieved chunk instead of expanding everything again.
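One way to run that comparison as a script is to loop the variants through the tracked chain and log usage side by side. A minimal sketch, reusing the docs, topDocsForAnswer, compressorChain, and trackedChain defined earlier, with a hypothetical variants map:

// Hypothetical comparison harness: same question, three context variants.
const rawContext =
  "Claims are reviewed within 5 business days. Missing documents delay claim review. Customers can check status in the portal.";

const variants: Record<string, string> = {
  full_raw: rawContext,
  capped: topDocsForAnswer("claim status update", docs).join("\n"),
  compressed: await compressorChain.invoke({ context: rawContext }),
};

for (const [name, context] of Object.entries(variants)) {
  console.log(`--- variant: ${name} ---`);
  console.log(await trackedChain.invoke({ context, question: "Why is my claim delayed?" }));
}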

Next Steps

  • Add tiktoken or provider-side usage logging to compare prompt versions before deployment; a rough counting sketch follows this list.
  • Build a retrieval pipeline with score thresholds so low-value chunks never enter the prompt.
  • Add evaluation tests for answer quality versus token cost using recorded production traces.
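For the first item, you can approximate prompt size offline before sending anything. A sketch assuming the js-tiktoken package is installed (encodings are approximate for the newest models, so treat the numbers as relative, not exact):

import { getEncoding } from "js-tiktoken";

// cl100k_base is an approximation; newer models may use different encodings.
const enc = getEncoding("cl100k_base");

const promptA = "You are a claims assistant. Answer concisely in under 80 words.";
const promptB = "You are a claims assistant.";

console.log("prompt A tokens:", enc.encode(promptA).length);
console.log("prompt B tokens:", enc.encode(promptB).length);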

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
