LangChain Tutorial (TypeScript): optimizing token usage for intermediate developers
This tutorial shows you how to reduce token spend in a LangChain TypeScript app without breaking answer quality. You’ll build a small prompt pipeline that trims history, compresses context, and keeps your model calls within budget.
What You'll Need
- Node.js 18+
- A TypeScript project with ts-node or a build step
- langchain
- @langchain/openai
- An OpenAI API key in OPENAI_API_KEY
- Basic familiarity with LangChain Runnable chains and message history
Step-by-Step
1. Start by installing the packages and wiring a chat model with conservative defaults. The main cost driver is usually prompt size, so we’ll keep the model config explicit from the start.
npm install langchain @langchain/core @langchain/openai
import { ChatOpenAI } from "@langchain/openai";
const model = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0, // deterministic answers, no creative padding
  maxTokens: 300, // hard cap on output tokens per call
});
console.log("Model ready:", model.model);
2. Build a prompt that keeps instructions short and pushes the model to answer with only what’s needed. In production, vague prompts waste tokens because the model fills in gaps you didn’t ask for.
import { ChatPromptTemplate } from "@langchain/core/prompts";
const prompt = ChatPromptTemplate.fromMessages([
["system", "You are a support assistant. Answer concisely in 5 bullets max."],
["human", "{question}"],
]);
const formatted = await prompt.formatMessages({
question: "How do I reset my password?",
});
console.log(formatted.map((m) => `${m._getType()}: ${m.content}`));
3. Add token-aware trimming for conversation history. This is the biggest win for chat apps because old turns accumulate fast and often add little value.
import { AIMessage, HumanMessage, trimMessages } from "@langchain/core/messages";
const history = [
new HumanMessage("Hi, I need help with my policy."),
new AIMessage("Sure, what do you need?"),
new HumanMessage("I want to update my address."),
new AIMessage("You can do that in settings."),
];
const trimmedHistory = await trimMessages(history, {
  maxTokens: 50,
  // Rough word-count stand-in for a real tokenizer; see Next Steps.
  tokenCounter: async (messages) =>
    messages.reduce((sum, msg) => sum + String(msg.content).split(/\s+/).length, 0),
  strategy: "last", // keep the most recent turns, drop the oldest
});
console.log(trimmedHistory.map((m) => m.content));
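If your history also carries a system message, trimMessages can be told to always keep it and to start the kept window on a human turn. A minimal sketch; the includeSystem and startOn options below follow the current @langchain/core API, so verify them against your installed version:
import { SystemMessage } from "@langchain/core/messages";
// Keep the system message through trimming and start the kept window on
// a human turn, so the model never sees an orphaned assistant reply.
const safeTrimmed = await trimMessages(
  [new SystemMessage("You are a support assistant."), ...history],
  {
    maxTokens: 50,
    tokenCounter: (messages) =>
      messages.reduce((sum, msg) => sum + String(msg.content).split(/\s+/).length, 0),
    strategy: "last",
    includeSystem: true,
    startOn: "human",
  }
);
console.log(safeTrimmed.map((m) => m._getType()));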
4. Compress long retrieved context before it hits the final answer call. If you stuff raw documents into every prompt, your token bill will grow even when only one paragraph matters.
import { Document } from "@langchain/core/documents";
// Naive compression: keep only the first 300 characters of each doc.
function compressDocs(docs: Document[]) {
  return docs
    .map((doc) => doc.pageContent.slice(0, 300))
    .join("\n\n---\n\n");
}
const docs = [
new Document({
pageContent:
"Policy A covers accidental damage, theft, and fire. It excludes wear and tear, fraud, and intentional damage.",
}),
new Document({
pageContent:
"Policy B includes medical coverage abroad up to $50,000 and emergency evacuation.",
}),
];
const compressedContext = compressDocs(docs);
console.log(compressedContext);
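Truncating every document at 300 characters is deliberately blunt. A slightly smarter variant, sketched below with a hypothetical topRelevant helper (not part of LangChain), scores documents by keyword overlap with the question and keeps only the best match before truncating:
// Hypothetical helper: rank docs by keyword overlap with the question
// and keep only the top k before the same truncation step.
function topRelevant(docs: Document[], question: string, k = 1): Document[] {
  const keywords = question
    .toLowerCase()
    .split(/\W+/)
    .filter((w) => w.length > 3);
  return docs
    .map((doc) => ({
      doc,
      score: keywords.filter((kw) =>
        doc.pageContent.toLowerCase().includes(kw)
      ).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ doc }) => doc);
}
const relevant = topRelevant(docs, "Does Policy A cover accidental damage?");
console.log(compressDocs(relevant)); // Policy A scores highest and is kept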
5. Put it together in one runnable chain that trims history, uses short context, and limits output length. This pattern is what you want when building bank or insurance assistants where every extra token has a cost.
import { RunnablePassthrough } from "@langchain/core/runnables";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { MessagesPlaceholder } from "@langchain/core/prompts";
// The step 2 prompt only has a {question} slot, so build one that also
// receives the compressed context and the trimmed history.
const answerPrompt = ChatPromptTemplate.fromMessages([
  ["system", "You are a support assistant. Answer concisely in 5 bullets max.\nContext:\n{context}"],
  new MessagesPlaceholder("history"),
  ["human", "{question}"],
]);
const chain = RunnablePassthrough.assign({
  context: async () => compressedContext,
  history: async () => trimmedHistory,
})
  .pipe(answerPrompt)
  .pipe(model)
  .pipe(new StringOutputParser());
const question = "Does Policy A cover accidental damage?";
const answer = await chain.invoke({ question });
console.log(answer);
6. Add basic observability so you can see whether your changes actually reduced usage. If you don’t measure input and output sizes before and after each change, you’re guessing.
function estimateTokens(text: string) {
  // Rough heuristic: English text averages about 0.75 words per token.
  return Math.ceil(text.split(/\s+/).length / 0.75);
}
const promptText = `Question: ${question}\nContext: ${compressedContext}`;
console.log({
estimatedInputTokens: estimateTokens(promptText),
estimatedOutputBudget: 300,
});
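Word-count estimates are fine for spotting trends, but the provider reports exact counts with each response. On recent versions of @langchain/core the returned AIMessage carries a usage_metadata field (an assumption; check your installed version):
// Exact token counts as reported by OpenAI, attached to the response
// message in recent @langchain/core releases.
const response = await model.invoke(formatted);
console.log({
  inputTokens: response.usage_metadata?.input_tokens,
  outputTokens: response.usage_metadata?.output_tokens,
  totalTokens: response.usage_metadata?.total_tokens,
});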
Testing It
Run the script against a real OpenAI key and compare the size of your prompts before and after trimming history or compressing documents. The output should still answer correctly while using fewer words in the prompt payload.
If you want a quick sanity check, print the formatted messages and verify that old chat turns are removed once they stop being useful. Also confirm that your final response stays within the maxTokens limit you set on the model.
For a more realistic test, feed in a long conversation plus several retrieved documents and watch how much shorter the final input becomes after trimming. In production, this usually translates into lower latency too.
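To make that comparison concrete, you can log the estimates for the raw and optimized payloads side by side, reusing the pieces defined above. A minimal sketch:
// Estimated sizes before and after trimming and compression.
const rawContext = docs.map((d) => d.pageContent).join("\n\n");
const rawHistory = history.map((m) => String(m.content)).join("\n");
const trimmedText = trimmedHistory.map((m) => String(m.content)).join("\n");
console.log({
  contextBefore: estimateTokens(rawContext),
  contextAfter: estimateTokens(compressedContext),
  historyBefore: estimateTokens(rawHistory),
  historyAfter: estimateTokens(trimmedText),
});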
Next Steps
- Add proper token counting with a provider-specific tokenizer instead of word estimates (see the sketch after this list)
- Move from manual compression to retrieval reranking before generation
- Store conversation summaries in memory instead of raw message history
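For the first item, here is a minimal sketch using the js-tiktoken package (an assumption; any tokenizer that matches your model works). The cl100k_base encoding is shown as an example; pick the one that corresponds to your model:
import { getEncoding } from "js-tiktoken";
// Count tokens the way the provider does instead of estimating by words.
// cl100k_base is an example; use the encoding that matches your model.
const enc = getEncoding("cl100k_base");
function countTokens(text: string): number {
  return enc.encode(text).length;
}
console.log(countTokens("How do I reset my password?"));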
Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit, a PDF checklist plus starter code
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.