LangChain Tutorial (Python): Optimizing Token Usage for Beginners
This tutorial shows you how to reduce token usage in a LangChain Python app by trimming prompts, limiting retrieved context, and controlling output size. You need this when your chain works but bills are creeping up, latency is high, or you’re sending too much irrelevant text to the model.
What You'll Need
- Python 3.10+
- An OpenAI API key
- `langchain`
- `langchain-openai`
- `langchain-community`
- `tiktoken`
- Basic familiarity with `ChatPromptTemplate`, `Runnable`, and vector retrieval
Install the packages:
```shell
pip install langchain langchain-openai langchain-community tiktoken
```
Set your API key:
```shell
export OPENAI_API_KEY="your-key-here"
```
Step-by-Step
- Start by measuring token usage before optimizing anything. If you don’t measure, you’ll guess wrong and over-optimize the wrong part of the chain.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "Explain token optimization in LangChain in simple terms."),
])

messages = prompt.format_messages()
response = llm.invoke(messages)
print(response.content)
print("Usage:", response.response_metadata.get("token_usage"))
```
- Cut prompt bloat by removing repeated instructions and unnecessary context. Beginners often stuff the same policy text into every call; that burns tokens fast and adds noise.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer concisely. Use bullet points when helpful."),
    ("human", "{question}"),
])

question = "What are three ways to reduce token usage in LangChain?"
result = llm.invoke(prompt.format_messages(question=question))
print(result.content)
```
- Limit output length explicitly. A lot of token waste comes from models generating long answers when you only need a short response or a structured result.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, max_tokens=120)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are concise."),
    ("human", "Summarize these three ideas: prompt trimming, retrieval filtering, output limits."),
])

response = llm.invoke(prompt.format_messages())
print(response.content)
print("Usage:", response.response_metadata.get("token_usage"))
```
- When using retrieval, fetch fewer documents and keep them smaller. The fastest way to waste tokens is to dump full documents into the prompt when only a few chunks are relevant.

```python
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import FakeEmbeddings
from langchain_core.documents import Document

docs = [
    Document(page_content="Token usage drops when prompts are shorter."),
    Document(page_content="Use top_k=2 instead of top_k=5 for tighter retrieval."),
    Document(page_content="Summarize long chunks before passing them to the model."),
]

vectorstore = FAISS.from_documents(docs, FakeEmbeddings(size=8))
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

results = retriever.invoke("How do I reduce token usage?")
for doc in results:
    print(doc.page_content)
```
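`k` controls how many chunks come back, but each chunk can still be long. One simple guard is to truncate each chunk before it reaches the prompt. A sketch as a plain string helper (`cap_chunks` is my own name, not a LangChain API; apply it to each `doc.page_content`):

```python
def cap_chunks(chunks, max_chars=200):
    """Truncate each retrieved chunk so no single document dominates the prompt."""
    capped = []
    for text in chunks:
        if len(text) > max_chars:
            # Cut at the last whole word inside the budget
            text = text[:max_chars].rsplit(" ", 1)[0] + " ..."
        capped.append(text)
    return "\n".join(capped)

# One short chunk stays intact; one long chunk gets truncated
context = cap_chunks(["short chunk", "word " * 100], max_chars=60)
print(context)
```

A per-chunk character cap is cruder than summarizing, but it is free and predictable, which makes it a good first control.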
- Put it together in a small chain that uses short prompts, limited retrieval, and bounded output. This is the pattern you want in production: small inputs, relevant context, controlled generation.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import FakeEmbeddings
from langchain_core.documents import Document

docs = [
    Document(page_content="Keep prompts short and remove repeated instructions."),
    Document(page_content="Retrieve only the top 2 relevant chunks."),
    Document(page_content="Set max_tokens to cap output length."),
]

vectorstore = FAISS.from_documents(docs, FakeEmbeddings(size=8))
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, max_tokens=100)

question = "How should I optimize token usage?"
context_docs = retriever.invoke(question)
context_text = "\n".join(doc.page_content for doc in context_docs)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer in 3 bullets max."),
    ("human", "Question: {question}\n\nContext:\n{context}"),
])

response = llm.invoke(prompt.format_messages(
    question=question,
    context=context_text,
))
print(response.content)
```
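Before invoking, you can also sanity-check the assembled prompt against a token budget. A rough sketch using the common four-characters-per-token heuristic for English text; both helper names are my own:

```python
def rough_token_estimate(text: str) -> int:
    """Very rough estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)

def within_budget(prompt_text: str, budget: int = 500) -> bool:
    """Return False (and warn) when the assembled prompt looks too large."""
    estimate = rough_token_estimate(prompt_text)
    if estimate > budget:
        print(f"Estimated {estimate} tokens, over the {budget}-token budget")
        return False
    return True

# Check the assembled question + context before calling the model
assembled = "Question: How should I optimize token usage?\n\nContext:\n..."
print("OK to send:", within_budget(assembled))
```

The estimate is deliberately crude; swap in a real tokenizer when you need accuracy. The point is to fail fast when retrieval accidentally pulls in far more context than expected.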
Testing It
Run each snippet and inspect the `token_usage` metadata on responses that come from the OpenAI chat model. You should see lower input tokens after trimming prompts and fewer output tokens after setting `max_tokens`.
For retrieval-based chains, compare `k=5` versus `k=2` and watch how much shorter the assembled context becomes. If your answers still stay accurate with fewer chunks, you’ve found easy savings.
Also test with a real user query that would normally trigger long responses. If the answer stays useful while staying short, your token controls are doing their job.
Next Steps
- Learn message compression patterns with `RunnableLambda` and custom summarizers.
- Add document chunking strategies so retrieved text stays small before it reaches the prompt.
- Track per-request token usage in logs so you can catch regressions early.
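For the logging idea above, a minimal sketch that pulls `token_usage` out of a response’s metadata (the `log_usage` helper is my own, not a LangChain API):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("token_usage")

def log_usage(response, request_id: str) -> dict:
    """Log and return the token_usage dict from a chat model response."""
    usage = (getattr(response, "response_metadata", None) or {}).get("token_usage", {})
    logger.info(
        "request=%s prompt=%s completion=%s total=%s",
        request_id,
        usage.get("prompt_tokens"),
        usage.get("completion_tokens"),
        usage.get("total_tokens"),
    )
    return usage
```

Call it after every `llm.invoke(...)` and you get a per-request audit trail you can grep or chart when costs drift.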
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.