LangChain Tutorial (Python): Optimizing Token Usage for Intermediate Developers
This tutorial shows you how to reduce token usage in a LangChain Python app without hurting retrieval or response quality. You’ll build a small RAG pipeline that trims prompt size, limits retrieved context, and keeps chat history under control.
What You'll Need
- Python 3.10+
- langchain
- langchain-openai
- langchain-community
- tiktoken
- OpenAI API key in OPENAI_API_KEY
- A few local text files for testing, or your own document corpus
- Basic familiarity with LangChain chains, retrievers, and chat models
Install the packages:
pip install langchain langchain-openai langchain-community tiktoken faiss-cpu
The faiss-cpu package backs the FAISS vector store used in the retrieval step below.
Step-by-Step
- Start by measuring token usage before you optimize anything. If you do not measure first, you will guess wrong about where the waste is coming from.
import os
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

# Expects OPENAI_API_KEY to already be set in your environment.
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY", "")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

messages = [
    SystemMessage(content="You are a concise assistant."),
    HumanMessage(content="Explain how token optimization works in LangChain in one paragraph."),
]

response = llm.invoke(messages)
print(response.content)

# Token counts reported by the API for this call; this is your baseline.
print("Prompt tokens:", response.response_metadata["token_usage"]["prompt_tokens"])
print("Completion tokens:", response.response_metadata["token_usage"]["completion_tokens"])
- Build a retrieval pipeline that chunks documents more aggressively and retrieves fewer chunks. Smaller chunks and a lower k usually cut prompt size immediately.
from pathlib import Path
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
docs = []
for file_path in Path("./docs").glob("*.txt"):
    docs.extend(TextLoader(str(file_path), encoding="utf-8").load())

# Smaller chunks mean each retrieved hit carries fewer tokens into the prompt.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(chunks, embeddings)

# Cap retrieval at 3 chunks per query.
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
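To verify the effect of smaller chunks and a lower k, you can count how many tokens the retrieved chunks actually contribute. A quick check, assuming the o200k_base encoding and the sample question used later in this tutorial:
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
retrieved = retriever.invoke("What does the policy say about document retention?")

# Rough per-chunk and total token cost of the retrieved context.
for doc in retrieved:
    print(doc.metadata.get("source", "unknown"), len(enc.encode(doc.page_content)))
print("Total context tokens:", sum(len(enc.encode(d.page_content)) for d in retrieved))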
- Use a compact prompt template and keep retrieved context tightly formatted. Most token waste comes from verbose instructions and dumping raw documents into the model.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context. Be concise."),
    ("human", "Question: {question}\n\nContext:\n{context}"),
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def format_docs(docs):
    # Keep the context compact: a source tag plus the content, nothing else.
    return "\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
        for d in docs[:3]
    )
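To see how much the instructions themselves cost, compare a verbose system prompt against the compact one with a local count. A quick illustration; the verbose text below is an invented example of the kind of preamble that tends to accumulate:
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

verbose = (
    "You are an extremely helpful, thorough, and friendly AI assistant. "
    "Always read the entire context carefully, think step by step, explain "
    "your reasoning in detail, and provide comprehensive, well-structured answers."
)
compact = "Answer using only the provided context. Be concise."

print("Verbose system prompt tokens:", len(enc.encode(verbose)))
print("Compact system prompt tokens:", len(enc.encode(compact)))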
- Wire the retriever and prompt together with a simple chain. This keeps the input surface small and makes it obvious where tokens are spent.
from langchain_core.runnables import RunnablePassthrough
rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
)
result = rag_chain.invoke("What does the policy say about document retention?")
print(result.content)
print("Tokens:", result.response_metadata["token_usage"])
- Add chat history trimming if your app is conversational. History grows fast, so keep only the last few turns or summarize older messages before sending them back to the model.
from langchain_core.messages import AIMessage, HumanMessage, trim_messages

history = [
    HumanMessage(content="We discussed retention rules."),
    AIMessage(content="Yes, retention is 7 years."),
    HumanMessage(content="What about exceptions?"),
    AIMessage(content="Exceptions apply to legal holds."),
]

# Keep only the most recent turns that fit inside a 120-token budget,
# using the chat model itself to count tokens.
trimmed_history = trim_messages(
    history,
    max_tokens=120,
    token_counter=llm,
    strategy="last",
)

for msg in trimmed_history:
    print(msg.type, msg.content)
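The step above also mentions summarizing older turns instead of dropping them. A minimal sketch of that idea, assuming everything older than the last four messages can be collapsed into a single summary message (the cutoff of four and the helper name are illustrative, not a LangChain API):
from langchain_core.messages import SystemMessage

def summarize_older_turns(messages, keep_last=4):
    # Collapse everything older than the last keep_last messages into one summary.
    older, recent = messages[:-keep_last], messages[-keep_last:]
    if not older:
        return recent
    transcript = "\n".join(f"{m.type}: {m.content}" for m in older)
    summary = llm.invoke(
        "Summarize this conversation in two sentences:\n" + transcript
    ).content
    return [SystemMessage(content=f"Summary of earlier conversation: {summary}")] + recent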
Testing It
Run the RAG chain against a question that should be answered by one or two chunks only. Check that the retrieved context is short, the answer stays relevant, and the token usage metadata is lower than your baseline.
Then compare k=3 versus k=8, or chunk_size=500 versus chunk_size=1500, and watch prompt tokens change. If your answers get worse after shrinking context, increase chunk overlap slightly before increasing k.
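One way to run that comparison is to rebuild the retriever with each k and log the prompt tokens the same question consumes. A rough harness, assuming the vectorstore, prompt, and llm defined above; the context is formatted inline here because format_docs caps at three documents, which would hide the difference:
question = "What does the policy say about document retention?"

for k in (3, 8):
    docs = vectorstore.as_retriever(search_kwargs={"k": k}).invoke(question)
    context = "\n\n".join(d.page_content for d in docs)
    prompt_value = prompt.invoke({"question": question, "context": context})
    reply = llm.invoke(prompt_value)
    print(f"k={k}:", reply.response_metadata["token_usage"]["prompt_tokens"], "prompt tokens")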
For conversational apps, send a long message history through trimming and confirm older turns disappear while recent ones remain intact. The goal is not zero context; it is enough context to answer correctly without paying for dead weight.
Next Steps
- Add an output parser that forces short, structured answers instead of verbose prose.
- Use contextual compression retrievers to shrink retrieved documents before prompting.
- Build token budget checks into your tests so prompt growth fails CI before it hits production; a starting sketch follows below.
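A starting point for that last item, as a minimal pytest-style sketch. It assumes the retriever, prompt, and format_docs from the steps above are importable, and the 1500-token budget is an arbitrary illustrative number:
import tiktoken

PROMPT_BUDGET = 1500  # illustrative budget; pick one that matches your app

def test_prompt_stays_under_budget():
    question = "What does the policy say about document retention?"
    docs = retriever.invoke(question)
    rendered = prompt.invoke({"question": question, "context": format_docs(docs)}).to_string()
    enc = tiktoken.get_encoding("o200k_base")
    # Fail the build if the rendered prompt grows past the agreed budget.
    assert len(enc.encode(rendered)) <= PROMPT_BUDGET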
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.