LangChain Tutorial (Python): optimizing token usage for advanced developers
This tutorial shows you how to reduce token spend in a LangChain Python app without breaking answer quality. You’ll build a retrieval pipeline that trims prompt bloat, caps context, and compresses retrieved documents so the model only sees what it needs.
What You'll Need
- Python 3.10+
- A virtual environment
- langchain
- langchain-openai
- langchain-community
- tiktoken
- An OpenAI API key set as OPENAI_API_KEY
- A small local text corpus for testing, such as a few .txt files
Step-by-Step
- Start by installing the packages and loading your API key through the environment. Keep the model choice explicit so you can compare token usage across runs.
pip install langchain langchain-openai langchain-community tiktoken
export OPENAI_API_KEY="your-api-key"
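Before running the rest of the examples, it can help to fail fast on a missing key and pin the model name in one place. The MODEL_NAME constant below is just a convention for this tutorial, not something LangChain requires.
import os

# Pin the model once so token comparisons across runs use the same tokenizer
# and pricing. MODEL_NAME is a tutorial convention, not a LangChain requirement.
MODEL_NAME = "gpt-4o-mini"

# Fail fast if the key was not exported in this shell session.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; export it before running the examples.")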
- Build a retrieval index with chunk sizes that match your use case. Smaller chunks reduce irrelevant context, but if you go too small you pay more overhead in embeddings and retrieval fan-out.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
loader = TextLoader("policy.txt", encoding="utf-8")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=700,
    chunk_overlap=100,
)
chunks = splitter.split_documents(docs)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
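If you want a quick feel for how chunk_size translates into tokens, a rough per-chunk count is enough. This sketch assumes the o200k_base encoding used by the gpt-4o family; swap it for your model's encoding if you use a different one.
import tiktoken

# Rough per-chunk token counts, useful for comparing chunk_size settings
# before committing to one. o200k_base matches the gpt-4o model family.
enc = tiktoken.get_encoding("o200k_base")
sizes = [len(enc.encode(chunk.page_content)) for chunk in chunks]
print(f"{len(chunks)} chunks, avg {sum(sizes) // len(sizes)} tokens, max {max(sizes)} tokens")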
- Add token-aware compression before the final prompt. This keeps retrieved chunks from bloating the context window when the source material is verbose. Note that LLMChainExtractor makes its own LLM calls to do the trimming, so it pays off most when the same context gets reused or the final model is more expensive than the compressor.
from langchain_openai import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)
question = "What is the claims escalation process?"
compressed_docs = compression_retriever.invoke(question)
for i, doc in enumerate(compressed_docs, start=1):
    print(f"\n--- Chunk {i} ---\n{doc.page_content[:800]}")
- Use a compact prompt and keep output bounded. Most teams waste tokens by sending long system instructions and asking for open-ended responses when a structured answer would do.
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context. If missing, say you don't know."),
    ("human", "Question: {question}\n\nContext:\n{context}\n\nReturn 3 bullet points max."),
])
context_text = "\n\n".join(doc.page_content for doc in compressed_docs)
messages = prompt.format_messages(question=question, context=context_text)
response = llm.invoke(messages)
print(response.content)
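The "3 bullet points max" instruction is only a soft constraint. If you also want a hard ceiling on output length, ChatOpenAI accepts a max_tokens argument; the 200 below is an illustrative value, not a recommendation.
from langchain_openai import ChatOpenAI

# Hard cap on completion length; the prompt's bullet limit is only a hint the
# model can overshoot. 200 is illustrative; pick a budget per task.
bounded_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, max_tokens=200)
response = bounded_llm.invoke(messages)
print(response.content)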
- Add token counting so you can measure before you optimize. If you do not instrument this, you will guess where the spend is coming from and usually guess wrong.
import tiktoken
def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    encoding_name = "o200k_base" if "gpt-4o" in model else "cl100k_base"
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))
prompt_tokens = sum(count_tokens(m.content) for m in messages if hasattr(m, "content"))
context_tokens = count_tokens(context_text)
print("Prompt tokens:", prompt_tokens)
print("Context tokens:", context_tokens)
print("Answer tokens:", count_tokens(response.content))
- Wrap it into a reusable chain so every request follows the same low-token path. This is where most production savings come from: consistent chunking, capped retrieval, compression, and short outputs.
from langchain_core.runnables import RunnablePassthrough
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

qa_chain = (
    {
        "context": compression_retriever | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
)
result = qa_chain.invoke("What is the claims escalation process?")
print(result.content)
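For the hard per-request check mentioned in the testing section below, the OpenAI callback in langchain_community tracks tokens across every call the chain makes, including the compressor's. A minimal sketch, assuming your installed version exposes get_openai_callback:
from langchain_community.callbacks import get_openai_callback

# Per-request accounting across the whole chain, compressor calls included.
with get_openai_callback() as cb:
    result = qa_chain.invoke("What is the claims escalation process?")
print(result.content)
print(f"Prompt tokens: {cb.prompt_tokens}, completion tokens: {cb.completion_tokens}")
print(f"Estimated cost (USD): {cb.total_cost:.6f}")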
Testing It
Run the script against one document that has both relevant and irrelevant sections. You should see fewer retrieved chunks after compression than raw retrieval, and your final prompt should stay much smaller than a naive “stuff everything into context” approach.
Check three things:
- The answer still cites the right policy details.
- The compressed context is shorter than the original retrieved text.
- Repeated queries produce stable output lengths instead of drifting upward.
If you want a hard check, log total input tokens per request before and after compression. In practice, that number should drop noticeably once you cap k, compress retrieved docs, and force short outputs.
Next Steps
- Add max_tokens limits per route so summaries, classifications, and extraction tasks each have different budgets (a minimal sketch follows this list).
- Replace raw retrieval with reranking plus compression when your corpus gets noisy.
- Instrument LangSmith traces to track token usage by chain step instead of only at the request level.
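One way to start on per-route budgets; the route names and limits below are invented for illustration, not part of any LangChain API.
from langchain_openai import ChatOpenAI

# Hypothetical per-route output budgets; adjust names and limits to your routes.
route_budgets = {"summarize": 300, "classify": 20, "extract": 150}
route_llms = {
    name: ChatOpenAI(model="gpt-4o-mini", temperature=0, max_tokens=budget)
    for name, budget in route_budgets.items()
}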
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit