LangGraph Tutorial (Python): optimizing token usage for intermediate developers
This tutorial shows you how to build a LangGraph workflow in Python that keeps token usage under control without giving up useful agent behavior. You’ll see how to trim state, summarize history, and route work so the model only sees what it actually needs.
What You'll Need
- Python 3.10+
- `langgraph`
- `langchain-openai`
- `langchain-core`
- An OpenAI API key in `OPENAI_API_KEY`
- Basic familiarity with LangGraph state, nodes, and edges
Install the packages:
```bash
pip install langgraph langchain-openai langchain-core
```
Set your API key:
```bash
export OPENAI_API_KEY="your-key-here"
```
Step-by-Step
- Start with a minimal graph state that stores only the fields you need. Token waste usually starts when people dump full chat history into every node, so keep the state narrow from the beginning.

```python
from typing import TypedDict, Annotated
from operator import add
from langgraph.graph import StateGraph, START, END
from langchain_core.messages import BaseMessage

class GraphState(TypedDict):
    messages: Annotated[list[BaseMessage], add]  # appended to by each node
    summary: str       # compact running memory of older turns
    user_query: str    # the current question
    route: str         # "direct" or "context"
```
- Add a summarizer node that compresses old conversation into a short running summary. This is the main trick: instead of sending every prior message back to the model, you keep a compact memory string and only pass recent turns.

```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def summarize_state(state: GraphState):
    # Only the last few turns go to the model; older context lives in the summary.
    recent = state["messages"][-6:]
    prompt = [
        SystemMessage(content="Summarize the conversation in 5 bullet points max. Keep names, decisions, and open questions."),
        HumanMessage(content=f"Current summary:\n{state.get('summary', '')}\n\nRecent messages:\n{recent}"),
    ]
    summary = llm.invoke(prompt).content
    return {"summary": summary}
```
- Route requests before calling expensive nodes. If the query is simple, send it to a short-answer path; if it needs context, use the summary plus recent messages instead of the full log.

```python
def classify_route(state: GraphState):
    query = state["user_query"].lower()
    if any(word in query for word in ["latest", "current", "status", "summary"]):
        return {"route": "context"}
    return {"route": "direct"}

def direct_answer(state: GraphState):
    response = llm.invoke([
        SystemMessage(content="Answer concisely in 3 sentences max."),
        HumanMessage(content=state["user_query"]),
    ])
    return {"messages": [response]}

def context_answer(state: GraphState):
    response = llm.invoke([
        SystemMessage(content="Use the summary and recent messages. Answer concisely."),
        HumanMessage(content=f"Summary:\n{state.get('summary', '')}\n\nQuestion:\n{state['user_query']}"),
    ])
    return {"messages": [response]}
```
- Build conditional edges so only one expensive path runs per request. This prevents “always-on” context loading, which is one of the fastest ways to burn tokens in production. Note that the branch happens right after classification, so the summarizer (itself an LLM call) only runs on the context path instead of on every request.

```python
def choose_path(state: GraphState):
    return state["route"]

builder = StateGraph(GraphState)
builder.add_node("classify_route", classify_route)
builder.add_node("summarize_state", summarize_state)
builder.add_node("direct_answer", direct_answer)
builder.add_node("context_answer", context_answer)

builder.add_edge(START, "classify_route")
# Branch immediately so the summarizer LLM call is skipped on the direct path.
builder.add_conditional_edges(
    "classify_route",
    choose_path,
    {
        "direct": "direct_answer",
        "context": "summarize_state",
    },
)
builder.add_edge("summarize_state", "context_answer")
builder.add_edge("direct_answer", END)
builder.add_edge("context_answer", END)

graph = builder.compile()
```
- Run the graph with a small input payload and inspect what comes back. In production, you’d also track prompt size and response size per node so you can spot regressions early.

```python
from langchain_core.messages import HumanMessage

result = graph.invoke({
    "messages": [HumanMessage(content="We discussed claim fraud detection yesterday.")],
    "summary": "",
    "user_query": "What is the latest status?",
})

print(result["route"])
print(result["summary"])
print(result["messages"][-1].content)
```
Testing It
Run one test with a context-heavy question like “What is the latest status?” and another with a direct question like “Define fraud scoring.” The first should take the summary-aware path, while the second should stay short and avoid unnecessary context expansion.
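Because the router is plain keyword matching over the query string, you can sanity-check it without the graph or an API key. A minimal standalone sketch, re-stating the tutorial's routing logic as a pure function:

```python
# Standalone check of the keyword router: the routing decision is a pure
# function of the query string, so it can be unit-tested with no API calls.
def classify(query: str) -> str:
    q = query.lower()
    if any(word in q for word in ["latest", "current", "status", "summary"]):
        return "context"
    return "direct"

print(classify("What is the latest status?"))  # context
print(classify("Define fraud scoring."))       # direct
```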
If you want to verify token savings properly, log input/output token counts from your model responses or wrap calls with your own telemetry. The important check is not just correctness; it’s whether repeated turns stop growing linearly in prompt size.
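As a sketch of that telemetry, here is a small per-node token ledger. It assumes each model response exposes a `usage_metadata` dict with `input_tokens` and `output_tokens`, as recent `langchain-core` `AIMessage` objects do; the example feeds it plain dicts so it runs without an API key, and the node names are just illustrations:

```python
from collections import defaultdict

class TokenLedger:
    """Accumulates input/output token counts per graph node."""

    def __init__(self):
        self.totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})

    def record(self, node: str, usage: dict):
        # usage mirrors the shape of AIMessage.usage_metadata.
        self.totals[node]["input_tokens"] += usage.get("input_tokens", 0)
        self.totals[node]["output_tokens"] += usage.get("output_tokens", 0)

ledger = TokenLedger()
ledger.record("direct_answer", {"input_tokens": 42, "output_tokens": 18})
ledger.record("context_answer", {"input_tokens": 310, "output_tokens": 55})
ledger.record("context_answer", {"input_tokens": 295, "output_tokens": 48})
print({k: dict(v) for k, v in dict(ledger.totals).items()})
```

In a real graph you would call `ledger.record(node_name, response.usage_metadata)` inside each node after `llm.invoke`, then watch whether the context path's input totals keep climbing turn over turn.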
A good smoke test is to feed 20 synthetic messages into the `messages` field and confirm your summarizer still keeps answers usable without sending all 20 turns to the model.
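The windowing half of that smoke test can be checked offline. Plain strings stand in for message objects here, and the `-6` slice mirrors the one in `summarize_state`:

```python
# 20 synthetic turns; only the last 6 should ever reach the summarizer prompt.
synthetic = [f"turn {i}: user said something about claims" for i in range(20)]

recent = synthetic[-6:]  # mirrors state["messages"][-6:] in summarize_state

assert len(recent) == 6
assert recent[0].startswith("turn 14")
print(f"prompt carries {len(recent)} of {len(synthetic)} turns")
```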
Next Steps
- Add message windowing so only the last N turns are retained alongside the summary.
- Use structured outputs with Pydantic models to reduce verbose free-form responses.
- Instrument per-node token usage and set alerts for abnormal prompt growth.
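For the windowing idea in the first bullet, one sketch is to swap `operator.add` for a custom reducer that appends and then truncates. The reducer name and window size below are illustrative choices, not LangGraph requirements:

```python
# Illustrative message-windowing reducer: LangGraph reducers take the
# existing channel value and the update, and return the merged value.
MAX_TURNS = 10

def windowed_add(existing: list, new: list) -> list:
    # Append the new messages, then keep only the most recent MAX_TURNS.
    return (existing + new)[-MAX_TURNS:]

# In GraphState you would then declare:
#     messages: Annotated[list[BaseMessage], windowed_add]

history: list = []
for i in range(25):
    history = windowed_add(history, [f"msg-{i}"])

print(len(history), history[0])
```

Because truncation happens in the reducer, no node can accidentally re-inflate the prompt with the full history; the summary field is what carries anything older than the window.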
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.