LangGraph Tutorial (Python): optimizing token usage for intermediate developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to build a LangGraph workflow in Python that keeps token usage under control without giving up useful agent behavior. You’ll see how to trim state, summarize history, and route work so the model only sees what it actually needs.

What You'll Need

  • Python 3.10+
  • langgraph
  • langchain-openai
  • langchain-core
  • An OpenAI API key in OPENAI_API_KEY
  • Basic familiarity with LangGraph state, nodes, and edges

Install the packages:

pip install langgraph langchain-openai langchain-core

Set your API key:

export OPENAI_API_KEY="your-key-here"

Step-by-Step

  1. Start with a minimal graph state that stores only the fields you need. Token waste usually starts when people dump full chat history into every node, so keep the state narrow from the beginning.
from typing import TypedDict, Annotated
from operator import add

from langgraph.graph import StateGraph, START, END
from langchain_core.messages import BaseMessage

class GraphState(TypedDict):
    messages: Annotated[list[BaseMessage], add]
    summary: str
    user_query: str
    route: str
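To see why the Annotated reducer matters, here is a minimal, dependency-free sketch of what the add reducer does when a node returns a partial messages update (plain strings stand in for BaseMessage objects):

```python
from operator import add

# LangGraph merges each node's partial state update into the existing
# state using the reducer annotated on that key. For
# Annotated[list[BaseMessage], add], the reducer is list concatenation,
# so node outputs append to history instead of overwriting it.
state = {"messages": ["user: hi"], "summary": ""}
node_update = {"messages": ["ai: hello back"]}

merged = add(state["messages"], node_update["messages"])
print(merged)  # ['user: hi', 'ai: hello back']
```

Keys without a reducer, like summary and route, are simply overwritten by the latest node that returns them.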
  2. Add a summarizer node that compresses old conversation into a short running summary. This is the main trick: instead of sending every prior message back to the model, you keep a compact memory string and only pass recent turns.
import os
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def summarize_state(state: GraphState):
    recent = state["messages"][-6:]
    # Render messages as plain text; interpolating BaseMessage objects
    # directly would dump their full repr and waste tokens.
    recent_text = "\n".join(f"{m.type}: {m.content}" for m in recent)
    prompt = [
        SystemMessage(content="Summarize the conversation in 5 bullet points max. Keep names, decisions, and open questions."),
        HumanMessage(content=f"Current summary:\n{state.get('summary', '')}\n\nRecent messages:\n{recent_text}")
    ]
    summary = llm.invoke(prompt).content
    return {"summary": summary}
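The fixed [-6:] slice is simple but blind to message length: six long messages can cost far more than six short ones. A sketch of a character-budget trimmer you could use instead (the helper name and signature are my own, not a LangGraph API):

```python
def trim_to_budget(messages: list[str], max_chars: int = 2000) -> list[str]:
    """Keep the newest messages whose combined length fits a character budget.

    Rough rule of thumb: ~4 characters per token, so 2000 chars is
    roughly a 500-token budget.
    """
    kept: list[str] = []
    used = 0
    for text in reversed(messages):   # walk newest-first
        if used + len(text) > max_chars:
            break
        kept.append(text)
        used += len(text)
    return list(reversed(kept))       # restore chronological order

history = ["x" * 900, "y" * 900, "z" * 900]
print(len(trim_to_budget(history)))  # 2: the oldest message no longer fits
```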
  3. Route requests before calling expensive nodes. If the query is simple, send it to a short-answer path; if it needs context, use the summary plus recent messages instead of the full log.
def classify_route(state: GraphState):
    query = state["user_query"].lower()
    if any(word in query for word in ["latest", "current", "status", "summary"]):
        return {"route": "context"}
    return {"route": "direct"}

def direct_answer(state: GraphState):
    response = llm.invoke([
        SystemMessage(content="Answer concisely in 3 sentences max."),
        HumanMessage(content=state["user_query"])
    ])
    return {"messages": [response]}

def context_answer(state: GraphState):
    response = llm.invoke([
        SystemMessage(content="Use the summary and recent messages. Answer concisely."),
        HumanMessage(content=f"Summary:\n{state.get('summary', '')}\n\nQuestion:\n{state['user_query']}")
    ])
    return {"messages": [response]}
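Because the router is pure string matching, you can sanity-check it without any LLM calls. A standalone sketch of the same keyword rule:

```python
CONTEXT_KEYWORDS = ("latest", "current", "status", "summary")

def route_for(query: str) -> str:
    """Mirror of classify_route's keyword rule, kept LLM-free for testing."""
    q = query.lower()
    return "context" if any(word in q for word in CONTEXT_KEYWORDS) else "direct"

print(route_for("What is the latest status?"))  # context
print(route_for("Define fraud scoring."))       # direct
```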
  4. Build conditional edges so only one expensive path runs per request. This prevents “always-on” context loading, which is one of the fastest ways to burn tokens in production.
def choose_path(state: GraphState):
    return state["route"]

builder = StateGraph(GraphState)
builder.add_node("classify_route", classify_route)
builder.add_node("summarize_state", summarize_state)
builder.add_node("direct_answer", direct_answer)
builder.add_node("context_answer", context_answer)

builder.add_edge(START, "classify_route")
# Branch immediately after classification so the summarizer only runs
# on the context path: the direct path skips that extra LLM call.
builder.add_conditional_edges(
    "classify_route",
    choose_path,
    {
        "direct": "direct_answer",
        "context": "summarize_state",
    },
)
builder.add_edge("summarize_state", "context_answer")
builder.add_edge("direct_answer", END)
builder.add_edge("context_answer", END)

graph = builder.compile()
  5. Run the graph with a small input payload and inspect what comes back. In production, you’d also track prompt size and response size per node so you can spot regressions early.
from langchain_core.messages import HumanMessage

result = graph.invoke({
    "messages": [HumanMessage(content="We discussed claim fraud detection yesterday.")],
    "summary": "",
    "user_query": "What is the latest status?",
})

print(result["route"])
print(result["summary"])
print(result["messages"][-1].content)

Testing It

Run one test with a context-heavy question like “What is the latest status?” and another with a direct question like “Define fraud scoring.” The first should take the summary-aware path, while the second should stay short and avoid unnecessary context expansion.

If you want to verify token savings properly, log input/output token counts from your model responses or wrap calls with your own telemetry. The important check is not just correctness; it’s whether repeated turns stop growing linearly in prompt size.
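One lightweight way to do that logging is an accumulator keyed by node name. This sketch assumes each model response exposes a usage dict with input_tokens and output_tokens keys; in langchain, AIMessage.usage_metadata has this shape for providers that report usage:

```python
from collections import defaultdict

class TokenLog:
    """Accumulate per-node token counts across a conversation."""

    def __init__(self):
        self.per_node = defaultdict(lambda: {"input": 0, "output": 0})

    def record(self, node: str, usage: dict) -> None:
        # usage is assumed to look like {"input_tokens": N, "output_tokens": M}
        self.per_node[node]["input"] += usage.get("input_tokens", 0)
        self.per_node[node]["output"] += usage.get("output_tokens", 0)

log = TokenLog()
log.record("summarize_state", {"input_tokens": 812, "output_tokens": 96})
log.record("context_answer", {"input_tokens": 540, "output_tokens": 120})
log.record("context_answer", {"input_tokens": 610, "output_tokens": 90})
print(dict(log.per_node))
# {'summarize_state': {'input': 812, 'output': 96},
#  'context_answer': {'input': 1150, 'output': 210}}
```

If per-turn input totals keep climbing across a long conversation, your summarization or windowing is leaking history back into the prompt.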

A good smoke test is to feed 20 synthetic messages into the messages field and confirm your summarizer still keeps answers usable without sending all 20 turns to the model.

Next Steps

  • Add message windowing so only the last N turns are retained alongside the summary.
  • Use structured outputs with Pydantic models to reduce verbose free-form responses.
  • Instrument per-node token usage and set alerts for abnormal prompt growth.
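The first bullet can be sketched in a few lines. This is a hypothetical helper, not a LangGraph API: older turns are represented only by the summary string, so the payload stays bounded no matter how long the chat gets.

```python
def windowed_prompt(summary: str, messages: list[str], n: int = 4) -> list[str]:
    """Return one summary line plus the last n turns."""
    payload = [f"Summary so far: {summary}"] if summary else []
    payload.extend(messages[-n:])
    return payload

history = [f"turn {i}" for i in range(1, 21)]  # 20 turns of chat
prompt = windowed_prompt("user is debugging fraud scoring", history)
print(len(prompt))  # 5: the summary line plus the last 4 turns
```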


By Cyprian Aarons, AI Consultant at Topiax.
