LangGraph Tutorial (Python): optimizing token usage for beginners

By Cyprian Aarons. Updated 2026-04-21.

This tutorial shows you how to build a small LangGraph workflow in Python that keeps token usage under control by trimming state, summarizing history, and avoiding unnecessary model calls. You need this when your graph starts growing conversation state or tool output and your LLM bills begin rising for no good reason.

What You'll Need

  • Python 3.10+
  • langgraph
  • langchain-openai
  • langchain-core
  • An OpenAI API key set as OPENAI_API_KEY
  • Basic familiarity with LangGraph nodes, edges, and state

Install the packages:

pip install langgraph langchain-openai langchain-core

Step-by-Step

  1. Start with a minimal graph state that stores only what the model actually needs. The main mistake beginners make is carrying full chat history through every node when a short summary would do.
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_core.messages import HumanMessage, AIMessage

class State(TypedDict):
    messages: Annotated[list, add_messages]
    summary: str
    token_budget: int
  2. Add a compacting node that summarizes older messages once the conversation gets too long. This reduces prompt size before the next LLM call and is the simplest reliable token-saving pattern.
from langchain_openai import ChatOpenAI

# ChatOpenAI reads OPENAI_API_KEY from the environment.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

from langchain_core.messages import RemoveMessage

def summarize_if_needed(state: State):
    messages = state["messages"]
    if len(messages) <= 4:
        return {}

    older = messages[:-2]
    older_text = "\n".join(f"{m.type}: {m.content}" for m in older)
    summary_msg = llm.invoke(
        [HumanMessage(content=f"Summarize these messages in 3 bullet points:\n{older_text}")]
    )
    # add_messages appends by default, so the older turns must be
    # deleted explicitly with RemoveMessage, not just left out of
    # the returned list.
    return {
        "summary": summary_msg.content,
        "messages": [RemoveMessage(id=m.id) for m in older],
    }
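If you want to avoid spending an LLM call on the summarization step itself, a deterministic character-budget trim is a cheaper fallback. This is a sketch of the idea in plain Python, not a LangGraph API: the helper name `trim_to_char_budget` and the `(role, text)` tuple shape are our own.

```python
# Cheaper fallback: trim by a rough character budget instead of
# paying for a summarization call. Helper name and budget value
# are illustrative, not part of LangGraph.
def trim_to_char_budget(turns: list[tuple[str, str]], budget: int) -> list[tuple[str, str]]:
    """Keep the most recent turns whose combined text fits in `budget` chars."""
    kept: list[tuple[str, str]] = []
    used = 0
    for role, text in reversed(turns):
        if used + len(text) > budget:
            break
        kept.append((role, text))
        used += len(text)
    kept.reverse()  # restore chronological order
    return kept

turns = [
    ("human", "My policy renewal failed."),
    ("ai", "What error did you see?"),
    ("human", "It said invalid payment method."),
]
print(trim_to_char_budget(turns, budget=60))
```

The trade-off: trimming is free and predictable, but it drops information outright, while summarization preserves a compressed version of it.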
  3. Build the answer node so it uses the summary plus only the recent turns. This keeps the prompt small while still preserving enough context for good answers.
def answer(state: State):
    summary = state.get("summary", "")
    recent_messages = state["messages"][-2:]

    system_text = (
        "You are a concise assistant.\n"
        f"Conversation summary: {summary}\n"
        "Use only the recent messages and summary."
    )

    response = llm.invoke(
        [{"role": "system", "content": system_text}, *recent_messages]
    )
    return {"messages": [response]}
  4. Wire the graph so it compacts first, then answers. In production, this pattern prevents every node from seeing raw history unless it truly needs it.
builder = StateGraph(State)

builder.add_node("compact", summarize_if_needed)
builder.add_node("answer", answer)

builder.add_edge(START, "compact")
builder.add_edge("compact", "answer")
builder.add_edge("answer", END)

graph = builder.compile()
  5. Run the graph with a small initial state and inspect how much context survives after compaction. The important part is that only a subset of messages gets forwarded after the threshold is crossed.
initial_state: State = {
    "messages": [
        HumanMessage(content="My policy renewal failed."),
        AIMessage(content="What error did you see?"),
        HumanMessage(content="It said invalid payment method."),
        AIMessage(content="Try updating the card."),
        HumanMessage(content="I updated it but still get rejected."),
    ],
    "summary": "",
    "token_budget": 1000,
}

result = graph.invoke(initial_state)
print(result["summary"])
print(result["messages"][-1].content)
  6. If you want stricter control, gate expensive work behind a simple budget check before calling the model again. Beginners often skip this and let every branch call an LLM even when no new information was added.
def should_answer(state: State):
    # Skip the model call when there is too little new context or the
    # token budget is exhausted.
    if len(state["messages"]) < 2 or state["token_budget"] <= 0:
        return END
    return "answer"

budget_builder = StateGraph(State)
budget_builder.add_node("compact", summarize_if_needed)
budget_builder.add_node("answer", answer)

budget_builder.add_edge(START, "compact")
budget_builder.add_conditional_edges("compact", should_answer)
budget_builder.add_edge("answer", END)

budget_graph = budget_builder.compile()
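The routing function is plain Python, so you can sanity-check it without invoking the graph at all. This sketch extends the tutorial's message-count gate to also consult the otherwise-unused token_budget field, and substitutes a local string sentinel for LangGraph's END so it runs standalone:

```python
# Standalone check of the gate logic. The local END sentinel stands in
# for LangGraph's END constant; the token_budget check is an extension
# of the tutorial's message-count guard.
END = "__end__"

def should_answer(state: dict) -> str:
    if len(state["messages"]) < 2 or state["token_budget"] <= 0:
        return END
    return "answer"

print(should_answer({"messages": ["hi"], "token_budget": 1000}))        # too little context
print(should_answer({"messages": ["hi", "there"], "token_budget": 0}))   # budget exhausted
print(should_answer({"messages": ["hi", "there"], "token_budget": 500}))
```

Testing routing functions in isolation like this is cheap insurance: a wrong return value here silently sends every request to the model.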

Testing It

Run the script with OPENAI_API_KEY exported in your shell and confirm that the graph returns an answer without error. Then increase the number of input turns and verify that summary starts filling in while messages gets trimmed down to just recent content.

To check token savings, compare prompt length before and after compaction by printing the serialized message count or using your provider’s usage metadata if available. If you see every turn being sent back to the model unchanged, your compaction step is not firing early enough.
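If you don't have provider usage metadata handy, a rough chars-per-token heuristic is enough for a before/after comparison. This sketch uses the common ~4-characters-per-token approximation (not a real tokenizer) and an invented summary string purely for illustration:

```python
# Rough before/after prompt-size comparison using the ~4 chars-per-token
# heuristic. This is an approximation, not a real tokenizer; the summary
# text here is invented for illustration.
def approx_tokens(texts: list[str]) -> int:
    return sum(len(t) for t in texts) // 4

before = [
    "My policy renewal failed.",
    "What error did you see?",
    "It said invalid payment method.",
    "Try updating the card.",
    "I updated it but still get rejected.",
]
summary = "- Renewal failed\n- Invalid payment method\n- Card update did not help"
after = [summary] + before[-2:]  # summary replaces the older turns

print("before:", approx_tokens(before))
print("after: ", approx_tokens(after))
```

With only five short turns the savings are modest; the gap widens quickly as history grows, which is exactly when compaction matters.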

A good smoke test is to feed in 10-15 alternating human/AI turns and confirm that only a small tail of messages reaches the final answer node. That tells you your graph is controlling context growth instead of letting it compound.
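The compaction rule itself can be smoke-tested without any model calls. This sketch mirrors the tutorial's threshold (summarize when more than 4 messages, keep the last 2) over synthetic string turns:

```python
# Smoke-test the compaction rule in isolation: with the tutorial's
# threshold (more than 4 messages triggers summarization, last 2 kept),
# a 12-turn history should leave only a 2-message tail.
def compact(history: list[str], threshold: int = 4, keep: int = 2):
    if len(history) <= threshold:
        return history, None
    return history[-keep:], history[:-keep]  # (tail, turns to summarize)

history = [f"turn {i}" for i in range(12)]
tail, to_summarize = compact(history)
print(len(tail), len(to_summarize))  # prints "2 10"
```

If the equivalent check against your real graph shows more than the expected tail reaching the answer node, the compaction node is either not wired before it or its threshold is never crossed.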

Next Steps

  • Add token-aware routing using a real tokenizer or provider usage metadata.
  • Replace naive summarization with structured memory fields like facts, open_questions, and decisions.
  • Learn LangGraph persistence so summaries survive across sessions without reprocessing old turns.
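As a starting point for the structured-memory idea above, here is a minimal sketch. The field names (facts, open_questions, decisions) and the merge helper are our own convention, not a LangGraph API:

```python
# Sketch of structured memory: instead of one free-text summary, keep
# typed fields that nodes can update independently. Field names and the
# merge_memory helper are illustrative conventions, not LangGraph APIs.
from typing import TypedDict

class Memory(TypedDict):
    facts: list[str]
    open_questions: list[str]
    decisions: list[str]

def merge_memory(old: Memory, update: Memory) -> Memory:
    """Union-style merge that deduplicates entries while preserving order."""
    merged: Memory = {"facts": [], "open_questions": [], "decisions": []}
    for key in ("facts", "open_questions", "decisions"):
        seen = dict.fromkeys(old.get(key, []) + update.get(key, []))
        merged[key] = list(seen)
    return merged

mem = merge_memory(
    {"facts": ["payment method invalid"], "open_questions": ["which card type?"], "decisions": []},
    {"facts": ["payment method invalid", "card was updated"], "open_questions": [], "decisions": ["retry renewal"]},
)
print(mem)
```

Structured fields compress better than prose summaries because each node can read or update only the slice it needs, instead of re-sending the whole summary on every call.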

By Cyprian Aarons, AI Consultant at Topiax.
