LangGraph Tutorial (Python): optimizing token usage for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to build a LangGraph workflow that actively reduces token usage without breaking agent quality. You need this when your graph starts doing too many full-context LLM calls, repeating retrieved text, or carrying irrelevant state across nodes.

What You'll Need

  • Python 3.10+
  • langgraph
  • langchain-openai
  • langchain-core
  • OpenAI API key set as OPENAI_API_KEY
  • Basic familiarity with LangGraph state graphs and message passing
  • A terminal and a virtual environment

Install the packages:

pip install langgraph langchain-openai langchain-core

Step-by-Step

  1. Start by using a compact state model instead of passing raw chat history everywhere. The main trick is to store only what each node needs, then summarize or trim aggressively before the next model call.
from typing import Annotated, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages

class State(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    summary: str
    topic: str
  2. Build two models: one for expensive reasoning and one for cheap summarization. In production, this split is where most token savings come from, because you stop paying your best model for routine maintenance tasks.
from langchain_openai import ChatOpenAI

reasoner = ChatOpenAI(model="gpt-4o", temperature=0)         # strong model, reserved for user-facing answers
summarizer = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # cheap model for maintenance work like summarization
  3. Add a summarization node that compresses old context into a short running summary. This lets later nodes work from a few lines of durable context instead of replaying the entire conversation.
from langchain_core.messages import HumanMessage, SystemMessage

def summarize_state(state: State):
    # Join plain message text so the prompt carries content, not full message
    # reprs with metadata -- those reprs would waste the tokens we are saving.
    history = "\n".join(m.content for m in state["messages"])
    prompt = [
        SystemMessage(content="Summarize the conversation in under 80 words. Keep only facts needed for future reasoning."),
        HumanMessage(content=f"Current summary:\n{state.get('summary', '')}\n\nMessages:\n{history}")
    ]
    result = summarizer.invoke(prompt)
    return {"summary": result.content}
  4. Add a routing node that decides whether the graph needs the full message list or just the summary. This avoids sending large histories into every branch, especially when the user asks follow-up questions that only need the compressed context.
def route(state: State):
    last = state["messages"][-1].content.lower()
    if any(word in last for word in ["recap", "summary", "what did we decide"]):
        return "answer_from_summary"
    return "answer_from_messages"
  5. Create separate answer nodes for summary-based and full-context responses. The summary path should be your default for lightweight follow-ups, while the full-context path is reserved for cases where recent details matter.
from langchain_core.messages import AIMessage

def answer_from_summary(state: State):
    prompt = [
        SystemMessage(content="Answer using only the summary. Be concise."),
        HumanMessage(content=f"Summary:\n{state.get('summary', '')}\n\nQuestion:\n{state['messages'][-1].content}")
    ]
    result = reasoner.invoke(prompt)
    return {"messages": [AIMessage(content=result.content)]}

def answer_from_messages(state: State):
    prompt = [
        SystemMessage(content="Answer using the conversation messages. Be concise."),
        *state["messages"]
    ]
    result = reasoner.invoke(prompt)
    return {"messages": [AIMessage(content=result.content)]}
  6. Wire the graph so it summarizes first, then routes to the cheapest valid answer path. The important part is that you are not blindly feeding every node the same payload; you are controlling context size at each edge.
from langgraph.graph import END, START, StateGraph

builder = StateGraph(State)

builder.add_node("summarize", summarize_state)
builder.add_node("answer_from_summary", answer_from_summary)
builder.add_node("answer_from_messages", answer_from_messages)

builder.add_edge(START, "summarize")
builder.add_conditional_edges("summarize", route, {
    "answer_from_summary": "answer_from_summary",
    "answer_from_messages": "answer_from_messages",
})
builder.add_edge("answer_from_summary", END)
builder.add_edge("answer_from_messages", END)

graph = builder.compile()
  7. Run it with a small input and inspect what gets returned. In a real app, you would also log prompt sizes per node so you can see exactly where tokens are being burned.
from langchain_core.messages import HumanMessage

result = graph.invoke({
    "messages": [HumanMessage(content="We decided to prioritize fraud alerts over account enrichment.")],
    "summary": "",
    "topic": "fraud"
})

print(result["messages"][-1].content)
print("Summary:", result["summary"])
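The per-node size logging mentioned above can be sketched as a small wrapper applied when you register each node. The helper name and the characters-to-tokens estimate below are illustrative, not part of LangGraph:

```python
from functools import wraps

def log_state_size(name, node):
    """Wrap a node so each call prints a rough size of the state it receives."""
    @wraps(node)
    def wrapper(state):
        approx_chars = len(str(state))
        # ~4 chars/token is a rough rule of thumb for English text, not an exact count.
        print(f"[{name}] incoming state ~{approx_chars} chars (~{approx_chars // 4} tokens)")
        return node(state)
    return wrapper
```

You would apply it at registration time, e.g. `builder.add_node("summarize", log_state_size("summarize", summarize_state))`, so the graph wiring itself stays unchanged.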

Testing It

Run one request that contains several long messages, then ask a short follow-up like “what did we decide?” The follow-up should hit the summary path instead of replaying the whole history.
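You can also exercise the routing logic directly, without any API calls. This standalone copy of the `route` function uses a tiny stand-in message class so it runs even without langchain installed:

```python
from dataclasses import dataclass

# Minimal stand-in for HumanMessage, just enough for the keyword check.
@dataclass
class Msg:
    content: str

def route(state):
    # Same keyword check as the routing node in the graph above.
    last = state["messages"][-1].content.lower()
    if any(word in last for word in ["recap", "summary", "what did we decide"]):
        return "answer_from_summary"
    return "answer_from_messages"

print(route({"messages": [Msg("What did we decide about alerts?")]}))   # answer_from_summary
print(route({"messages": [Msg("What was the last customer complaint?")]}))  # answer_from_messages
```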

To verify token savings, compare prompt sizes before and after adding summarization by logging len(str(prompt)) or using your provider’s usage metadata if available. You should see lower input tokens on summary-based turns and fewer repeated instructions across nodes.
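A rough before/after comparison needs nothing more than a character-based estimate. The ~4 characters-per-token ratio below is a common rule of thumb for English, not an exact count; use your provider's usage metadata when you need real numbers:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

# Illustrative payloads: a long replayed history vs. the compressed summary.
full_history = "\n".join([
    "We reviewed the fraud alert backlog in detail. " * 20,
    "Enrichment was deprioritized after the incident review. " * 20,
])
summary = "Decided: fraud alerts take priority over account enrichment."

print("full-history prompt ~", estimate_tokens(full_history), "tokens")
print("summary prompt      ~", estimate_tokens(summary), "tokens")
```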

Also test a case where recent detail matters, such as “what was the last customer complaint?” That should force the full-message branch so you do not optimize away correctness.

Next Steps

  • Add a token budget gate that trims messages before every model call based on estimated input size.
  • Replace free-form summaries with structured state fields like decisions, open_questions, and entities.
  • Add observability with LangSmith so you can track prompt growth per node and catch regressions early.
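The token budget gate in the first bullet can be sketched as a plain-Python function that drops the oldest messages until the estimated payload fits a budget. The budget value and the chars-per-token estimate are illustrative:

```python
def trim_to_budget(messages, max_tokens=1000):
    """Drop the oldest messages until the estimated total fits the budget.

    `messages` is any sequence of objects with a `.content` string; the
    newest message is always kept so the current question survives.
    """
    kept = []
    total = 0
    for msg in reversed(messages):            # walk newest-first
        cost = max(1, len(msg.content) // 4)  # rough chars-per-token estimate
        if kept and total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))               # restore chronological order
```

langchain-core also ships a `trim_messages` utility that does this with real token counters and message-boundary handling; the sketch above just shows the shape of the gate you would run before each model call.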


By Cyprian Aarons, AI Consultant at Topiax.
