AutoGen Tutorial (Python): Optimizing Token Usage for Advanced Developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to reduce token spend in AutoGen by controlling message history, summarizing agent context, and routing work to the smallest model that can do the job. You need this when your multi-agent workflows are technically correct but too expensive, too slow, or hitting context limits before the task is done.

What You'll Need

  • Python 3.10+
  • pyautogen installed
  • An OpenAI-compatible API key
  • Access to at least one cheaper model for lightweight tasks
  • A terminal and a working Python virtual environment

Install the package:

pip install pyautogen

Set your API key:

export OPENAI_API_KEY="your-key-here"

Step-by-Step

  1. Start by defining two model configs: one cheaper model for routine turns and one stronger model for synthesis. The main token-saving pattern is not “use one big model everywhere,” but “use the right model for each stage.”
import os
from autogen import AssistantAgent, UserProxyAgent

# Two entries sharing one key: index 0 is the cheap model for routine
# turns, index 1 is the stronger model reserved for final synthesis.
config_list = [
    {
        "model": "gpt-4o-mini",
        "api_key": os.environ["OPENAI_API_KEY"],
    },
    {
        "model": "gpt-4o",
        "api_key": os.environ["OPENAI_API_KEY"],
    },
]

cheap_llm_config = {
    "config_list": [config_list[0]],
    "temperature": 0,  # deterministic replies are easier to compare and cache
}

strong_llm_config = {
    "config_list": [config_list[1]],
    "temperature": 0,
}
  2. Create a worker agent with a short system message and keep its job narrow. Shorter instructions mean fewer prompt tokens on every turn, which matters a lot in long-running chats.
# Keep the role narrow and the system message short: these tokens are
# resent on every turn, so each extra word is paid for repeatedly.
worker = AssistantAgent(
    name="worker",
    llm_config=cheap_llm_config,
    system_message=(
        "You are a concise coding assistant. "
        "Answer in bullets, avoid repetition, and only include necessary details."
    ),
)

# Fully automated proxy: never prompts a human, never executes code.
user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    code_execution_config=False,
)
  3. Use a bounded chat history and summarize aggressively before context grows too large. In AutoGen, the practical move is to trim conversation scope instead of letting every turn accumulate forever.
from collections import deque

# maxlen=6 keeps only the six most recent messages; older turns fall
# off automatically instead of accumulating into every prompt.
history = deque(maxlen=6)

def record(role: str, content: str) -> None:
    history.append({"role": role, "content": content})

record("user", "Review this function for performance issues.")
record("assistant", "It has an O(n^2) loop; use a hash map.")

print(list(history))
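To wire the bounded window into an actual agent call, one option is to fold it into the outgoing message yourself rather than relying on AutoGen's internal transcript. A minimal sketch using the worker and user agents defined above; max_turns is available in pyautogen 0.2+:

# Send only the bounded window as context, so the request size stays
# flat no matter how long the overall session runs.
context = "\n".join(f"{m['role']}: {m['content']}" for m in history)

reply = user.initiate_chat(
    worker,
    message=f"Recent context:\n{context}\n\nNext task: suggest a concrete fix.",
    max_turns=1,  # cap at one round trip to avoid runaway auto-replies
)
print(reply.chat_history[-1]["content"])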
  4. Add a summarizer agent that compresses intermediate results before handing them to the stronger model. This keeps expensive models focused on decisions instead of rereading raw conversation logs.
# Compression is routine work, so the summarizer also runs on the
# cheap model.
summarizer = AssistantAgent(
    name="summarizer",
    llm_config=cheap_llm_config,
    system_message=(
        "Summarize the conversation into 5 bullets max. "
        "Keep only decisions, constraints, and open questions."
    ),
)

summary_prompt = """
Summarize this interaction:
- User wants token optimization in AutoGen.
- Worker suggested shorter prompts and bounded history.
- Need a production-ready workflow.
"""

# initiate_chat returns a ChatResult; the last chat_history entry is
# the summarizer's reply. max_turns=1 stops the proxy from auto-replying
# and burning extra turns.
summary_result = user.initiate_chat(
    summarizer,
    message=summary_prompt,
    max_turns=1,
)
print(summary_result.chat_history[-1]["content"])
  5. Route final synthesis to the stronger model only after compression. This pattern gives you better answer quality without paying premium-token prices for every intermediate step.
# Only this final step pays for the stronger model, and it sees
# compressed notes instead of the full transcript.
synthesizer = AssistantAgent(
    name="synthesizer",
    llm_config=strong_llm_config,
    system_message=(
        "You produce final answers from compressed notes. "
        "Do not restate raw chat history."
    ),
)

compressed_notes = (
    "User needs an AutoGen token optimization tutorial.\n"
    "- Use cheap model for routine turns.\n"
    "- Bound history.\n"
    "- Summarize before synthesis.\n"
)

result = user.initiate_chat(
    synthesizer,
    message=f"Turn these notes into an implementation plan:\n{compressed_notes}",
    max_turns=1,  # one round trip: notes in, plan out
)
print(result.chat_history[-1]["content"])

Testing It

Run the script and confirm each agent returns a response without errors. The key thing to check is that the worker and summarizer use the cheaper config while only the final synthesizer uses the stronger model.
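You can also verify the wiring programmatically. A minimal sketch against the agents defined above; it assumes llm_config is still the plain dict we built, which some pyautogen versions may normalize differently:

# Assert each agent is bound to the model we expect.
for agent, expected in [
    (worker, "gpt-4o-mini"),
    (summarizer, "gpt-4o-mini"),
    (synthesizer, "gpt-4o"),
]:
    model = agent.llm_config["config_list"][0]["model"]
    assert model == expected, f"{agent.name} uses {model}, expected {expected}"
    print(f"{agent.name}: {model}")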

Then inspect your provider dashboard or usage logs if available. You should see fewer tokens consumed per turn because you are not feeding full transcripts into every agent call.
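Recent pyautogen releases (0.2+) also track usage per agent, so you can get a first read without leaving the terminal. The exact report format varies by version, so treat this as a sketch:

import autogen

# Per-agent breakdown of prompt tokens, completion tokens, and cost.
for agent in (worker, summarizer, synthesizer):
    agent.print_usage_summary()

# Aggregate totals across all three agents.
print(autogen.gather_usage_summary([worker, summarizer, synthesizer]))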

If you want a practical sanity check, increase the prompt length and rerun it twice: once with full history passed through, once with summarized notes only. The summarized version should be noticeably cheaper and faster.
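One way to script that comparison, assuming ChatResult.cost is available in your pyautogen version; long_transcript here is a hypothetical stand-in for your own accumulated history:

# Hypothetical A/B check: same question, full transcript vs. summary.
long_transcript = "\n".join(f"turn {i}: ..." for i in range(200))  # stand-in
question = "What are the open issues?"

full = user.initiate_chat(
    synthesizer,
    message=f"{long_transcript}\n\n{question}",
    max_turns=1,
)
lean = user.initiate_chat(
    synthesizer,
    message=f"{compressed_notes}\n\n{question}",
    max_turns=1,
)

# ChatResult.cost reports token spend per model for each run.
print("full transcript:", full.cost)
print("summarized notes:", lean.cost)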

Next Steps

  • Add automatic transcript trimming based on token count instead of fixed message count (a sketch follows this list).
  • Introduce tool routing so only retrieval or code execution tasks hit heavier agents.
  • Learn how to cache repeated system prompts and static policy text across agent calls.
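
For the first item, here is a minimal sketch of token-based trimming. It assumes tiktoken is installed and uses the cl100k_base encoding as an approximation; match the encoding to your actual model in production:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximation; pick your model's encoding

def trim_to_budget(messages: list[dict], budget: int = 2000) -> list[dict]:
    """Keep the most recent messages that fit under a token budget."""
    kept, used = [], 0
    for msg in reversed(messages):  # walk backwards from the newest message
        cost = len(enc.encode(msg["content"]))
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order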
