AutoGen Tutorial (Python): optimizing token usage for beginners
This tutorial shows you how to reduce token usage in a Python AutoGen setup without breaking the agent workflow. You need this when your conversations get expensive, slow, or noisy because agents are sending too much context back and forth.
What You'll Need
- Python 3.10+
- `autogen-agentchat` installed
- An OpenAI API key set in your environment
- Basic familiarity with AutoGen agents and chat loops
- A terminal and a text editor
Install the packages (the OpenAI model client ships separately in autogen-ext):

```shell
pip install -U "autogen-agentchat" "autogen-ext[openai]"
```
Set your API key:
```shell
export OPENAI_API_KEY="your-key-here"
```
Step-by-Step
- Start with a small, explicit model config.
The fastest way to waste tokens is to use a large model for everything. For beginner setups, define one model config and keep the conversation scope tight.
```python
import os

from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.openai import OpenAIChatCompletionClient

# autogen-agentchat v0.4+ takes a model client object, not a config dict.
model_client = OpenAIChatCompletionClient(
    model="gpt-4o-mini",
    api_key=os.environ["OPENAI_API_KEY"],
)

agent = AssistantAgent(
    name="assistant",
    model_client=model_client,
    system_message="You are a concise assistant. Keep answers short.",
)
```
- Limit the amount of conversation history you send.
AutoGen agents can accumulate context quickly. For simple tasks, keep the thread short and avoid carrying old messages into every new request.
```python
import asyncio

from autogen_agentchat.messages import TextMessage

message_1 = TextMessage(content="Summarize this invoice in 3 bullets.", source="user")
message_2 = TextMessage(content="Now extract only the total amount.", source="user")

# Send only what you need for the current task.
# Avoid replaying long prior threads unless they are required.
# run() is a coroutine in autogen-agentchat v0.4+, so drive it with asyncio.
result = asyncio.run(agent.run(task=message_2.content))
print(result.messages[-1].content)
```
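If you do manage history yourself, a simple cap on how many turns you replay goes a long way. This is an illustrative helper (`trim_history` is my own name, not an AutoGen API), sketched under the assumption that you store turns as plain strings:

```python
def trim_history(messages: list[str], keep_last: int = 2) -> list[str]:
    # Keep only the most recent turns; older context is dropped.
    return messages[-keep_last:]

history = [
    "user: summarize this invoice",
    "assistant: here are 3 bullets...",
    "user: extract the total amount",
    "assistant: $420.00",
]

print(trim_history(history))
```

For multi-step tasks you may want `keep_last` higher, but start small and widen only when the agent visibly loses needed context.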
- Use tighter system instructions.
Verbose system prompts cost tokens on every turn. Replace long policy text with short, operational instructions that tell the agent exactly how to respond.
```python
agent = AssistantAgent(
    name="assistant",
    model_client=model_client,
    system_message=(
        "Answer in 3 bullets max. "
        "If data is missing, say 'missing'. "
        "Do not explain your reasoning."
    ),
)
```
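To see why this matters, remember the system message is resent with every request, so its cost multiplies by the number of turns. A rough back-of-envelope sketch (the ~4 characters per token figure is a common heuristic for English text, not an exact count):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

long_prompt = "You are a helpful, thorough, and careful assistant. " * 20
short_prompt = "Answer in 3 bullets max. If data is missing, say 'missing'."

turns = 20  # the system message rides along on every one of these
print("long prompt overhead:", estimate_tokens(long_prompt) * turns)
print("short prompt overhead:", estimate_tokens(short_prompt) * turns)
```

The gap between those two numbers is pure overhead you pay on every conversation, which is why trimming the system message is usually the highest-leverage change.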
- Add a lightweight preprocessor before calling the model.
Trim user input before it reaches AutoGen. This is useful when users paste logs, emails, or documents with lots of irrelevant text.
```python
def compact_prompt(text: str, max_chars: int = 1200) -> str:
    # Collapse runs of whitespace, then truncate to a character budget.
    text = " ".join(text.split())
    return text[:max_chars]

raw_text = """
Customer says the policy was renewed last month.
There are multiple paragraphs here...
"""

task = f"Extract renewal date from: {compact_prompt(raw_text)}"
result = asyncio.run(agent.run(task=task))
print(result.messages[-1].content)
```
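A quick before/after check makes the savings concrete. This repeats the same `compact_prompt` helper against a deliberately noisy input so you can compare raw and trimmed sizes offline:

```python
def compact_prompt(text: str, max_chars: int = 1200) -> str:
    text = " ".join(text.split())
    return text[:max_chars]

# Simulated pasted input: repeated text with messy whitespace.
raw = "Customer   says   the policy was renewed last month.\n\n" * 50
compacted = compact_prompt(raw)

print(len(raw), "chars ->", len(compacted), "chars")
```

Fewer characters in means fewer prompt tokens billed, so this one function often pays for itself on the first pasted email.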
- Split work into small tasks instead of one giant prompt.
One large prompt usually burns more tokens than two focused prompts. Ask for extraction first, then ask for formatting or summarization only after you have the needed fields.
```python
extract_task = (
    "From this note, extract customer name and renewal date: "
    "John Doe renewed on 2024-10-12."
)
extract_result = asyncio.run(agent.run(task=extract_task))
extracted = extract_result.messages[-1].content  # pass the text, not the whole result object

format_task = f"Format this as JSON with keys name and renewal_date: {extracted}"
format_result = asyncio.run(agent.run(task=format_task))
print(format_result.messages[-1].content)
```
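A side benefit of splitting: each step's output is small enough to validate locally for free, before you spend tokens on the next call. The string below is a hypothetical model response, shown only to illustrate the check:

```python
import json

# Hypothetical output from the formatting step, for illustration only.
formatted = '{"name": "John Doe", "renewal_date": "2024-10-12"}'

record = json.loads(formatted)

# Cheap local validation: confirm the keys before passing data downstream.
if set(record) != {"name", "renewal_date"}:
    raise ValueError(f"unexpected keys: {set(record)}")

print(record["renewal_date"])
```

If validation fails, retry only the formatting step with the same extracted text instead of rerunning the whole chain.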
- Stop early when you already have the answer.
If you are using multi-agent workflows, do not let agents debate forever. Set clear output limits and terminate once the required field is found.
```python
from autogen_agentchat.messages import TextMessage

task = TextMessage(
    content="Return only the policy number from: Policy number is POL-88319.",
    source="user",
)
result = asyncio.run(agent.run(task=task.content))
print(result.messages[-1].content)
```
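The same early-stop idea in plain Python: check each response for the required field and break the moment it appears, rather than letting the exchange run on. The `responses` list is mock data standing in for successive agent turns:

```python
import re

# Mock responses simulating successive agent turns.
responses = [
    "Let me look through the document first...",
    "The policy number is POL-88319.",
    "I can also summarize the rest of the document if useful.",
]

policy = None
turns_used = 0
for text in responses:
    turns_used += 1
    match = re.search(r"POL-\d+", text)
    if match:
        policy = match.group(0)
        break  # stop as soon as the required field is found

print(policy, "found after", turns_used, "turns")
```

Every turn you skip here is a full prompt-plus-history you never pay for, which is why termination checks matter most in multi-agent setups.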
Testing It
Run each snippet and compare the length of the prompts you send before and after trimming. You should see shorter inputs, fewer tokens consumed per request, and faster responses for simple extraction tasks. If your outputs become too terse or miss details, increase only the specific limit that caused it instead of widening everything at once.
A practical check is to log raw input length versus compacted input length before calling agent.run(). If you are using OpenAI usage metadata in your stack, track prompt tokens per request over a few runs and confirm they drop after applying these changes.
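Here is one way to wire that check in: a small wrapper that logs input sizes before delegating to the agent. `logged_run` and `EchoAgent` are hypothetical helpers, the latter a stub with the same `run(task=...)` shape so the sketch runs without an API key:

```python
import asyncio

def compact_prompt(text: str, max_chars: int = 1200) -> str:
    text = " ".join(text.split())
    return text[:max_chars]

async def logged_run(agent, raw_task: str):
    # Log sizes before and after trimming, then delegate to the agent.
    compacted = compact_prompt(raw_task)
    print(f"raw: {len(raw_task)} chars, compacted: {len(compacted)} chars")
    return await agent.run(task=compacted)

class EchoAgent:
    # Stub agent so the sketch runs offline; swap in a real agent in practice.
    async def run(self, task: str) -> str:
        return f"echo: {task[:40]}"

result = asyncio.run(logged_run(EchoAgent(), "Extract the   renewal   date. " * 30))
print(result)
```

Once the logging is in place, run a few representative tasks and keep the character (or token) counts somewhere you can compare across changes.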
Next Steps
- Learn AutoGen message routing so you can avoid resending unnecessary context between agents
- Add token logging around each `run()` call to measure real savings
- Explore structured outputs so extraction tasks return smaller, more reliable responses
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit