AutoGen Tutorial (Python): optimizing token usage for intermediate developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to reduce token usage in a Python AutoGen workflow without breaking agent behavior. You’ll learn how to trim prompts, cap conversation growth, and route only the minimum context needed between agents.

What You'll Need

  • Python 3.10+
  • autogen-agentchat installed
  • An OpenAI API key in OPENAI_API_KEY
  • Basic familiarity with AutoGen agents and chats
  • A terminal and a virtual environment

Install the package:

pip install autogen-agentchat

Set your API key:

export OPENAI_API_KEY="your-key-here"

Step-by-Step

  1. Start by using a small, explicit system message. Most token waste comes from long instructions that repeat the same constraints in every turn, so keep the agent’s role narrow and task-specific.
from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.openai import OpenAIChatCompletionClient

model_client = OpenAIChatCompletionClient(
    model="gpt-4o-mini",
)

assistant = AssistantAgent(
    name="support_assistant",
    model_client=model_client,
    system_message=(
        "You are a support assistant. "
        "Answer in under 120 words. "
        "If details are missing, ask one clarifying question."
    ),
)
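To see why a tight system message matters, you can compare prompt sizes with a rough character-based heuristic (about 4 characters per token for English text). This is a sketch, not a real tokenizer; for exact counts, use your provider's tokenizer library.

```python
# Rough heuristic: ~4 characters per token for English text.
# For exact counts, use your model provider's tokenizer instead.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

verbose_prompt = (
    "You are a helpful, friendly, knowledgeable support assistant. "
    "Always be polite, always restate the customer's question, and always "
    "explain your reasoning in detail before giving the final answer."
)
tight_prompt = (
    "You are a support assistant. Answer in under 120 words. "
    "If details are missing, ask one clarifying question."
)

print(estimate_tokens(verbose_prompt), estimate_tokens(tight_prompt))
```

Because the system message is resent on every turn, even a modest reduction here compounds across a long conversation.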
  2. Next, cap how much conversation history you send forward. If you keep feeding the full chat transcript back into the model, each request's input grows linearly with the number of turns (and cumulative usage roughly quadratically), even when most of that history is irrelevant.
from autogen_agentchat.messages import TextMessage

messages = [
    TextMessage(content="Customer asks about refund timing.", source="user"),
    TextMessage(content="Refunds take 5-7 business days.", source="support_assistant"),
    TextMessage(content="Customer asks if weekends count.", source="user"),
]

# Keep only the last user message plus a short summary.
trimmed_messages = [
    TextMessage(content="Customer asks if weekends count.", source="user")
]
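The trimming above can be generalized into a small helper. This sketch works over plain (source, content) pairs so it runs standalone; the same logic applies to AutoGen `TextMessage` objects via their `source` attribute.

```python
# Generic trimming sketch over (source, content) pairs; the same filtering
# logic applies to AutoGen TextMessage objects via their .source attribute.
def trim_history(messages, keep_last=1):
    """Keep only the most recent `keep_last` user messages."""
    user_msgs = [m for m in messages if m[0] == "user"]
    return user_msgs[-keep_last:]

history = [
    ("user", "Customer asks about refund timing."),
    ("support_assistant", "Refunds take 5-7 business days."),
    ("user", "Customer asks if weekends count."),
]

print(trim_history(history))  # keeps only the latest user message
```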
  3. Then summarize older context before it gets expensive. In production, this is usually better than hard truncation because you preserve intent while dropping verbose back-and-forth.
summary = (
    "Customer is asking about refund processing time. "
    "They already know refunds take 5-7 business days."
)

prompt = (
    f"Context summary: {summary}\n\n"
    "Respond to the latest customer question: "
    "Do weekends count toward refund processing time?"
)

# Note: run() is a coroutine; call it inside an async function
# (e.g. via asyncio.run(main())).
response = await assistant.run(task=prompt)
print(response.messages[-1].content)
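A simple trigger for this pattern is to fold older turns into a summary slot once the history exceeds a budget. This is a sketch over (source, content) pairs; the inline summarization here is a stub, and in practice you would replace it with a call to a cheap model (that helper is an assumption, not an AutoGen API).

```python
# Sketch: fold older turns into a one-line summary once the history
# exceeds a budget. The string-truncation "summary" is a stand-in for
# a cheap model call in a real pipeline.
def compact_history(messages, budget_chars=200):
    total = sum(len(content) for _, content in messages)
    if total <= budget_chars:
        return messages
    older, latest = messages[:-1], messages[-1]
    summary = "Summary of earlier turns: " + " ".join(c for _, c in older)[:120]
    return [("system", summary), latest]
```

With this in place, per-turn input stays near the budget instead of tracking the full transcript length.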
  4. Use structured outputs for tasks that need precision. Free-form answers tend to drift and add extra tokens, while concise schemas force the model to return only what your app needs.
from pydantic import BaseModel, Field

class RefundAnswer(BaseModel):
    answer: str = Field(description="Short customer-facing answer")
    needs_human_review: bool = Field(description="Whether escalation is required")

structured_assistant = AssistantAgent(
    name="structured_support",
    model_client=model_client,
    system_message="Return concise support answers.",
    output_content_type=RefundAnswer,
)

result = await structured_assistant.run(
    task="A customer wants to know whether weekends count toward refund timing."
)

# With output_content_type set, the final message's content is a RefundAnswer instance.
print(result.messages[-1].content)
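Once the answer arrives as a typed object, downstream code can branch on its fields instead of re-parsing prose. A minimal handling sketch, using a plain dict that mirrors the `RefundAnswer` shape so the logic runs without a model call:

```python
# Downstream handling sketch: act on the parsed RefundAnswer fields.
# A stand-in dict mirrors the schema so the logic runs without a model call.
def handle_refund_answer(parsed: dict) -> str:
    if parsed["needs_human_review"]:
        return "ESCALATE: " + parsed["answer"]
    return parsed["answer"]

example = {
    "answer": "Weekends do not count; refunds use business days.",
    "needs_human_review": False,
}
print(handle_refund_answer(example))
```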
  5. Finally, split work across agents only when it saves tokens overall. A common mistake is letting every agent see everything; instead, pass each agent just the slice of context it needs.
triage_agent = AssistantAgent(
    name="triage_agent",
    model_client=model_client,
    system_message=(
        "Classify the request into billing, technical, or account issue. "
        "Return one label only."
    ),
)

billing_agent = AssistantAgent(
    name="billing_agent",
    model_client=model_client,
    system_message=(
        "Handle billing questions only. "
        "Keep answers short and concrete."
    ),
)

triage_result = await triage_agent.run(task="Customer asks when a refund will arrive.")
label = triage_result.messages[-1].content.strip().lower()

if "billing" in label:  # lenient match in case the model returns extra words
    billing_result = await billing_agent.run(task="Customer asks when a refund will arrive.")
    print(billing_result.messages[-1].content)
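As you add more specialist agents, the if-chain is cleaner as a dispatch table keyed by triage label. A sketch with plain-function handlers standing in for agent calls (the handler names are illustrative, not AutoGen APIs):

```python
# Routing sketch: map triage labels to handlers that receive only the
# context slice they need. Handler bodies stand in for agent calls.
def handle_billing(question: str) -> str:
    return f"[billing] {question}"

def handle_technical(question: str) -> str:
    return f"[technical] {question}"

ROUTES = {"billing": handle_billing, "technical": handle_technical}

def dispatch(label: str, question: str) -> str:
    handler = ROUTES.get(label.strip().lower())
    if handler is None:
        return f"[unrouted] {question}"
    return handler(question)

print(dispatch("Billing", "When will my refund arrive?"))
# → [billing] When will my refund arrive?
```

The key property is that each handler sees only the question, never the triage transcript.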

Testing It

Run a few real prompts and compare token usage against your previous setup. You should see shorter responses, fewer repeated instructions, and slower growth as the conversation continues.

Test three cases: a single-turn question, a multi-turn follow-up, and an escalation scenario that needs structured output. If your summaries are too vague or your system prompts are too restrictive, quality will drop before token savings become meaningful.

If you want hard numbers, log input and output token counts from your model provider dashboard or client telemetry. The main signal you’re looking for is stable token usage across turns instead of runaway transcript growth.
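One way to get those numbers in code: AutoGen chat messages carry per-call usage counts (a `models_usage` field with prompt and completion tokens); treat the exact attribute names as an assumption for your version. The aggregation logic itself is shown here with a stand-in class so it runs without a model call.

```python
# Sketch: sum per-message token usage across a run. The stand-in class
# mimics the shape of AutoGen's per-message usage record so the
# aggregation logic is runnable here.
class Usage:
    def __init__(self, prompt_tokens, completion_tokens):
        self.prompt_tokens = prompt_tokens
        self.completion_tokens = completion_tokens

def total_tokens(usages):
    prompt = sum(u.prompt_tokens for u in usages if u is not None)
    completion = sum(u.completion_tokens for u in usages if u is not None)
    return prompt, completion

run_usages = [Usage(120, 45), None, Usage(80, 30)]
print(total_tokens(run_usages))  # → (200, 75)
```

Log these totals per turn; if the prompt total keeps climbing turn over turn, your trimming or summarization is not kicking in.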

Next Steps

  • Add automatic conversation summarization after every N turns
  • Cache stable retrieval results so agents don’t re-read the same context
  • Learn how to use tool calls to move deterministic work out of the prompt
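The first bullet can be sketched as a simple turn counter: once the history reaches N entries, collapse everything but the latest turn. The `summarize_fn` hook is a placeholder for a cheap model call, not an AutoGen API.

```python
# Sketch: trigger summarization every N turns. summarize_fn is a
# placeholder hook for a cheap model call.
def maybe_summarize(history, n=4, summarize_fn=None):
    if len(history) < n:
        return history
    summarize_fn = summarize_fn or (
        lambda msgs: f"Summary: {len(msgs)} earlier turns"
    )
    return [summarize_fn(history[:-1])] + history[-1:]
```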

By Cyprian Aarons, AI Consultant at Topiax.
