How to Fix 'token limit exceeded when scaling' in AutoGen (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: token-limit-exceeded-when-scaling · autogen · python

What this error means

token limit exceeded when scaling in AutoGen usually means one of your agents is trying to send a conversation payload larger than the model's context window. It shows up most often when you scale from a small demo to a real multi-agent workflow with long chat history, tool outputs, or recursive agent-to-agent handoffs.

In practice, this is not an AutoGen bug. It’s a state-management problem: too much text is being kept in memory and resent on every turn.

The Most Common Cause

The #1 cause is unbounded chat history. In AutoGen, AssistantAgent and UserProxyAgent keep conversation state unless you explicitly trim it. If you keep looping through initiate_chat() or reusing the same agent instances across many tasks, the prompt grows until the model rejects it with errors like:

  • openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is ..."}}
  • token limit exceeded when scaling
  • RuntimeError: Token limit exceeded

Broken vs fixed pattern

  • Broken: reuse the same agent and keep appending history forever. Fixed: reset or trim history between runs.
  • Broken: send full tool output back into chat. Fixed: summarize or store externally.
  • Broken: let nested chats accumulate messages. Fixed: cap the message count or use a summary buffer.
# BROKEN: chat history grows every run
from autogen import AssistantAgent, UserProxyAgent

llm_config = {
    "config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}],
    "temperature": 0,
}

assistant = AssistantAgent(name="assistant", llm_config=llm_config)
user = UserProxyAgent(name="user", human_input_mode="NEVER")

for i in range(50):
    user.initiate_chat(
        assistant,
        message=f"Analyze transaction batch {i}. Include all details."
    )
# FIXED: reset history and keep prompts bounded
from autogen import AssistantAgent, UserProxyAgent

llm_config = {
    "config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}],
    "temperature": 0,
}

assistant = AssistantAgent(name="assistant", llm_config=llm_config)
user = UserProxyAgent(name="user", human_input_mode="NEVER")

for i in range(50):
    user.reset()
    assistant.reset()

    user.initiate_chat(
        assistant,
        message=f"Analyze transaction batch {i}. Return only anomalies and counts."
    )

If you need continuity, don’t keep the entire transcript. Keep a compact summary outside the agent and inject only that summary into the next turn.
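One minimal way to do that is a rolling summary buffer that never grows past a fixed size. This is a sketch, not AutoGen API: the "summarize" step below is plain truncation for illustration, and in a real pipeline you would replace it with a call to a cheap summarization model.

```python
def update_summary(summary: str, latest_result: str, max_chars: int = 2000) -> str:
    """Fold the newest result into a rolling summary, keeping it bounded.

    Truncation stands in for real summarization here; swap in a cheap
    model call in production.
    """
    combined = f"{summary}\n- {latest_result}".strip()
    # Keep only the most recent max_chars so the buffer never grows unbounded.
    return combined[-max_chars:]

summary = ""
for i in range(50):
    result = f"batch {i}: 3 anomalies"  # stand-in for the agent's reply
    summary = update_summary(summary, result)
    # Inject only the compact summary into the next prompt, not the transcript.
    prompt = f"Context so far:\n{summary}\n\nAnalyze transaction batch {i}."
```

Because the buffer is capped, every prompt stays roughly the same size no matter how many batches you process.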

Other Possible Causes

1) Tool output is too large

A common failure mode in AutoGen is returning raw JSON, logs, SQL dumps, or HTML from a tool call and feeding it back into the conversation.

# Too much raw output
def fetch_customer_export():
    return open("customer_export.json").read()  # huge payload

Fix it by truncating or summarizing before returning:

def fetch_customer_export():
    with open("customer_export.json") as f:
        data = f.read()
    return data[:4000]  # or summarize before returning

2) Nested agents are echoing each other

If you have an AssistantAgent calling another AssistantAgent, both may preserve their own histories. That multiplies token usage fast.

# Example of compounding history across nested agents
planner = AssistantAgent(name="planner", llm_config=llm_config)
writer = AssistantAgent(name="writer", llm_config=llm_config)
reviewer = AssistantAgent(name="reviewer", llm_config=llm_config)

Use explicit handoff boundaries and clear histories between stages:

planner.reset()
writer.reset()
reviewer.reset()
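The stage-boundary idea can be sketched without any AutoGen specifics: each stage receives only a bounded handoff string, never the previous stage's full transcript. The lambda stages below are stand-ins for initiate_chat calls.

```python
def run_pipeline(stages, task: str, handoff_limit: int = 1500) -> str:
    """Run stages in sequence, passing forward only a bounded handoff."""
    handoff = task
    for name, stage in stages:
        output = stage(handoff)
        # Truncate (or summarize) before the next stage sees it, so token
        # usage stays flat instead of compounding across agents.
        handoff = output[:handoff_limit]
    return handoff

stages = [
    ("planner", lambda t: f"plan for: {t}"),
    ("writer", lambda t: f"draft based on ({t})"),
    ("reviewer", lambda t: f"approved: {t}"),
]
result = run_pipeline(stages, "quarterly report")
```

With real agents, the same shape applies: reset each agent at its stage boundary and hand it a compact brief, not the accumulated conversation.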

3) Your system prompt is bloated

I’ve seen teams stuff policy docs, SOPs, FAQ pages, and schema dumps into system_message. That burns tokens before the first user message.

assistant = AssistantAgent(
    name="assistant",
    llm_config=llm_config,
    system_message=open("all_company_policies.txt").read()
)

Keep the system prompt short and move reference material into retrieval or external storage.
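The retrieval side can be as simple as pulling only the snippet a task needs. This is a deliberately naive sketch: the policy dictionary and keyword matching are placeholders, and a real system would retrieve over embeddings.

```python
# Hypothetical policy store; in practice this lives in a vector DB or docs service.
POLICY_SNIPPETS = {
    "refunds": "Refunds over $500 require manager approval.",
    "kyc": "New accounts need two forms of ID.",
}

def relevant_policy(task: str) -> str:
    """Naive keyword retrieval; swap in embedding search for real workloads."""
    hits = [text for key, text in POLICY_SNIPPETS.items() if key in task.lower()]
    return "\n".join(hits) or "No specific policy applies."

# Short, stable system prompt; per-task reference material is injected on demand.
system_message = "You are a support agent. Follow the policy excerpt provided with each task."
task = "Process a refunds request for $800"
prompt = f"{relevant_policy(task)}\n\nTask: {task}"
```

The system prompt stays a sentence long, and each task only pays tokens for the policy text it actually needs.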

4) Model context window is too small for your workload

If you’re using a smaller model like gpt-4o-mini, it may simply not have enough room for your current conversation size.

llm_config = {
    "config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}]
}

Switch to a larger-context model if your workflow genuinely needs more room. But treat that as a last resort; fixing prompt growth is usually cheaper and more stable.

How to Debug It

  1. Print message lengths before each call

    • Log the number of messages and approximate character count.
    • If you see steady growth across iterations, you found the leak.
  2. Inspect tool outputs

    • Check whether any function returns full files, stack traces, database rows, or long HTML.
    • If one tool response is massive, trim it immediately.
  3. Reset agents and rerun

    • Call agent.reset() on all participants.
    • If the error disappears, your issue is accumulated state, not model choice.
  4. Reduce prompt size step by step

    • Remove system instructions first.
    • Then remove tool output.
    • Then reduce history length.
    • The layer that makes the error disappear is your culprit.

A simple debug helper makes the growth visible:

def debug_messages(messages):
    total_chars = sum(len(m.get("content", "")) for m in messages if isinstance(m.get("content"), str))
    print(f"messages={len(messages)} chars={total_chars}")
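If you want a rough token figure rather than raw characters, the common chars-divided-by-four heuristic is good enough to spot growth (use tiktoken when you need exact counts):

```python
def approx_tokens(messages) -> int:
    """Very rough estimate: ~4 characters per token for English text."""
    chars = sum(
        len(m.get("content", ""))
        for m in messages
        if isinstance(m.get("content"), str)
    )
    return chars // 4

history = [{"content": "x" * 8000}]
print(approx_tokens(history))  # 2000
```

Log this once per turn; a number that climbs steadily across iterations is your leak.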

Prevention

  • Keep agent memory bounded.
    • Reset between jobs or use a summary-based memory strategy instead of full transcript retention.
  • Treat tool output as untrusted prompt input.
    • Truncate logs, summarize documents, and never dump raw exports into chat.
  • Use smaller prompts and shorter instructions.
    • Put policies in code comments or external docs, not giant system_message blobs.

If you’re building production AutoGen workflows for banking or insurance, assume every message will be repeated multiple times across agents. Once you design for bounded context from day one, this error stops showing up at scale.

By Cyprian Aarons, AI Consultant at Topiax.
