How to Fix 'token limit exceeded during development' in AutoGen (Python)
What the error means
"token limit exceeded during development" usually means AutoGen tried to send more text to the model than the model’s context window can hold. In practice, this shows up when an agent keeps appending chat history, tool output, or long documents until the next LLM call fails.
You’ll usually hit it in multi-agent loops, long-running conversations, or when you feed raw files into AssistantAgent without trimming them first.
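A quick way to check whether you’re close to the limit is to count tokens before you send anything. Here is a minimal sketch using tiktoken (my choice of tokenizer, not something AutoGen requires; any tokenizer matched to your model works):

import tiktoken

# Use the model-specific tokenizer if tiktoken knows the model,
# otherwise fall back to a recent base encoding.
try:
    enc = tiktoken.encoding_for_model("gpt-4o-mini")
except KeyError:
    enc = tiktoken.get_encoding("o200k_base")

prompt = open("large_report.txt").read()
print("prompt tokens:", len(enc.encode(prompt)))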
The Most Common Cause
The #1 cause is unbounded message accumulation in ConversableAgent / AssistantAgent. AutoGen keeps prior messages unless you explicitly trim, summarize, or reset the conversation.
Here’s the broken pattern:
from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={"config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}]},
)

user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    code_execution_config=False,  # this example never executes code
)

# Broken: every call dumps the entire file into the prompt
for i in range(20):
    user.initiate_chat(
        assistant,
        message=f"""
Analyze this report chunk {i}:
{open("large_report.txt").read()}
""",
    )
And here’s the fixed pattern:
from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={"config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}]},
)

user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    code_execution_config=False,
)

report_text = open("large_report.txt").read()

# Fix: send only the needed slice, and reset between runs
for i in range(20):
    chunk = report_text[i * 2000 : (i + 1) * 2000]
    user.reset()       # clear accumulated history so each
    assistant.reset()  # run starts from a clean context
    user.initiate_chat(
        assistant,
        message=f"Analyze this report chunk {i}:\n{chunk}",
    )
If you are using a single long conversation, don’t keep dumping full documents into every turn. Summarize earlier turns, chunk inputs, or start a fresh chat for each task.
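If you do keep one conversation going, pyautogen (0.2+) lets initiate_chat produce a compact summary you can carry into the next task instead of the full transcript; the API differs in the newer AutoGen 0.4 packages, so treat this as a version-dependent sketch:

# Ask AutoGen to condense the finished chat with an extra LLM call
result = user.initiate_chat(
    assistant,
    message="Analyze the first report section.",
    summary_method="reflection_with_llm",
)

# Start the next task fresh, seeded with the short summary only
user.reset()
assistant.reset()
user.initiate_chat(
    assistant,
    message=f"Context from the previous task:\n{result.summary}\n\nNow analyze the next section.",
)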
Other Possible Causes
1. Tool output is too large
A tool that returns raw logs, HTML, JSON blobs, or entire database rows can blow up context fast.
# Bad: returning a huge payload directly
def fetch_logs():
    return open("/var/log/app.log").read()

# Better: truncate or summarize before returning
def fetch_logs():
    logs = open("/var/log/app.log").read()
    return logs[:4000]
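One design note on the truncation above: logs[:4000] keeps the oldest lines. For logs, the most recent entries usually matter more, so a tail-keeping variant (my suggestion, not an AutoGen API) is often the better default:

def fetch_logs(max_chars: int = 4000) -> str:
    with open("/var/log/app.log") as f:
        logs = f.read()
    # Keep the most recent entries; errors usually land at the end
    return logs[-max_chars:]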
2. max_tokens is too high relative to prompt size
If your prompt is already large and you request too many completion tokens, the total can exceed the model limit.
llm_config = {
    "config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}],
    "max_tokens": 4096,  # risky if the input is already large
}

Fix it by reducing the completion budget:

llm_config = {
    "config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}],
    "max_tokens": 800,
}
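The arithmetic behind this: prompt tokens plus max_tokens must fit inside the model’s context window (128k tokens for gpt-4o-mini, per OpenAI’s published limits). A sketch that sizes the completion budget from a measured prompt, reusing the tiktoken counter from earlier:

CONTEXT_WINDOW = 128_000  # gpt-4o-mini; substitute your model's limit
SAFETY_MARGIN = 500       # headroom for role tags and message framing

prompt_tokens = len(enc.encode(prompt))  # enc from the tiktoken sketch above
llm_config = {
    "config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}],
    "max_tokens": min(800, CONTEXT_WINDOW - prompt_tokens - SAFETY_MARGIN),
}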
3. You are passing full files instead of extracted sections
This happens a lot with PDFs, policy docs, and incident reports.
# Bad
message = open("policy.md").read()

# Better: pass only the section you need (extract_relevant_section is
# a placeholder for your own helper; see the sketch below)
message = extract_relevant_section(open("policy.md").read(), heading="Claims Process")
If you need retrieval-style behavior, use chunking plus search before sending content to AutoGen.
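extract_relevant_section above is a placeholder, not an AutoGen helper. A minimal sketch for markdown-style documents, assuming sections are delimited by headings:

import re

def extract_relevant_section(text: str, heading: str) -> str:
    # Split on markdown headings; each piece then starts with its heading text
    sections = re.split(r"(?m)^#{1,6}\s+", text)
    for section in sections:
        if section.lower().startswith(heading.lower()):
            return section
    return text[:2000]  # fallback: first slice if the heading is missing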
4. Nested agent loops amplify history
A manager agent calling worker agents repeatedly can duplicate context across turns.
# Risky pattern: every worker response gets appended to shared history
from autogen import GroupChat

groupchat = GroupChat(agents=[manager, worker1, worker2], messages=[], max_round=50)
Tune the loop count and prune history aggressively:
groupchat = GroupChat(agents=[manager, worker1, worker2], messages=[], max_round=8)
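Beyond capping rounds, recent pyautogen 0.2.x releases ship a TransformMessages capability that prunes history before every LLM call; the newer AutoGen 0.4 packages restructure this API, so treat the following as a version-dependent sketch:

from autogen.agentchat.contrib.capabilities import transform_messages, transforms

# Keep only the last 10 messages and cap their total token count
context_handling = transform_messages.TransformMessages(
    transforms=[
        transforms.MessageHistoryLimiter(max_messages=10),
        transforms.MessageTokenLimiter(max_tokens=3000),
    ]
)
context_handling.add_to_agent(assistant)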
How to Debug It
- Print message sizes before each LLM call. Check how much text you are actually sending:

  total_chars = sum(len(m.get("content", "")) for m in messages)
  print("chars:", total_chars)

- Inspect whether history is growing without bound. If every turn appends more content and never resets, that’s your culprit.
- Temporarily disable tools and external file injection. If the error disappears, the problem is probably oversized tool output or document payloads.
- Lower max_tokens and shorten prompts. If the error only happens on longer prompts, you’re hitting the model’s context limit rather than an AutoGen bug.
Prevention
- Reset agents between independent tasks with agent.reset() instead of reusing one giant conversation forever.
- Chunk long documents before sending them into AssistantAgent, and pass only relevant sections.
- Keep tool outputs small; return summaries, IDs, or top-N results instead of raw dumps.
- Set conservative defaults for max_tokens, especially in multi-agent workflows where history grows quickly.
If you want a stable AutoGen setup in Python, treat token budget like memory management: track what enters context, trim aggressively, and never assume the model will absorb unlimited history.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.