How to Fix 'token limit exceeded' in AutoGen (Python)
If you hit a "token limit exceeded" error in AutoGen, it means the model input grew too large for the context window of the LLM you configured. In practice, this usually happens after a few back-and-forth turns, as AutoGen keeps appending chat history, tool outputs, or long documents to the prompt.
The fix is usually not “use a bigger model” first. In most cases, you need to reduce what AutoGen is sending to the model, control conversation history, or trim oversized tool output.
The Most Common Cause
The #1 cause is unbounded message accumulation in ConversableAgent / AssistantAgent conversations.
AutoGen keeps prior messages unless you explicitly limit them. If you feed long docs, verbose tool output, or many turns into the same agent chat, the prompt grows until the backend throws an error like:
- openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is ..."}}
- Token limit exceeded
- context_length_exceeded
Here’s the broken pattern:
from autogen import AssistantAgent, UserProxyAgent

config_list = [
    {
        "model": "gpt-4o-mini",
        "api_key": "YOUR_KEY",
    }
]

assistant = AssistantAgent(
    name="assistant",
    llm_config={"config_list": config_list},
)

user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
)

# Broken: long-running chat with no truncation strategy
user.initiate_chat(
    assistant,
    message="Review this 40-page policy document and summarize it...",
)
And here’s the fixed pattern:
from autogen import AssistantAgent, UserProxyAgent

config_list = [
    {
        "model": "gpt-4o-mini",
        "api_key": "YOUR_KEY",
    }
]

llm_config = {
    "config_list": config_list,
    "max_tokens": 800,  # caps the size of each response
}

assistant = AssistantAgent(
    name="assistant",
    llm_config=llm_config,
)

user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
)

# Better: keep prompts small and split large work into chunks
# (policy_chunks is a list of document sections; see the chunking sketch below)
for chunk in policy_chunks:
    user.initiate_chat(
        assistant,
        message=f"Summarize this section in 5 bullets:\n\n{chunk}",
    )
The important change is not just max_tokens, which only caps the size of each response, not the input. The bigger fix is that you stop shoving one giant payload into a single conversation.
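Since policy_chunks is not defined above, here is a minimal chunking sketch. The file name and the 4,000-character chunk size are assumptions for illustration; size chunks so each one, plus your instruction, fits comfortably inside the model's context window.

# Minimal chunking sketch. CHUNK_CHARS and the file name are illustrative
# assumptions, not AutoGen settings.
CHUNK_CHARS = 4000

with open("policy_document.txt") as f:
    policy_text = f.read()

policy_chunks = [
    policy_text[i : i + CHUNK_CHARS]
    for i in range(0, len(policy_text), CHUNK_CHARS)
]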
Other Possible Causes
1) Tool output is too large
If your function returns a huge string, AutoGen may inject that entire result back into the chat.
def search_documents(query: str) -> str:
    # Bad: returns massive raw text
    return open("all_results.txt").read()
Fix it by trimming at the source:
def search_documents(query: str) -> str:
    results = open("all_results.txt").read()
    return results[:4000]  # or summarize before returning
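A hard slice can cut results mid-record. Where the format allows, return a structured top-N instead. A minimal sketch, assuming (for illustration only) that all_results.txt holds one result per line:

def search_documents(query: str, top_n: int = 5) -> str:
    # Assumption for this sketch: all_results.txt holds one result per line.
    with open("all_results.txt") as f:
        lines = [line.strip() for line in f if line.strip()]
    # Keep only lines mentioning the query, capped at top_n, each trimmed
    matches = [line for line in lines if query.lower() in line.lower()]
    return "\n".join(
        f"{i + 1}. {line[:500]}" for i, line in enumerate(matches[:top_n])
    )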
2) You are using a model with a smaller context window than you think
A common mistake is assuming all GPT models have enough room for your payload. They do not.
llm_config = {
    "config_list": [
        {"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}
    ]
}
If your workflow needs long context, use a model/config that supports it and still keep messages small. Bigger context helps, but it does not make unlimited prompts safe.
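One way to stop guessing is to count tokens before sending. A sketch using tiktoken: o200k_base is the encoding used by the gpt-4o model family, the default budget reflects gpt-4o-mini's published 128K-token context window, and the 0.8 headroom factor is an arbitrary safety margin for this sketch.

import tiktoken

# o200k_base is the encoding used by the gpt-4o model family
enc = tiktoken.get_encoding("o200k_base")

def fits_in_context(text: str, context_window: int = 128_000) -> bool:
    # Leave headroom for the system message, history, and the response;
    # the 0.8 factor is an arbitrary safety margin, not an API requirement.
    return len(enc.encode(text)) < context_window * 0.8

print(fits_in_context(open("policy_document.txt").read()))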
3) Nested agent chats are duplicating history
If one agent calls another agent and both carry full history forward, token usage balloons quickly.
# Bad pattern: nested chats with full transcript copied around
manager.initiate_chat(assistant_a, message=big_prompt)
manager.initiate_chat(assistant_b, message=big_prompt)
Prefer isolated tasks or pass only the minimum summary needed:
summary = "Key findings: ..."
manager.initiate_chat(assistant_b, message=summary)
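In pyautogen 0.2, initiate_chat returns a ChatResult, so you can forward its summary instead of re-sending the whole transcript. A sketch, assuming manager, assistant_a, assistant_b, and big_prompt are set up as above:

# Run the first task and keep only a summary, not the full transcript
result_a = manager.initiate_chat(
    assistant_a,
    message=big_prompt,
    summary_method="reflection_with_llm",  # or "last_msg" for the final message
)

# Forward just the summary to the second agent
manager.initiate_chat(
    assistant_b,
    message=f"Key findings from the previous step:\n{result_a.summary}",
)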
4) Your system prompt is bloated
A giant system message counts too. If you stuff policies, examples, schemas, and instructions into every agent prompt, you pay for it on every turn.
assistant = AssistantAgent(
    name="assistant",
    system_message=open("huge_instructions.txt").read(),
    llm_config=llm_config,
)
Trim it down:
assistant = AssistantAgent(
    name="assistant",
    system_message="You are a claims review assistant. Return concise bullet points.",
    llm_config=llm_config,
)
How to Debug It
- Print the actual payload size
  - Log every message going into the agent.
  - Check whether one user message or accumulated history is exploding.
- Inspect tool outputs
  - If the error appears right after a function call, inspect what that function returned.
  - Large JSON blobs and raw search results are common offenders.
- Check your model context window
  - Confirm which model AutoGen is actually calling.
  - Do not assume your fallback config selected the large-context model.
- Reduce variables one by one
  - Try a single short prompt with no tools.
  - Then add history.
  - Then add tools.
  - Then add nested agents.
This isolates whether the issue is prompt size, tool output, or conversation growth.
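To see where the tokens are going, you can dump each agent's stored history. A rough sketch for pyautogen 0.2, where ConversableAgent.chat_messages maps each counterpart agent to its message list; character counts are a crude but useful proxy for tokens:

# Rough payload inspection: print how large each stored message is
for peer, messages in assistant.chat_messages.items():
    total = 0
    for i, msg in enumerate(messages):
        size = len(str(msg.get("content") or ""))
        total += size
        print(f"{peer.name} msg {i} ({msg.get('role')}): {size} chars")
    print(f"Total for {peer.name}: {total} chars")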
Prevention
- Keep tool outputs short and structured.
  - Return summaries or top-N results instead of raw dumps.
- Split large documents into chunks before sending them to an agent.
  - Summarize per chunk first, then aggregate.
- Set explicit limits in your AutoGen workflow.
  - Cap response size where possible and avoid carrying full transcripts across tasks.
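Two concrete limits worth wiring in with pyautogen 0.2: cap the number of turns per task with max_turns on initiate_chat, and reset stored history between unrelated tasks with clear_history. A sketch, reusing the user, assistant, and chunk variables from earlier:

# Cap the number of back-and-forth turns for a single task
user.initiate_chat(
    assistant,
    message=f"Summarize this section in 5 bullets:\n\n{chunk}",
    max_turns=3,
)

# Drop accumulated history before starting an unrelated task
assistant.clear_history()
user.clear_history()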
If you want a stable AutoGen setup in production, treat token budget as a first-class constraint. The agents should work within a bounded context window by design, not by accident.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.