How to Fix 'token limit exceeded in production' in AutoGen (Python)

By Cyprian Aarons · Updated 2026-04-21

What the error means

'token limit exceeded in production' usually means your AutoGen agent is sending too much conversation history, tool output, or retrieved context into a single LLM call. In practice, it shows up after a few turns of a group chat, during a long-running assistant session, or when a tool dumps a large payload into the message stream.

The failure is rarely in the model itself. It’s almost always in how you’re building messages, how AutoGen is retaining state, or how much text your tools are returning.

The Most Common Cause

The #1 cause is unbounded chat history. In AutoGen, AssistantAgent, UserProxyAgent, and GroupChatManager can keep appending messages until the prompt exceeds the model’s context window.

Here’s the broken pattern:

from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={"config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}]},
)

user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
)

# Broken: reusing the same conversation forever
for i in range(100):
    user.initiate_chat(
        assistant,
        message=f"Turn {i}: summarize this report again",
        clear_history=False,  # every previous turn is resent on each call
    )

And here’s the fixed pattern:

from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={
        "config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}],
        "max_tokens": 800,
    },
)

user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
)

# Fixed: reset or bound history between runs
for i in range(100):
    user.reset()
    assistant.reset()
    user.initiate_chat(assistant, message=f"Turn {i}: summarize this report again")

If you’re using GroupChat, the same issue applies. A manager that keeps every message from every agent will blow up fast if you don’t trim history.
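
If you are on a recent pyautogen 0.2.x release, the transform_messages capability can trim or token-cap history automatically before each call. Here is a minimal sketch; verify that the capability exists in your installed version:

from autogen import AssistantAgent
from autogen.agentchat.contrib.capabilities import transform_messages, transforms

assistant = AssistantAgent(
    name="assistant",
    llm_config={"config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}]},
)

# Keep only the last 10 messages and cap the prompt at roughly 2000 tokens
# before every LLM call this agent makes.
limiter = transform_messages.TransformMessages(
    transforms=[
        transforms.MessageHistoryLimiter(max_messages=10),
        transforms.MessageTokenLimiter(max_tokens=2000),
    ]
)
limiter.add_to_agent(assistant)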

Broken | Fixed
Reuse the same agent state forever | Reset state between jobs
Append every turn to history | Trim or summarize old turns
Let tool output flow back unfiltered | Cap tool response size

Other Possible Causes

1) Tool output is too large

A common failure mode is a tool returning raw JSON, logs, HTML, or database rows directly into the chat.

def get_customer_records():
    return huge_json_blob  # bad: the raw payload goes straight into the chat

def get_customer_records():
    return huge_json_blob[:4000]  # better: truncate or summarize before returning

If your tool returns megabytes of text, AutoGen will happily pass that back into the next LLM call until you hit:

openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is ..."}}
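
A defensive pattern is to wrap every tool in a generic output cap before you register it with your agents. This is a sketch; cap_output is an illustrative name, not an AutoGen API:

def cap_output(tool_fn, max_chars=4000):
    """Wrap a tool so its return value is truncated before re-entering the chat."""
    def wrapper(*args, **kwargs):
        result = str(tool_fn(*args, **kwargs))
        if len(result) > max_chars:
            return result[:max_chars] + "\n...[truncated, ask for a narrower query]"
        return result
    return wrapper

# Register the capped version with your agents instead of the raw tool:
# safe_get_records = cap_output(get_customer_records)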

2) Retrieval returns too many chunks

If you’re using RAG with RetrieveUserProxyAgent, over-retrieval can flood the prompt.

from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent

rag = RetrieveUserProxyAgent(
    name="rag",
    retrieve_config={
        "task": "qa",
        "docs_path": "./docs",
        "chunk_token_size": 1200,
        "top_k": 20,   # too high for long docs
    },
)

Reduce top_k, shrink chunk size, and filter by relevance before injecting context.
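
A tightened version of the same agent, as a sketch (the exact retrieve_config keys can differ across pyautogen versions, so check them against your install):

from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent

rag = RetrieveUserProxyAgent(
    name="rag",
    retrieve_config={
        "task": "qa",
        "docs_path": "./docs",
        "chunk_token_size": 600,  # smaller chunks
        "top_k": 5,               # only the most relevant chunks reach the prompt
    },
)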

3) System prompt is bloated

I see this a lot in production code: someone pastes policy docs, schemas, examples, and runbooks into the system message.

assistant = AssistantAgent(
    name="assistant",
    system_message=open("everything.md").read(),  # bad if file is huge
)

Keep system prompts tight. Put stable rules there and move large reference material to retrieval or external lookup.
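
A minimal sketch of the slimmed-down agent, assuming the large reference material now comes in through retrieval or a lookup tool instead of the system message:

from autogen import AssistantAgent

assistant = AssistantAgent(
    name="assistant",
    system_message=(
        "You are a support analyst. Answer using only the context provided "
        "in the conversation. If the context is missing, ask for a lookup."
    ),
    llm_config={"config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}]},
)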

4) Model context window mismatch

Sometimes your app assumes one model but production points to another with a smaller context window.

llm_config = {
    "config_list": [
        {"model": "gpt-4o-mini", "api_key": "..."},
        {"model": "gpt-3.5-turbo", "api_key": "..."}  # smaller window if selected
    ]
}

If your fallback model has less context than your primary model, you’ll only see failures when routing switches under load.
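
One guard is to size your prompt budget to the smallest model in the chain. The sketch below hard-codes window sizes that are assumptions; confirm them against your provider's current documentation:

# Assumed context windows -- confirm these against your provider's docs.
CONTEXT_WINDOWS = {
    "gpt-4o-mini": 128_000,
    "gpt-3.5-turbo": 16_385,
}

def max_prompt_tokens(config_list, reply_budget=800, safety_margin=1_000):
    """Prompt budget that fits the smallest model in the fallback chain."""
    smallest = min(CONTEXT_WINDOWS[c["model"]] for c in config_list)
    return smallest - reply_budget - safety_margin

# Using the llm_config defined above: budget for the weakest model in the chain.
print(max_prompt_tokens(llm_config["config_list"]))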

How to Debug It

  1. Print token usage per turn

    • Log prompt size before each LLM call (a token-counting sketch follows this list).
    • If you’re using OpenAI-compatible APIs, capture request payload length and response metadata.
    • Watch for growth after each round trip.
  2. Inspect what AutoGen is actually sending

    • Dump messages before the call.
    • Look for repeated assistant replies, duplicated tool outputs, or giant retrieved chunks.
    • The problem is usually obvious once you see raw history.
  3. Isolate one cause at a time

    • Disable tools first.
    • Disable retrieval next.
    • Then test with a single short system prompt.
    • If the error disappears, re-enable features one by one until it returns.
  4. Check model limits in production config

    • Verify the exact deployed model name.
    • Confirm fallback models and proxy routes.
    • Make sure your token budget matches the smallest model in the chain.
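
For step 1, here is a token-counting sketch that uses tiktoken (an assumption; swap in your provider's tokenizer if you are not on OpenAI-style models). In pyautogen, each ConversableAgent keeps per-peer history in its chat_messages property, so you can run the same count on live agents:

import tiktoken

def count_message_tokens(messages, model="gpt-4o-mini"):
    """Rough prompt-size estimate: content tokens plus a small per-message overhead."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")
    return sum(len(enc.encode(m.get("content") or "")) + 4 for m in messages)

messages = [
    {"role": "system", "content": "You are a support analyst."},
    {"role": "user", "content": "Summarize this report again."},
]
print(count_message_tokens(messages))
# With live agents: count_message_tokens(assistant.chat_messages[user])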

Prevention

  • Keep conversation state bounded:

    • reset agents between jobs
    • summarize older turns
    • trim stale messages before each call
  • Put hard limits on tools:

    • truncate large outputs
    • return summaries instead of raw dumps
    • cap retrieval top_k and chunk sizes
  • Add token-budget tests in CI (a pytest sketch follows this list):

    • simulate long conversations
    • assert prompt size stays under a threshold
    • fail builds when message growth becomes unbounded
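
A pytest sketch of that CI guard; rough_token_count, trim_history, and MAX_PROMPT_TOKENS are illustrative stand-ins for your own helpers:

MAX_PROMPT_TOKENS = 6_000

def rough_token_count(messages):
    # Crude proxy: ~4 characters per token is enough for a budget guard.
    return sum(len(m["content"]) for m in messages) // 4

def trim_history(messages, keep_last=10):
    # Keep the system message plus the most recent turns.
    return messages[:1] + messages[1:][-keep_last:]

def test_prompt_stays_bounded():
    messages = [{"role": "system", "content": "You are a support analyst."}]
    for i in range(50):  # simulate a long-running session
        messages.append({"role": "user", "content": f"Turn {i}: summarize this report again"})
        messages.append({"role": "assistant", "content": "Summary of the report. " * 40})
        messages = trim_history(messages)
        assert rough_token_count(messages) < MAX_PROMPT_TOKENS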

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

