How to Fix 'token limit exceeded in production' in AutoGen (Python)
What the error means
A "token limit exceeded" error in production usually means your AutoGen agent is sending too much conversation history, tool output, or retrieved context into a single LLM call. In practice, it shows up after a few turns in a group chat, in a long-running assistant session, or when a tool dumps a large payload into the message stream.
The failure is rarely in the model itself. It’s almost always in how you’re building messages, how AutoGen is retaining state, or how much text your tools are returning.
The Most Common Cause
The #1 cause is unbounded chat history. In AutoGen, AssistantAgent, UserProxyAgent, and GroupChatManager can keep appending messages until the prompt exceeds the model’s context window.
Here’s the broken pattern:

```python
from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={"config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}]},
)
user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
)

# Broken: reusing the same conversation forever, so history grows every turn
for i in range(100):
    user.initiate_chat(
        assistant,
        message=f"Turn {i}: summarize this report again",
        clear_history=False,  # carries every previous turn into each new call
    )
```
And here’s the fixed pattern:

```python
from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={
        "config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}],
        "max_tokens": 800,
    },
)
user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
)

# Fixed: reset or bound history between runs
for i in range(100):
    user.reset()
    assistant.reset()
    user.initiate_chat(assistant, message=f"Turn {i}: summarize this report again")
```
If you’re using GroupChat, the same issue applies. A manager that keeps every message from every agent will blow up fast if you don’t trim history.
| Broken | Fixed |
|---|---|
| Reuse the same agent state forever | Reset state between jobs |
| Append every turn to history | Trim or summarize old turns |
| Let tool output flow back unfiltered | Cap tool response size |
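If you can’t reset between jobs, a simple bound on history works too. A minimal sketch in plain Python (the `trim_history` helper and the 6-message window are illustrative choices, not an AutoGen API):

```python
def trim_history(messages, keep_last=6):
    """Keep the system message (if any) plus the last `keep_last` messages."""
    system = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    return system + rest[-keep_last:]

# Simulate a conversation that has grown to 50 turns
history = [{"role": "system", "content": "You are helpful."}]
history += [{"role": "user", "content": f"turn {i}"} for i in range(50)]

bounded = trim_history(history, keep_last=6)
print(len(bounded))  # 7: the system message plus the last 6 turns
```

The same idea extends to summarizing the dropped turns into a single message instead of discarding them outright.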
Other Possible Causes
1) Tool output is too large
A common failure mode is a tool returning raw JSON, logs, HTML, or database rows directly into the chat.
```python
def get_customer_records():
    return huge_json_blob  # bad: raw dump flows straight into the next prompt

def get_customer_records():
    data = huge_json_blob[:4000]  # better: truncate or summarize first
    return data
```
If your tool returns megabytes of text, AutoGen will happily pass that back into the next LLM call until you hit:

```text
openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is ..."}}
```
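A defensive wrapper around tool output keeps oversized payloads from ever reaching the model. A minimal sketch (the `cap_tool_output` helper and the 4,000-character cap are illustrative choices):

```python
def cap_tool_output(text, max_chars=4000, marker="\n...[truncated]"):
    """Truncate oversized tool output before it re-enters the chat."""
    if len(text) <= max_chars:
        return text
    return text[: max_chars - len(marker)] + marker

blob = "x" * 2_000_000  # e.g. a raw database dump
safe = cap_tool_output(blob)
print(len(safe))  # 4000
```

Wrapping every tool in one place like this is easier to audit than trusting each tool to truncate itself.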
2) Retrieval returns too many chunks
If you’re using RAG with RetrieveUserProxyAgent, over-retrieval can flood the prompt.
```python
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent

rag = RetrieveUserProxyAgent(
    name="rag",
    retrieve_config={
        "task": "qa",
        "docs_path": "./docs",
        "chunk_token_size": 1200,
        "top_k": 20,  # too high for long docs
    },
)
```
Reduce top_k, shrink chunk size, and filter by relevance before injecting context.
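Relevance filtering can be sketched as a pre-injection pass over retrieved chunks. The helper below is illustrative, not AutoGen behavior: the `score` field, the token budget, and the rough 4-characters-per-token estimate are all assumptions.

```python
def select_chunks(chunks, budget_tokens=3000, min_score=0.5):
    """Keep the highest-scoring chunks that fit within a rough token budget."""
    picked, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = len(chunk["text"]) // 4  # crude ~4 chars/token estimate
        if chunk["score"] < min_score or used + cost > budget_tokens:
            continue
        picked.append(chunk)
        used += cost
    return picked

chunks = [
    {"text": "a" * 4000, "score": 0.9},   # relevant, fits
    {"text": "b" * 4000, "score": 0.3},   # below the relevance threshold
    {"text": "c" * 20000, "score": 0.8},  # relevant but too large for the budget
]
picked = select_chunks(chunks)
print(len(picked))  # 1: only the first chunk survives both filters
```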
3) System prompt is bloated
I see this a lot in production code: someone pastes policy docs, schemas, examples, and runbooks into the system message.
```python
assistant = AssistantAgent(
    name="assistant",
    system_message=open("everything.md").read(),  # bad if the file is huge
)
```
Keep system prompts tight. Put stable rules there and move large reference material to retrieval or external lookup.
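One way to move reference material out of the system prompt, sketched with a hypothetical `lookup_policy` function and in-memory sections (in practice this would be a registered tool or a retrieval call):

```python
# Tight system prompt: stable rules only, no pasted documents
SYSTEM_PROMPT = (
    "You are a claims assistant. Follow the firm's style guide. "
    "When you need policy details, call the lookup tool instead of guessing."
)

# Illustrative stand-in for the bulky reference material
POLICY_SECTIONS = {
    "refunds": "Refunds are issued within 14 days of approval.",
    "escalation": "Escalate any claim above $10,000 to a human reviewer.",
}

def lookup_policy(topic: str) -> str:
    """Return one small section on demand instead of pasting everything.md."""
    return POLICY_SECTIONS.get(topic, "No policy section found for that topic.")

print(lookup_policy("refunds"))
```

The prompt stays a few hundred tokens regardless of how large the policy corpus grows.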
4) Model context window mismatch
Sometimes your app assumes one model but production points to another with a smaller context window.
```python
llm_config = {
    "config_list": [
        {"model": "gpt-4o-mini", "api_key": "..."},
        {"model": "gpt-3.5-turbo", "api_key": "..."},  # smaller context window if selected
    ]
}
```
If your fallback model has less context than your primary model, you’ll only see failures when routing switches under load.
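You can guard against this by budgeting for the smallest model that routing could select. A sketch (the window sizes are hard-coded assumptions here; verify them against your provider's current documentation):

```python
# Assumed context windows in tokens; check your provider's docs for current values
CONTEXT_WINDOWS = {"gpt-4o-mini": 128_000, "gpt-3.5-turbo": 16_385}

def max_safe_prompt_tokens(config_list, reply_budget=800):
    """Budget against the smallest model the fallback chain could route to."""
    smallest = min(CONTEXT_WINDOWS[c["model"]] for c in config_list)
    return smallest - reply_budget

config_list = [{"model": "gpt-4o-mini"}, {"model": "gpt-3.5-turbo"}]
print(max_safe_prompt_tokens(config_list))  # 15585, set by the smaller model
```

If this number is smaller than your typical prompt, the fallback model will fail under load even when the primary never does.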
How to Debug It
- Print token usage per turn
  - Log prompt size before each LLM call.
  - If you're using OpenAI-compatible APIs, capture request payload length and response metadata.
  - Watch for growth after each round trip.
- Inspect what AutoGen is actually sending
  - Dump `messages` before the call.
  - Look for repeated assistant replies, duplicated tool outputs, or giant retrieved chunks.
  - The problem is usually obvious once you see the raw history.
- Isolate one cause at a time
  - Disable tools first.
  - Disable retrieval next.
  - Then test with a single short system prompt.
  - If the error disappears, re-enable features one by one until it returns.
- Check model limits in production config
  - Verify the exact deployed model name.
  - Confirm fallback models and proxy routes.
  - Make sure your token budget matches the smallest model in the chain.
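The first debugging step above can be as simple as a logging helper. A sketch using a rough ~4-characters-per-token estimate (swap in a real tokenizer such as tiktoken for exact counts; the helper names are illustrative):

```python
def estimate_tokens(messages):
    """Rough estimate (~4 chars per token); use a real tokenizer for exact counts."""
    return sum(len(m.get("content", "")) for m in messages) // 4

def log_prompt_size(messages, turn):
    tokens = estimate_tokens(messages)
    print(f"turn={turn} messages={len(messages)} est_tokens={tokens}")
    return tokens

# Watch the estimate climb as history accumulates
history = []
for turn in range(3):
    history.append({"role": "user", "content": "summarize this report again " * 20})
    log_prompt_size(history, turn)
```

Steadily climbing numbers across turns are the signature of unbounded history; a single sudden jump usually points at a tool or retrieval payload instead.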
Prevention
- Keep conversation state bounded:
  - reset agents between jobs
  - summarize older turns
  - trim stale messages before each call
- Put hard limits on tools:
  - truncate large outputs
  - return summaries instead of raw dumps
  - cap retrieval `top_k` and chunk sizes
- Add token-budget tests in CI:
  - simulate long conversations
  - assert prompt size stays under a threshold
  - fail builds when message growth becomes unbounded
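A token-budget test like the ones above can be sketched in a few lines (names such as `simulate_turns`, the 12-message bound, and the budget are illustrative; wire the real production trimming code into the test instead):

```python
PROMPT_TOKEN_BUDGET = 6000

def simulate_turns(n_turns, per_turn_text="summarize this report again"):
    """Build a long conversation, applying the same bound production applies."""
    history = []
    for i in range(n_turns):
        history.append({"role": "user", "content": f"Turn {i}: {per_turn_text}"})
        history = history[-12:]  # the trimming rule under test
    return history

def test_prompt_stays_under_budget():
    history = simulate_turns(200)
    est_tokens = sum(len(m["content"]) for m in history) // 4
    assert est_tokens < PROMPT_TOKEN_BUDGET, f"prompt grew to ~{est_tokens} tokens"

test_prompt_stays_under_budget()
```

If someone later removes the trimming rule, the simulated 200-turn conversation blows past the budget and the build fails before production does.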
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.