How to Fix 'context length exceeded' in Production in AutoGen (Python)
A 'context length exceeded' error in AutoGen means the model received more tokens than its context window can hold. In production, this usually shows up after a few turns of agent-to-agent chat, after a large tool output dump, or when you keep appending full transcripts to every new request.
The failure is usually not in the model itself. It’s in how you’re building messages, carrying state, or letting AutoGen replay too much history.
The Most Common Cause
The #1 cause is unbounded chat history being passed back into the next AssistantAgent call. With AutoGen, this often happens when you keep using the same messages list, or when an agent conversation keeps accumulating tool output and prior turns until the next OpenAI request throws something like:
- `openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is ..."}}`
- `context_length_exceeded`
- `InvalidRequestError: This model's maximum context length is exceeded`
Here’s the broken pattern:
| Broken | Fixed |
|---|---|
| Reuses full history forever | Trims or summarizes before each call |
```python
# BROKEN: unbounded message growth
from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={"config_list": [{"model": "gpt-4o-mini"}]},
)
user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
)

messages = []
for i in range(100):
    messages.append({"role": "user", "content": f"Turn {i}: analyze this claim file..."})
    reply = assistant.generate_reply(messages=messages)
    messages.append({"role": "assistant", "content": reply})
```
```python
# FIXED: trim or summarize history before sending it back
from autogen import AssistantAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={"config_list": [{"model": "gpt-4o-mini"}]},
)

def build_context(latest_user_message: str, summary: str = ""):
    messages = []
    if summary:
        messages.append({
            "role": "system",
            "content": f"Conversation summary so far:\n{summary}",
        })
    messages.append({"role": "user", "content": latest_user_message})
    return messages

summary = ""
for i in range(100):
    msgs = build_context(f"Turn {i}: analyze this claim file...", summary)
    reply = assistant.generate_reply(messages=msgs)
    # Update the summary separately; don't keep appending the raw transcript.
    # str() guards against generate_reply returning a dict instead of a string.
    summary = f"Latest turn {i}: {str(reply)[:500]}"
```
If you’re using GroupChat, the same issue appears faster because multiple agents add their own verbose responses. A single long tool result can push the whole conversation over the limit.
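If you are on the pyautogen 0.2.x API, you may not need to hand-roll trimming at all. Here is a sketch using the built-in `TransformMessages` capability, assuming a pyautogen version recent enough to ship the contrib capabilities module (roughly 0.2.16+); the limits shown are illustrative:

```python
# Sketch: cap history with pyautogen's built-in message transforms
# (assumes pyautogen >= 0.2.16; limit values are illustrative).
from autogen import AssistantAgent
from autogen.agentchat.contrib.capabilities import transform_messages, transforms

assistant = AssistantAgent(
    name="assistant",
    llm_config={"config_list": [{"model": "gpt-4o-mini"}]},
)

context_handling = transform_messages.TransformMessages(
    transforms=[
        transforms.MessageHistoryLimiter(max_messages=10),  # keep only the last 10 messages
        transforms.MessageTokenLimiter(max_tokens=8000),    # cap total tokens per request
    ]
)
context_handling.add_to_agent(assistant)  # transforms run before each LLM call
```

This also helps in GroupChat setups, because each agent the capability is attached to trims its own view of the conversation.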
Other Possible Causes
1) Tool output is too large
A common production bug is returning entire files, logs, SQL dumps, or API payloads from tools and feeding them straight back to the model.
```python
# BAD: returning huge raw payloads
def fetch_claim_history(claim_id: str):
    return open(f"/data/claims/{claim_id}.json").read()
```
Fix it by truncating and extracting only what matters.
```python
# GOOD: return a compact summary
def fetch_claim_history(claim_id: str):
    with open(f"/data/claims/{claim_id}.json") as f:
        raw = f.read()
    return raw[:4000]  # better: parse and summarize specific fields, as below
```
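If you would rather parse than truncate, a minimal sketch might look like this. The `events`, `date`, `type`, and `status` fields are hypothetical stand-ins for whatever your claim schema actually contains:

```python
import json

# Sketch: extract only the fields the agent needs
# (field names are hypothetical; adapt to your actual claim schema).
def fetch_claim_history(claim_id: str) -> str:
    with open(f"/data/claims/{claim_id}.json") as f:
        claim = json.load(f)
    events = claim.get("events", [])
    recent = [
        f"{e.get('date')}: {e.get('type')} ({e.get('status')})"
        for e in events[-5:]  # only the five most recent events
    ]
    return f"Claim {claim_id}: {len(events)} events total.\n" + "\n".join(recent)
```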
2) You are using a small-context model for a long workflow
Some teams run long multi-agent workflows on models with smaller windows and expect them to hold everything. That works for short chats, not for compliance review loops or claims triage chains.
```python
llm_config = {
    "config_list": [
        {"model": "gpt-4o-mini"}  # may be too small for long transcripts
    ]
}
```
Switch to a larger context model when the task needs it.
```python
llm_config = {
    "config_list": [
        {"model": "gpt-4.1"}  # larger context window
    ]
}
```
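If you keep both models configured, one option is to tag configs and filter per workflow. This sketch uses pyautogen's `filter_config` helper; the tag names are purely illustrative:

```python
import autogen

# Sketch: tag configs and route long workflows to the larger model
# (tag names are illustrative).
config_list = [
    {"model": "gpt-4.1", "tags": ["long-context"]},
    {"model": "gpt-4o-mini", "tags": ["cheap"]},
]

long_context_configs = autogen.filter_config(config_list, {"tags": ["long-context"]})
llm_config = {"config_list": long_context_configs}
```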
3) reflect_on_tool_use=True is amplifying tokens
In AutoGen, reflection makes the assistant re-read tool outputs and generate another verbose reasoning pass over them. That's useful, but expensive. Note that `reflect_on_tool_use` is a parameter of the newer autogen-agentchat (v0.4+) `AssistantAgent`, which takes a `model_client` rather than `llm_config`:
```python
from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.openai import OpenAIChatCompletionClient

assistant = AssistantAgent(
    name="assistant",
    model_client=OpenAIChatCompletionClient(model="gpt-4o-mini"),
    reflect_on_tool_use=True,
)
```
If your tool outputs are already large, disable reflection or make tool results smaller.
```python
assistant = AssistantAgent(
    name="assistant",
    model_client=OpenAIChatCompletionClient(model="gpt-4o-mini"),
    reflect_on_tool_use=False,
)
```
4) Your system prompt is bloated
I’ve seen production prompts with policy text, SOPs, examples, schema docs, and JSON blobs all stuffed into one system message. That eats context before the user even says anything.
system_prompt = """
You are a claims assistant.
[12 pages of policy text]
[40-line schema]
[example conversations]
[full API contract]
"""
Keep system prompts tight and move reference material into retrieval or tools.
system_prompt = """
You are a claims assistant.
Use tools for policy lookup.
Return concise answers with citations.
"""
How to Debug It
- Measure token growth per turn
  - Log message count and approximate token count before every `generate_reply()` call (see the token-counting sketch after this list).
  - If it grows monotonically without bound, you found your problem.
- Inspect tool outputs
  - Print the size of every tool result.
  - If one function returns megabytes of JSON or logs, truncate it before sending it to the agent.
- Check which model is actually configured
  - In AutoGen deployments with multiple configs, confirm the active model in `llm_config["config_list"]`.
  - A fallback to a smaller-context model can make an otherwise stable workflow fail in production.
- Disable features one at a time
  - Turn off `reflect_on_tool_use`.
  - Reduce `max_consecutive_auto_reply`.
  - Remove long system prompts.
  - If the error disappears after one change, that component was contributing to token bloat.
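For the token measurement step, a rough counter is enough to spot unbounded growth. This sketch uses tiktoken; the per-message overhead constant is an approximation, not an exact accounting:

```python
import tiktoken

# Sketch: approximate token count for a chat message list.
def estimate_tokens(messages, model="gpt-4o-mini"):
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")  # fallback encoding
    # ~4 tokens of per-message overhead is a common rough estimate.
    return sum(len(enc.encode(m.get("content") or "")) + 4 for m in messages)

# Before every generate_reply() call:
# print(f"messages={len(messages)} ~tokens={estimate_tokens(messages)}")
```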
Prevention
- Use summarization checkpoints after every N turns instead of replaying full transcripts (a minimal sketch follows this list).
- Keep tool outputs structured and short; return IDs, summaries, or top results instead of raw dumps.
- Set hard limits on conversation length in production workflows and rotate state into storage when needed.
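As a minimal sketch of the checkpoint idea, assuming `generate_reply` returns plain text in your setup, the helper below collapses the transcript every N turns; `SUMMARIZE_EVERY` and the prompt wording are illustrative:

```python
# Sketch: collapse the transcript into a summary every N turns
# (SUMMARIZE_EVERY and the prompt wording are illustrative).
SUMMARIZE_EVERY = 10

def checkpoint(assistant, messages):
    prompt = messages + [{
        "role": "user",
        "content": "Summarize the conversation so far in under 200 words.",
    }]
    summary = assistant.generate_reply(messages=prompt)
    # Restart the transcript from the summary alone.
    return [{"role": "system", "content": f"Summary of earlier turns:\n{summary}"}]

# Inside the conversation loop:
# if i > 0 and i % SUMMARIZE_EVERY == 0:
#     messages = checkpoint(assistant, messages)
```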
If you want this stable under load, treat context like memory management. AutoGen will happily keep talking until your prompt budget runs out; your job is to stop it before OpenAI does.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.