How to Fix 'token limit exceeded in production' in AutoGen (TypeScript)

By Cyprian Aarons · Updated 2026-04-21
Tags: token-limit-exceeded-in-production, autogen, typescript

Token limit exceeded in production usually means your AutoGen agent is sending more context to the model than the model can accept. In TypeScript, this tends to show up after a few turns, when you keep appending messages, tool outputs, or retrieved documents into the same conversation loop.

The failure is usually not the model itself. It’s your message assembly, memory growth, or tool output strategy.

The Most Common Cause

The #1 cause is unbounded chat history. You keep passing the full messages array back into AssistantAgent, so every turn includes all prior turns plus tool output plus summaries.

A typical error looks like this:

  • Error: token limit exceeded
  • 400 Bad Request: This model's maximum context length is 128000 tokens
  • OpenAIError: Request too large for gpt-4o
  • In AutoGen flows, you may also see failures during run() or generateReply() on AssistantAgent

Broken vs fixed pattern

Broken                               Fixed
Reuses full history forever          Trims history or uses bounded memory
Appends raw tool output directly     Summarizes or extracts only needed fields
No token budgeting                   Enforces a max context per turn

// BROKEN: history grows without bound
import { AssistantAgent } from "@autogen/core";

const agent = new AssistantAgent({
  name: "support-agent",
  modelClient, // model client (e.g. OpenAIChatCompletionClient) configured elsewhere
});

// full transcript: grows every turn and is replayed in full
const messages: any[] = [];

for (const userInput of inputs) {
  messages.push({ role: "user", content: userInput });

  const result = await agent.run(messages);
  messages.push({ role: "assistant", content: result.messages.at(-1)?.content });
}
// FIXED: keep a bounded window and summarize older turns
import { AssistantAgent } from "@autogen/core";

const agent = new AssistantAgent({
  name: "support-agent",
  modelClient,
});

let messages: any[] = [];
let summary = "";

for (const userInput of inputs) {
  messages.push({ role: "user", content: userInput });

  // keep only recent turns
  const recentMessages = messages.slice(-8);

  const result = await agent.run([
    ...(summary ? [{ role: "system", content: `Conversation summary:\n${summary}` }] : []),
    ...recentMessages,
  ]);

  const last = result.messages.at(-1);
  if (last?.content) {
    messages.push({ role: "assistant", content: last.content });
  }

  // periodically fold older turns into a rolling summary
  // (summarizeOldTurns is a helper you write yourself; a sketch follows below)
  if (messages.length > 20) {
    summary = await summarizeOldTurns(summary, messages.slice(0, -8));
    messages = messages.slice(-8);
  }
}
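
The summarizeOldTurns helper is not part of AutoGen; you write it yourself. Here is a minimal sketch that reuses the same run() pattern with a dedicated summarizer agent; the agent name and prompt wording are illustrative, so adapt them to your domain:

// Sketch: compress older turns (plus any previous summary) into a short paragraph.
// Assumes the same AssistantAgent API used above; adjust to your model client.
const summarizer = new AssistantAgent({
  name: "summarizer",
  modelClient,
});

async function summarizeOldTurns(previousSummary: string, oldTurns: any[]): Promise<string> {
  const transcript = oldTurns
    .map((m) => `${m.role}: ${m.content ?? ""}`)
    .join("\n");

  const result = await summarizer.run([
    {
      role: "user",
      content:
        `Previous summary:\n${previousSummary || "(none)"}\n\n` +
        `New turns:\n${transcript}\n\n` +
        "Update the summary in under 150 words. Keep IDs, names, and open questions.",
    },
  ]);

  return String(result.messages.at(-1)?.content ?? previousSummary);
}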

If you are using RoundRobinGroupChat, SelectorGroupChat, or another multi-agent pattern, the same issue applies. Each agent contributes to the shared transcript, and that transcript gets replayed into future calls.
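
AutoGen's group chat classes manage the transcript internally, but if you assemble the shared history yourself, the same windowing idea applies. A framework-agnostic sketch (trimTranscript is our name, not an AutoGen API):

// Sketch: bound a shared multi-agent transcript before each model call.
// Keeps the first system message plus the most recent turns.
function trimTranscript(transcript: any[], maxTurns = 12): any[] {
  const system = transcript.filter((m) => m.role === "system").slice(0, 1);
  const rest = transcript.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-maxTurns)];
}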

Other Possible Causes

1. Tool output is too large

This happens when a function returns full JSON payloads, PDFs converted to text, logs, or entire database rows.

// BAD
return JSON.stringify(orderResponse); // could be huge

// BETTER
return JSON.stringify({
  orderId: orderResponse.orderId,
  status: orderResponse.status,
  updatedAt: orderResponse.updatedAt,
});

If your tool returns a list, return only top results or IDs. Don’t feed the model a full export unless you’ve chunked it.
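
The same idea for list-shaped results, as a hedged sketch (the field names id, title, and score are illustrative):

// Sketch: cap list-shaped tool output before it reaches the model.
// Return IDs so the model can request details with a follow-up tool call.
function capResults(
  results: Array<{ id: string; title: string; score: number }>,
  limit = 5
) {
  return results.slice(0, limit).map((r) => ({ id: r.id, title: r.title }));
}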

2. Retrieval injects too many documents

RAG pipelines often stuff too many chunks into the prompt. Three chunks might be fine; twenty chunks with long metadata will break production quickly.

const docs = await vectorStore.search(query, { k: 20 }); // risky: 20 chunks can blow the budget

// safer: keep only the strongest few chunks
const context = docs
  .slice(0, 4)
  .map((d) => d.pageContent)
  .join("\n\n");

Keep an eye on chunk size too. Four massive chunks can be worse than ten small ones.
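
If chunk sizes vary a lot, cap by characters rather than by chunk count. A minimal sketch; the 8,000-character budget is an arbitrary example to tune against your model:

// Sketch: pack chunks until a character budget is hit, instead of a fixed count.
function packContext(docs: { pageContent: string }[], maxChars = 8000): string {
  const parts: string[] = [];
  let used = 0;
  for (const d of docs) {
    if (used + d.pageContent.length > maxChars) break;
    parts.push(d.pageContent);
    used += d.pageContent.length;
  }
  return parts.join("\n\n");
}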

3. System prompt bloat

Teams often keep adding policy text, examples, escalation rules, and formatting instructions until the system prompt alone eats half the budget.

// BAD: giant system prompt with repeated policy text
import fs from "node:fs";

const systemPrompt = fs.readFileSync("./prompts/support-full.txt", "utf8");

// BETTER: split policy from task instructions and keep only what matters per route
const systemPrompt = `
You are a claims support assistant.
Use only approved policy data.
Ask for missing claim number before proceeding.
`;

If you need long policy content, retrieve it on demand instead of hardcoding it into every request.
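
One hedged way to do that is to expose policy lookup as a tool, so the model pulls only the section it needs per request. The tool shape and the policyStore call below are hypothetical; map them onto your AutoGen tool registration and your own retrieval backend:

// Hypothetical policy store; replace with your own retrieval backend.
declare const policyStore: { lookup(topic: string): Promise<string> };

// Sketch: fetch policy text on demand instead of embedding it in every prompt.
const getPolicySection = {
  name: "get_policy_section",
  description: "Fetch one policy section by topic, e.g. 'claims-escalation'.",
  run: async ({ topic }: { topic: string }) => {
    const section = await policyStore.lookup(topic);
    return section.slice(0, 2000); // cap what flows back into the context
  },
};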

4. Model context mismatch

Sometimes the code is fine but the deployed model has a smaller context window than your local setup. This shows up when staging works and production fails after a model swap.

const modelClient = new OpenAIChatCompletionClient({
  model: "gpt-4o-mini", // smaller budget than you expected in some deployments
});

Check what model your prod environment actually uses. Also verify whether your provider wrapper enforces a lower max input size than the base model.
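
A cheap guard is to compare each request against the window you believe the deployed model has. The limits map below is configuration you maintain from your provider's published numbers, not values the SDK exposes:

// Sketch: preflight check against a context window you maintain yourself.
const CONTEXT_LIMITS: Record<string, number> = {
  "gpt-4o": 128_000,
  "gpt-4o-mini": 128_000, // same window, but wrappers may cap it lower
};

function assertWithinBudget(model: string, estimatedTokens: number, reserveForOutput = 2_000) {
  const limit = CONTEXT_LIMITS[model];
  if (limit !== undefined && estimatedTokens + reserveForOutput > limit) {
    throw new Error(
      `Request (~${estimatedTokens} tokens) would exceed ${model}'s ${limit}-token window`
    );
  }
}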

How to Debug It

  1. Log token usage per turn

    • Print message count and approximate token count before each call.
    • If you see steady growth across turns, you found the leak.
  2. Inspect what gets sent to run()

    • Log the final payload right before calling AssistantAgent.run().
    • Look for repeated summaries, duplicated tool outputs, or giant arrays.
  3. Isolate tools

    • Disable tools one by one.
    • If the error disappears after removing one function, that tool is returning too much data (see the logging wrapper after this list).
  4. Check your deployed model and limits

    • Confirm the exact model name in production config.
    • Compare its context window against your longest real request.
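
For step 3, instead of disabling tools blindly, you can wrap each one and log how much it returns. A minimal sketch (withOutputLogging is our name, not an AutoGen API):

// Sketch: wrap a tool function and log the rough size of whatever it returns.
function withOutputLogging<T extends (...args: any[]) => Promise<unknown>>(
  name: string,
  fn: T
): T {
  return (async (...args: Parameters<T>) => {
    const out = await fn(...args);
    const chars = JSON.stringify(out)?.length ?? 0;
    console.log(`[tool:${name}] returned ~${Math.ceil(chars / 4)} tokens`);
    return out;
  }) as T;
}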

For step 1, a practical way to estimate token usage in TypeScript:

function roughTokenEstimate(text: string) {
  return Math.ceil(text.length / 4);
}

function estimateMessages(messages: { role: string; content?: string }[]) {
  return messages.reduce((sum, m) => sum + roughTokenEstimate(m.content ?? ""), 0);
}

console.log("Estimated tokens:", estimateMessages(messages));
console.log("Message count:", messages.length);

This is not exact tokenization, but it’s good enough to catch runaway growth fast.
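
If you need exact counts, a tokenizer library such as js-tiktoken can replace the heuristic; verify that your installed version supports your model before relying on it:

// Sketch: exact token counting with the third-party js-tiktoken package.
import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-4o");
const exactTokens = enc.encode("some message content").length;
console.log("Exact tokens:", exactTokens);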

Prevention

  • Keep a fixed-size conversation window and summarize older turns.
  • Cap tool outputs and RAG context before passing them into AutoGen.
  • Add preflight token checks in CI or before every production request (see the guard sketch below).
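
A hedged per-request guard, reusing the estimateMessages helper from the debugging section; the 100,000-token budget is an example to set from your own model's limits:

// Sketch: refuse to send a payload that blows the budget, instead of letting the API fail.
function guardRequest(
  messages: { role: string; content?: string }[],
  maxTokens = 100_000
) {
  const estimated = estimateMessages(messages);
  if (estimated > maxTokens) {
    throw new Error(`Preflight failed: ~${estimated} tokens exceeds budget of ${maxTokens}`);
  }
  return messages;
}

// usage: await agent.run(guardRequest(messagesForThisTurn));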

If you want one rule to remember: never let raw history grow forever inside an agent loop. AutoGen will do exactly what you ask it to do, and if you keep replaying everything, production will eventually hit the wall with token limit exceeded.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap? Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.