How to Fix 'intermittent 500 errors in production' in AutoGen (TypeScript)

By Cyprian Aarons, updated 2026-04-21


Intermittent 500 errors in AutoGen TypeScript usually mean your agent graph is fine locally, but one of the runtime edges is failing under real traffic. In practice, this shows up when an agent call sometimes returns a server-side exception from your app, not from OpenAI directly.

The pattern is usually: requests work in dev, then production traffic introduces concurrency, missing config, bad tool output, or timeouts. With AutoGen, the failure often bubbles up as something like Internal Server Error, AgentRuntimeError, or a plain 500 from your API route.

The Most Common Cause

The #1 cause I see is shared mutable state inside an agent/tool handler. In TypeScript, people often reuse one AssistantAgent, one runtime, or one in-memory session object across concurrent requests, then mutate it per request.

That works until two requests overlap and one request corrupts the other’s context. The result is intermittent failures that look random.

Broken pattern: reuse a singleton agent/runtime and mutate request-specific state.
Fixed pattern: create per-request state or isolate sessions by conversation ID.

// ❌ Broken: shared mutable state
import { AssistantAgent } from "@autogen/core";

const agent = new AssistantAgent({
  name: "support-agent",
  modelClient,
});

let currentUserId: string | undefined;

export async function POST(req: Request) {
  const body = await req.json();
  currentUserId = body.userId;

  // This tool reads shared state that can be overwritten by another request.
  agent.registerTool("lookupAccount", async () => {
    return db.accounts.findByUserId(currentUserId!);
  });

  const result = await agent.run(body.message);
  return Response.json(result);
}

// ✅ Fixed: request-scoped state
import { AssistantAgent } from "@autogen/core";

export async function POST(req: Request) {
  const body = await req.json();
  const userId = body.userId;

  const agent = new AssistantAgent({
    name: "support-agent",
    modelClient,
  });

  agent.registerTool("lookupAccount", async () => {
    return db.accounts.findByUserId(userId);
  });

  const result = await agent.run(body.message);
  return Response.json(result);
}

If you need a long-lived agent, keep it stateless and move request data into the message payload or a session store keyed by conversation ID. Don’t let one request overwrite another request’s execution context.
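
As a rough sketch of the session-store idea, the example below keys state by conversation ID and passes it into a stateless function on every call. It does not use any AutoGen API: `Session`, `getSession`, `runStatelessAgent`, and `handleRequest` are hypothetical names, and the in-memory `Map` stands in for whatever store you actually use (Redis, a database).

```typescript
// Hypothetical types and helpers for illustration; not part of any AutoGen API.
type Session = { userId: string; history: string[] };

// In-memory store keyed by conversation ID. A real deployment would likely
// use Redis or a database so state survives restarts and scales out.
const sessions = new Map<string, Session>();

function getSession(conversationId: string, userId: string): Session {
  let session = sessions.get(conversationId);
  if (!session) {
    session = { userId, history: [] };
    sessions.set(conversationId, session);
  }
  return session;
}

// Stand-in for a long-lived, stateless agent: everything it needs arrives
// as arguments, so overlapping calls cannot read each other's data.
async function runStatelessAgent(session: Session, message: string): Promise<string> {
  // Simulate variable model latency so concurrent calls interleave.
  await new Promise((resolve) => setTimeout(resolve, Math.random() * 20));
  session.history.push(message);
  return `reply for ${session.userId}: ${message}`;
}

async function handleRequest(
  conversationId: string,
  userId: string,
  message: string,
): Promise<string> {
  const session = getSession(conversationId, userId);
  return runStatelessAgent(session, message);
}
```

Because no module-level variable holds per-request data, running `handleRequest` for two conversations inside `Promise.all` gives each caller a reply built from its own session, regardless of which call finishes first.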

Other Possible Causes

1) Tool exceptions are escaping uncaught

A tool throws, the exception propagates out of the agent run, and your API route returns a generic 500.

agent.registerTool("getPolicy", async ({ policyId }) => {
  const policy = await db.policies.findUnique({ where: { id: policyId } });
  if (!policy) throw new Error(`Policy not found: ${policyId}`);
  return policy;
});

Fix it by returning structured errors or catching and converting to safe tool output.

agent.registerTool("getPolicy", async ({ policyId }) => {
  try {
    const policy = await db.policies.findUnique({ where: { id: policyId } });
    if (!policy) return { ok: false, error: "POLICY_NOT_FOUND" };
    return { ok: true, policy };
  } catch (err) {
    // Keep the real error in server logs; the agent only sees a stable code.
    console.error("getPolicy lookup failed", { policyId, err });
    return { ok: false, error: "DB_LOOKUP_FAILED" };
  }
});
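
To avoid repeating that try/catch in every tool, the conversion can be factored into a small wrapper. This is a sketch in plain TypeScript, not an AutoGen feature: `ToolResult`, `safeTool`, and the `POLICY_LOOKUP_FAILED` code are hypothetical, and the lookup is a fake that throws for unknown IDs.

```typescript
// Hypothetical result type; the { ok, error } shape mirrors the example above.
type ToolResult<T> = { ok: true; data: T } | { ok: false; error: string };

// Wraps a tool handler so thrown errors become structured output
// instead of escaping into the agent run.
function safeTool<I, O>(
  toolName: string,
  errorCode: string,
  handler: (input: I) => Promise<O>,
): (input: I) => Promise<ToolResult<O>> {
  return async (input) => {
    try {
      return { ok: true, data: await handler(input) };
    } catch (err) {
      // Log the real error server-side; return only a stable code to the agent.
      console.error(`tool ${toolName} failed`, { input, err });
      return { ok: false, error: errorCode };
    }
  };
}

// Usage with a fake lookup that throws for anything except "p-1".
const getPolicySafe = safeTool(
  "getPolicy",
  "POLICY_LOOKUP_FAILED",
  async (input: { policyId: string }) => {
    if (input.policyId !== "p-1") {
      throw new Error(`Policy not found: ${input.policyId}`);
    }
    return { id: input.policyId, holder: "demo" };
  },
);
```

The wrapped function never rejects, so whatever registers it as a tool only ever sees one of the two `ToolResult` shapes.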

2) Timeout mismatch between AutoGen and your server

Your serverless function times out before the model call finishes. This often looks intermittent because only the slower prompts hit the limit.

export const maxDuration = 10; // too low for multi-step agent runs

Raise the timeout or reduce steps:

export const maxDuration = 60;

If you’re using a custom fetch client, also set explicit timeouts:

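const controller = new AbortController();
// Abort slightly before the 60s platform limit so the route can still respond.
const timer = setTimeout(() => controller.abort(), 55_000);

try {
  await modelClient.create({
    messages,
    signal: controller.signal,
  });
} finally {
  // Clear the timer once the call settles so it does not fire later.
  clearTimeout(timer);
}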
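
The abort pattern can also be wrapped in a reusable helper. The sketch below uses only the standard `AbortController` and `setTimeout`; `withTimeout` and `fakeModelCall` are hypothetical names, and the fake call stands in for any client that honors an `AbortSignal`. If your model client ignores the signal, this helper will not cut the call short.

```typescript
// Runs a task with an AbortSignal that fires after `ms` milliseconds,
// and always clears the timer once the task settles.
async function withTimeout<T>(
  ms: number,
  task: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await task(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}

// Stand-in for a model call that respects the abort signal.
function fakeModelCall(durationMs: number, signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve("done"), durationMs);
    signal.addEventListener(
      "abort",
      () => {
        clearTimeout(timer);
        reject(new Error("model call aborted"));
      },
      { once: true },
    );
  });
}
```

Keeping the helper's budget a few seconds under the platform limit leaves room to return a controlled error response instead of letting the platform kill the function.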

3) Invalid message/tool schema

AutoGen TypeScript is strict about message shapes. A malformed tool result can blow up at runtime with errors like ValidationError or Unexpected tool response format.

// ❌ Wrong shape
return { data: "ok" };

Use the exact structure your agent expects:

// ✅ Consistent structured output
return { ok: true, data: "ok" };

Also verify any JSON schema passed to tools matches the runtime payload exactly.
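
One way to catch a shape mismatch before it reaches the agent is a runtime type guard on tool output. This is a hand-written sketch with no schema library; `ToolOutput`, `isToolOutput`, and `assertToolOutput` are hypothetical names, and the `{ ok, data }` / `{ ok, error }` shape is assumed from the examples in this article rather than from an AutoGen contract.

```typescript
// Assumed output contract for illustration.
type ToolOutput = { ok: true; data: string } | { ok: false; error: string };

// Runtime check that a value matches the ToolOutput shape.
function isToolOutput(value: unknown): value is ToolOutput {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  if (v.ok === true) return typeof v.data === "string";
  if (v.ok === false) return typeof v.error === "string";
  return false;
}

// Throws a descriptive error at the tool boundary instead of letting a
// malformed value fail somewhere deeper in the run.
function assertToolOutput(toolName: string, value: unknown): ToolOutput {
  if (!isToolOutput(value)) {
    throw new Error(
      `Tool ${toolName} returned an unexpected shape: ${JSON.stringify(value)}`,
    );
  }
  return value;
}
```

Calling `assertToolOutput` on the return value of each tool turns a vague downstream failure into an error that names the tool and shows the offending payload.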

4) Missing environment variables in production

This one causes classic “works locally” behavior. Your local shell has OPENAI_API_KEY, but production does not.

const apiKey = process.env.OPENAI_API_KEY!;

That non-null assertion hides the problem until runtime. Use explicit validation at startup:

const apiKey = process.env.OPENAI_API_KEY;
if (!apiKey) throw new Error("Missing OPENAI_API_KEY");
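
With more than one variable, the same check can report everything that is missing in a single error. A minimal sketch: `AppConfig` and `loadConfig` are hypothetical names, and `DATABASE_URL` is an assumed second variable added only to show the pattern.

```typescript
// Hypothetical config shape; adjust to the variables your app actually needs.
type AppConfig = { openaiApiKey: string; databaseUrl: string };

// Takes the environment as an argument so it can be tested without
// touching the real process environment.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const required = ["OPENAI_API_KEY", "DATABASE_URL"] as const;
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(
      `Missing required environment variables: ${missing.join(", ")}`,
    );
  }
  return {
    openaiApiKey: env.OPENAI_API_KEY as string,
    databaseUrl: env.DATABASE_URL as string,
  };
}

// At module load, outside any request handler:
//   const config = loadConfig(process.env);
```

Running this at module load makes a misconfigured deployment fail immediately and visibly, rather than on the first request that happens to need the key.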

How to Debug It

• Check whether the failure happens before or after the model call
  - Add logs around every boundary:
```
console.log("before tool");
console.log("before model call");
console.log("after model call");
```
If it fails before the model call, it’s usually your app code or tool setup.
• Wrap each tool in its own try/catch
  - Log the full error and input payload.
  - Look for TypeError, DB errors, or serialization issues.
```
try {
  return await riskyTool(input);
} catch (err) {
  console.error("tool failed", { input, err });
  throw err;
}
```
• Run with one request at a time
  - Disable concurrency in your load test.
  - If the error disappears, you likely have shared mutable state or race conditions.
• Inspect production logs for exact class names
  - Search for:
    - AgentRuntimeError
    - ValidationError
    - AbortError
    - Internal Server Error
  - The class name usually tells you whether this is a timeout, schema mismatch, or uncaught tool failure.
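
To make the class name easy to search for, the route's catch block can log a normalized record. The sketch below maps the error names listed above to a rough category. The mapping is an assumption for illustration, not something AutoGen defines, and `FailureKind`, `ClassifiedFailure`, and `classifyFailure` are hypothetical names.

```typescript
type FailureKind = "timeout" | "schema" | "tool_or_runtime" | "unknown";

interface ClassifiedFailure {
  kind: FailureKind;
  errorName: string;
  message: string;
}

// Maps an arbitrary thrown value to a searchable record.
// Relies on err.name, so custom error classes must set this.name.
function classifyFailure(err: unknown): ClassifiedFailure {
  const errorName = err instanceof Error ? err.name : typeof err;
  const message = err instanceof Error ? err.message : String(err);

  let kind: FailureKind;
  switch (errorName) {
    case "AbortError":
      kind = "timeout";
      break;
    case "ValidationError":
      kind = "schema";
      break;
    case "AgentRuntimeError":
      kind = "tool_or_runtime";
      break;
    default:
      kind = "unknown";
  }
  return { kind, errorName, message };
}

// In a route handler's catch block (sketch):
//   console.error("agent run failed", classifyFailure(err));
```

Logging `kind` and `errorName` as separate fields lets you count failures per category instead of grepping free-text messages.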

Prevention

• Keep agents stateless; store conversation state outside the agent instance.
• Validate env vars and tool inputs at startup instead of failing mid-request.
• Put every external dependency behind retries and bounded timeouts:
  - DB calls
  - HTTP tools
  - file/network access
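
A rough sketch of what "retries and bounded timeouts" can look like in plain TypeScript follows. `sleep`, `withDeadline`, and `retryWithTimeout` are hypothetical helpers, and the attempt count, timeout, and backoff values are placeholders to tune. Note that `Promise.race` only stops waiting; it does not cancel the underlying operation.

```typescript
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Rejects if the operation takes longer than timeoutMs.
// The underlying work keeps running; only the wait is bounded.
function withDeadline<T>(operation: () => Promise<T>, timeoutMs: number): Promise<T> {
  const pending = operation();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  return Promise.race([pending, deadline]).finally(() => clearTimeout(timer));
}

// Retries a failing operation with exponential backoff between attempts.
async function retryWithTimeout<T>(
  operation: () => Promise<T>,
  options: { attempts: number; timeoutMs: number; baseDelayMs: number },
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.attempts; attempt++) {
    try {
      return await withDeadline(operation, options.timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt < options.attempts) {
        // Backoff doubles each time: baseDelayMs, 2x, 4x, ...
        await sleep(options.baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  throw lastError;
}
```

Only retry operations that are safe to repeat, such as reads or idempotent writes, and keep the total worst-case time (attempts, timeouts, and backoff combined) under your route's own time limit.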

If you’re seeing intermittent 500s in AutoGen TypeScript, start by removing shared state from your agent and tools. That fixes more production incidents than any other change I’ve seen in this stack.

By Cyprian Aarons, AI Consultant at Topiax.
