How to Fix 'intermittent 500 errors in production' in AutoGen (TypeScript)
Intermittent 500 errors in AutoGen TypeScript usually mean your agent graph is fine locally, but one of the runtime edges is failing under real traffic. In practice, this shows up when an agent call sometimes returns a server-side exception from your app, not from OpenAI directly.
The pattern is usually: requests work in dev, then production traffic introduces concurrency, missing config, bad tool output, or timeouts. With AutoGen, the failure often bubbles up as something like Internal Server Error, AgentRuntimeError, or a plain 500 from your API route.
The Most Common Cause
The #1 cause I see is shared mutable state inside an agent/tool handler. In TypeScript, people often reuse one AssistantAgent, one runtime, or one in-memory session object across concurrent requests, then mutate it per request.
That works until two requests overlap and one request corrupts the other’s context. The result is intermittent failures that look random.
| Broken pattern | Fixed pattern |
|---|---|
| Reuse a singleton agent/runtime and mutate request-specific state | Create per-request state or isolate sessions by conversation ID |
// ❌ Broken: shared mutable state
import { AssistantAgent } from "@autogen/core";

const agent = new AssistantAgent({
  name: "support-agent",
  modelClient,
});

let currentUserId: string | undefined;

export async function POST(req: Request) {
  const body = await req.json();
  currentUserId = body.userId;

  // This tool reads shared state that can be overwritten by another request.
  agent.registerTool("lookupAccount", async () => {
    return db.accounts.findByUserId(currentUserId!);
  });

  const result = await agent.run(body.message);
  return Response.json(result);
}
// ✅ Fixed: request-scoped state
import { AssistantAgent } from "@autogen/core";

export async function POST(req: Request) {
  const body = await req.json();
  const userId = body.userId;

  const agent = new AssistantAgent({
    name: "support-agent",
    modelClient,
  });

  agent.registerTool("lookupAccount", async () => {
    return db.accounts.findByUserId(userId);
  });

  const result = await agent.run(body.message);
  return Response.json(result);
}
If you need a long-lived agent, keep it stateless and move request data into the message payload or a session store keyed by conversation ID. Don’t let one request overwrite another request’s execution context.
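Here's a minimal sketch of what "a session store keyed by conversation ID" can look like. It assumes a plain in-memory Map; `SessionStore`, `SessionState`, and the `conversationId` values are hypothetical names for illustration, not AutoGen APIs, and a multi-instance deployment would likely need Redis or a database instead.

```typescript
// Hypothetical per-conversation state; not an AutoGen type.
interface SessionState {
  userId: string;
  history: string[];
}

// In-memory store keyed by conversation ID. Each conversation gets its own
// state object, so concurrent requests never share one mutable variable.
class SessionStore {
  private sessions = new Map<string, SessionState>();

  // Returns the existing session, or creates one on first use.
  getOrCreate(conversationId: string, userId: string): SessionState {
    let session = this.sessions.get(conversationId);
    if (!session) {
      session = { userId, history: [] };
      this.sessions.set(conversationId, session);
    }
    return session;
  }

  append(conversationId: string, message: string): void {
    const session = this.sessions.get(conversationId);
    if (!session) throw new Error(`Unknown conversation: ${conversationId}`);
    session.history.push(message);
  }
}

const store = new SessionStore();
const sessionA = store.getOrCreate("conv-a", "user-1");
const sessionB = store.getOrCreate("conv-b", "user-2");
store.append("conv-a", "hello");
```

The handler looks up state by the ID that arrives with the request, so nothing request-specific ever lives on the agent or in a module-level variable.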
Other Possible Causes
1) Tool exceptions are escaping uncaught
A tool throws, the exception propagates out of the agent run without useful context, and your API returns a generic 500.
agent.registerTool("getPolicy", async ({ policyId }) => {
  const policy = await db.policies.findUnique({ where: { id: policyId } });
  if (!policy) throw new Error(`Policy not found: ${policyId}`);
  return policy;
});
Fix it by returning structured errors or catching and converting to safe tool output.
agent.registerTool("getPolicy", async ({ policyId }) => {
  try {
    const policy = await db.policies.findUnique({ where: { id: policyId } });
    if (!policy) return { ok: false, error: "POLICY_NOT_FOUND" };
    return { ok: true, policy };
  } catch (err) {
    // Keep the real cause in your logs; return only a safe code to the model.
    console.error("getPolicy failed", { policyId, err });
    return { ok: false, error: "DB_LOOKUP_FAILED" };
  }
});
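If you have more than a couple of tools, repeating that try/catch gets tedious. Here's a rough sketch of a reusable wrapper that does the same conversion. `safeTool` and `ToolResult` are hypothetical helpers I'm introducing for illustration, and the lookup is a stand-in for a real database call.

```typescript
// Result shape following the { ok, error } convention used above.
type ToolResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

// Wraps an async tool handler so a thrown exception becomes structured output
// instead of escaping to the API route. Hypothetical helper, not an AutoGen API.
function safeTool<I, O>(
  errorCode: string,
  handler: (input: I) => Promise<O>,
): (input: I) => Promise<ToolResult<O>> {
  return async (input: I) => {
    try {
      return { ok: true, value: await handler(input) };
    } catch (err) {
      // Log the real cause server-side; return only a safe code to the model.
      console.error("tool failed", { errorCode, err });
      return { ok: false, error: errorCode };
    }
  };
}

// Stand-in for a database lookup that throws for unknown IDs.
const getPolicy = safeTool("POLICY_LOOKUP_FAILED", async (policyId: string) => {
  if (policyId !== "p-1") throw new Error(`Policy not found: ${policyId}`);
  return { id: policyId, holder: "Ada" };
});
```

The wrapped function never rejects, so whatever registers it only ever sees a `ToolResult`.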
2) Timeout mismatch between AutoGen and your server
Your serverless function times out before the model call finishes. This often looks intermittent because only slower prompts or longer multi-step runs hit the limit.
export const maxDuration = 10; // too low for multi-step agent runs
Raise the timeout or reduce steps:
export const maxDuration = 60;
If you’re using a custom fetch client, also set explicit timeouts:
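const controller = new AbortController();
// Abort a little before the 60-second server limit so the route can still respond.
const timer = setTimeout(() => controller.abort(), 55_000);
try {
  await modelClient.create({
    messages,
    signal: controller.signal,
  });
} finally {
  // Clear the timer so it does not fire after the call has settled.
  clearTimeout(timer);
}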
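The same idea can be packaged as a small helper so every outbound call gets a bounded timeout. This is a sketch under assumptions: `withTimeout` and `fakeModelCall` are hypothetical names, and `fakeModelCall` only simulates a slow call that respects an `AbortSignal`.

```typescript
// Runs `work` with an AbortSignal that fires after `ms` milliseconds.
// The timer is always cleared so it cannot fire after the work settles.
async function withTimeout<T>(
  ms: number,
  work: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await work(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}

// Stand-in for a model call: resolves after `delayMs` unless aborted first.
function fakeModelCall(delayMs: number, signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve("done"), delayMs);
    signal.addEventListener("abort", () => {
      clearTimeout(timer);
      reject(new Error("aborted"));
    });
  });
}
```

Note that the timeout only helps if the underlying call actually honors the signal; a client that ignores it will keep running in the background.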
3) Invalid message/tool schema
AutoGen TypeScript is strict about message shapes. A malformed tool result can blow up at runtime with errors like ValidationError or Unexpected tool response format.
// ❌ Wrong shape
return { data: "ok" };
Use the exact structure your agent expects:
// ✅ Consistent structured output
return { ok: true, data: "ok" };
Also verify any JSON schema passed to tools matches the runtime payload exactly.
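One way to catch a malformed result before it reaches the runtime is a small type guard at the tool boundary. This is a minimal sketch, assuming the `{ ok, data }` / `{ ok, error }` convention from this article; `ToolOutput`, `isToolOutput`, and `assertToolOutput` are hypothetical names, and the shape your agent actually expects may differ.

```typescript
// The shape this article's tools return; `ToolOutput` is a hypothetical name.
type ToolOutput =
  | { ok: true; data: unknown }
  | { ok: false; error: string };

// Runtime check that a value matches ToolOutput before it leaves the tool.
function isToolOutput(value: unknown): value is ToolOutput {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  if (record["ok"] === true) return "data" in record;
  if (record["ok"] === false) return typeof record["error"] === "string";
  return false;
}

// Throws early, with a clear message, instead of failing deep in the runtime.
function assertToolOutput(value: unknown): ToolOutput {
  if (!isToolOutput(value)) {
    throw new Error(`Malformed tool output: ${JSON.stringify(value)}`);
  }
  return value;
}
```

A failure here names the offending payload, which is much easier to trace than a validation error raised several layers later.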
4) Missing environment variables in production
This one causes classic “works locally” behavior. Your local shell has OPENAI_API_KEY, but production does not.
const apiKey = process.env.OPENAI_API_KEY!;
That non-null assertion hides the problem until runtime. Use explicit validation at startup:
const apiKey = process.env.OPENAI_API_KEY;
if (!apiKey) throw new Error("Missing OPENAI_API_KEY");
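With more than one variable, it helps to check them all in one pass so a single failed deploy reveals every gap. Here's a sketch of that; `requireEnv` is a hypothetical helper, and `DATABASE_URL` in the test comment below is just an example name.

```typescript
// Collects every missing variable so one startup failure lists all gaps.
// `env` is passed in (for example process.env) to keep the function testable.
function requireEnv<K extends string>(
  env: Record<string, string | undefined>,
  keys: readonly K[],
): Record<K, string> {
  const missing = keys.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  const result = {} as Record<K, string>;
  for (const key of keys) {
    result[key] = env[key] as string;
  }
  return result;
}

// At startup, something like:
// const config = requireEnv(process.env, ["OPENAI_API_KEY"]);
```

Calling this at module load means a misconfigured deployment fails immediately instead of on the first unlucky request.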
How to Debug It
- Check whether the failure happens before or after the model call
  - Add logs around every boundary:
      console.log("before tool");
      console.log("before model call");
      console.log("after model call");
  - If it fails before the model call, it’s usually your app code or tool setup.
- Wrap each tool in its own try/catch
  - Log the full error and input payload.
  - Look for TypeError, DB errors, or serialization issues.
      try {
        return await riskyTool(input);
      } catch (err) {
        console.error("tool failed", { input, err });
        throw err;
      }
- Run with one request at a time
  - Disable concurrency in your load test.
  - If the error disappears, you likely have shared mutable state or race conditions.
- Inspect production logs for exact class names
  - Search for:
    - AgentRuntimeError
    - ValidationError
    - AbortError
    - Internal Server Error
  - The class name usually tells you whether this is a timeout, a schema mismatch, or an uncaught tool failure.
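The "run with one request at a time" check can be reproduced in miniature without any agent at all. The sketch below is purely illustrative: `handleShared`, `handleScoped`, and `sleep` are hypothetical names, and the `sleep` call stands in for a model call of varying latency.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// ❌ Shared module-level variable, as in the broken pattern.
let sharedUserId: string | undefined;

async function handleShared(userId: string, delayMs: number): Promise<string> {
  sharedUserId = userId;
  await sleep(delayMs); // stands in for the model call
  return sharedUserId!; // may have been overwritten by another request
}

// ✅ Request-scoped: the value lives in this call's own scope.
async function handleScoped(userId: string, delayMs: number): Promise<string> {
  const scopedUserId = userId;
  await sleep(delayMs);
  return scopedUserId;
}

async function demo() {
  // Overlapping requests: the second call overwrites the shared variable
  // while the first is still waiting.
  const concurrent = await Promise.all([handleShared("alice", 20), handleShared("bob", 5)]);
  // One at a time: each call finishes before the next one starts.
  const sequential = [await handleShared("alice", 20), await handleShared("bob", 5)];
  const scoped = await Promise.all([handleScoped("alice", 20), handleScoped("bob", 5)]);
  return { concurrent, sequential, scoped };
}
```

Here the concurrent run returns "bob" for both callers, while the sequential and request-scoped runs each return the right user. That is the same signature you see in production: correct under light load, wrong when requests overlap.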
Prevention
- Keep agents stateless; store conversation state outside the agent instance.
- Validate env vars at startup and tool inputs at the tool boundary instead of failing mid-request.
- Put every external dependency behind retries and bounded timeouts:
  - DB calls
  - HTTP tools
  - file/network access
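As a rough sketch of "retries and bounded timeouts", here's one way to combine the two. `retryWithTimeout`, `withDeadline`, and `RetryOptions` are hypothetical names. Two assumptions worth stating: the wrapped operation is safe to repeat (idempotent), and `withDeadline` only stops waiting, it does not cancel the underlying work.

```typescript
interface RetryOptions {
  attempts: number;    // total tries, including the first
  timeoutMs: number;   // upper bound for each individual try
  baseDelayMs: number; // backoff doubles from this value
}

// Rejects if `promise` does not settle within `ms` milliseconds.
// This stops waiting; it does not cancel the underlying operation.
function withDeadline<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

async function retryWithTimeout<T>(
  operation: () => Promise<T>,
  { attempts, timeoutMs, baseDelayMs }: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await withDeadline(operation(), timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt < attempts - 1) {
        // Exponential backoff: baseDelayMs, then 2x, 4x, ...
        const delay = baseDelayMs * 2 ** attempt;
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

Keep the worst case (attempts times timeout, plus backoff) below your route's own time limit, otherwise the retries themselves become a source of 500s.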
If you’re seeing intermittent 500s in AutoGen TypeScript, start by removing shared state from your agent and tools. That fixes more production incidents than any other change I’ve seen in this stack.
By Cyprian Aarons, AI Consultant at Topiax.