How to Fix 'intermittent 500 errors when scaling' in LlamaIndex (TypeScript)
Intermittent 500 errors when scaling in LlamaIndex TypeScript usually means your app works under light load, then starts failing once concurrent requests, rate limits, or shared state kick in. In practice, this shows up when you move from a single local request to multiple parallel queries, background ingestion jobs, or serverless traffic spikes.
The key detail: 500 is usually the symptom, not the root cause. In LlamaIndex apps, the real issue is often an unhandled upstream failure from OpenAI/Azure/OpenRouter, a reused client with bad concurrency settings, or shared mutable state inside your query pipeline.
The Most Common Cause
The #1 cause is creating one shared index/query engine and hammering it with concurrent requests without controlling concurrency or isolating per-request state.
In TypeScript, this often looks fine in development and fails under load with errors like:
- `Error: 500 Internal Server Error`
- `OpenAI API error: Rate limit reached`
- `Failed to execute query engine`
- `TypeError: Cannot read properties of undefined`
- `AbortError: The operation was aborted`
Here’s the broken pattern versus the fixed one.
| Broken pattern | Fixed pattern |
|---|---|
| Reuse one mutable query engine for all requests | Keep the index singleton, but create per-request query execution with throttling |
| Fire unlimited parallel calls | Limit concurrency |
| Ignore upstream retries/timeouts | Add retry/backoff and request timeouts |
```typescript
// ❌ Broken: unlimited parallel calls against one shared engine
import { VectorStoreIndex } from "llamaindex";

// `docs` is assumed to be loaded elsewhere (e.g., via a reader)
const index = await VectorStoreIndex.fromDocuments(docs);
const queryEngine = index.asQueryEngine();

export async function handler(req: Request) {
  const prompts = await req.json();
  // This explodes under load if many requests hit at once
  const results = await Promise.all(
    prompts.map((prompt: string) => queryEngine.query({ query: prompt }))
  );
  return Response.json(results);
}
```
```typescript
// ✅ Fixed: throttle concurrency and isolate request handling
import pLimit from "p-limit";
import { VectorStoreIndex } from "llamaindex";

const limit = pLimit(4);
const index = await VectorStoreIndex.fromDocuments(docs);

export async function handler(req: Request) {
  const prompts = await req.json();
  const results = await Promise.all(
    prompts.map((prompt: string) =>
      limit(async () => {
        // Fresh query engine per request; the index itself stays shared
        const queryEngine = index.asQueryEngine();
        return await queryEngine.query({ query: prompt });
      })
    )
  );
  return Response.json(results);
}
```
Why this matters:
- `VectorStoreIndex` can be shared.
- `QueryEngine` usage should be treated as request-scoped in high-concurrency paths.
- If your model provider rate-limits or times out, parallel bursts turn into intermittent 500s fast.
Other Possible Causes
1. Provider rate limits masked as server errors
Many teams see a generic 500 from their API route when the actual failure comes from OpenAI or another provider.
Typical underlying messages:
- `429 Too Many Requests`
- `Rate limit reached for gpt-4o`
- `OpenAI API error`
```typescript
try {
  return await queryEngine.query({ query });
} catch (err) {
  console.error("LLM call failed:", err);
  throw err; // don't swallow the real provider error
}
```
If you wrap everything in a generic catch and return 500, you lose the real signal.
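A minimal sketch of surfacing the upstream status instead of a blanket 500, assuming the provider SDK attaches a numeric `status` to thrown errors (the OpenAI Node SDK does) and reusing `queryEngine` from the earlier examples:
```typescript
export async function handler(req: Request) {
  const { query } = await req.json();
  try {
    return Response.json(await queryEngine.query({ query }));
  } catch (err) {
    // Map upstream rate limits to 429 so clients can back off
    const status = (err as { status?: number }).status;
    if (status === 429) {
      return new Response("Upstream rate limit reached", { status: 429 });
    }
    console.error("LLM call failed:", err);
    return new Response("Internal error", { status: 500 });
  }
}
```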
2. Shared global Settings.llm / Settings.embedModel mutated at runtime
In LlamaIndex TypeScript, global settings are convenient, but mutating them per request is a bad idea in multi-user services.
```typescript
// ❌ Bad: changing global settings per request
import { Settings } from "llamaindex";

export async function handler(req: Request) {
  // `myTenantSpecificLLM` / `myTenantSpecificEmbedModel` stand in for
  // per-tenant model instances resolved from the request
  Settings.llm = myTenantSpecificLLM;
  Settings.embedModel = myTenantSpecificEmbedModel;
  // concurrent requests can now stomp on each other's settings
}
```
Use immutable initialization at process startup instead.
```typescript
// ✅ Good: configure once during boot
import { Settings } from "llamaindex";

// `defaultLLM` / `defaultEmbedModel` are constructed once at startup
Settings.llm = defaultLLM;
Settings.embedModel = defaultEmbedModel;
```
If you need tenant-specific models, build isolated service instances instead of mutating globals.
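One hedged way to do that: memoize a query engine per tenant behind a factory, so concurrent tenants never share mutable model config. `createTenantEngine` here is a hypothetical factory, not a LlamaIndex API:
```typescript
// Minimal local interface so the sketch stays self-contained
interface TenantQueryEngine {
  query(params: { query: string }): Promise<unknown>;
}

// Hypothetical factory: builds an index + query engine from
// tenant-specific model config instead of reassigning Settings
declare function createTenantEngine(tenantId: string): Promise<TenantQueryEngine>;

const enginesByTenant = new Map<string, Promise<TenantQueryEngine>>();

function getTenantEngine(tenantId: string): Promise<TenantQueryEngine> {
  let engine = enginesByTenant.get(tenantId);
  if (!engine) {
    // First request for this tenant builds its isolated instance
    engine = createTenantEngine(tenantId);
    enginesByTenant.set(tenantId, engine);
  }
  return engine;
}
```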
3. Serverless cold starts + oversized indexes
If you build the index on every invocation or load huge documents into memory during request handling, scaling will surface timeouts and intermittent failures.
```typescript
// ❌ Bad: rebuilding the index on every request
import { VectorStoreIndex } from "llamaindex";

export async function handler(req: Request) {
  const docs = await loadDocs();
  const index = await VectorStoreIndex.fromDocuments(docs);
  const engine = index.asQueryEngine();
  return Response.json(await engine.query({ query: "..." }));
}
```
Move ingestion out of the hot path:
// ✅ Good: build once, reuse persisted storage
const storageContext = await StorageContext.fromDefaults({
vectorStore,
});
const index = await VectorStoreIndex.init({
storageContext,
});
Persist your store and reload it in the runtime instead of reconstructing everything per request.
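As a sketch of the offline ingestion side, assuming the simple on-disk store via `persistDir` (swap in a hosted vector DB for production) and `docs` produced by your reader:
```typescript
import { VectorStoreIndex, storageContextFromDefaults } from "llamaindex";

// Run this in a job/cron, not in the request path
const storageContext = await storageContextFromDefaults({
  persistDir: "./storage",
});
// Building against a persistent storage context writes the index to disk,
// so the query service can reload it instead of re-ingesting
await VectorStoreIndex.fromDocuments(docs, { storageContext });
```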
4. Missing timeout/retry policy on upstream HTTP calls
Under scale, transient network failures become visible. Without retry/backoff, they bubble up as random 500s.
```typescript
// Example wrapper around LLM/query calls
async function withRetry<T>(fn: () => Promise<T>, retries = 3): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Exponential backoff: 200ms, 400ms, 800ms, ...
      await new Promise((r) => setTimeout(r, Math.pow(2, i) * 200));
    }
  }
  throw lastErr;
}
```
Use this around provider calls where transient failures are expected.
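Retries handle transient failures, but pair them with a timeout so a hung upstream call fails fast instead of holding the request open. A minimal sketch (`withTimeout` is a hypothetical helper, not a LlamaIndex API):
```typescript
async function withTimeout<T>(fn: () => Promise<T>, ms = 30_000): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  try {
    // Whichever settles first wins
    return await Promise.race([fn(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Compose: timeout inside, retry outside
// const res = await withRetry(() => withTimeout(() => queryEngine.query({ query })));
```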
How to Debug It
- Log the full underlying error
  - Don’t stop at `500`.
  - Capture stack traces and provider payloads.
  - Look for messages like `429`, `ECONNRESET`, `AbortError`, or SDK-specific failures.
- Disable concurrency temporarily
  - Replace `Promise.all(...)` with sequential execution, as in the sketch after this list.
  - If errors disappear, you’re dealing with contention or rate limiting.
  - That points straight at throttling or shared-state bugs.
- Check whether you mutate global LlamaIndex settings
  - Search for assignments to `Settings.llm`, `Settings.embedModel`, or other process-wide config.
  - If they change inside handlers, that’s a red flag.
- Separate ingestion from querying
  - If your route builds indexes during traffic, move that work to a job/cron.
  - Query endpoints should only read persisted indexes and execute retrieval.
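A minimal sequential fallback for that debugging step, assuming `prompts` and the shared `index` from the earlier examples:
```typescript
// Run queries one at a time instead of in parallel
const results: unknown[] = [];
for (const prompt of prompts) {
  const queryEngine = index.asQueryEngine();
  // If the intermittent 500s vanish here, the bug is concurrency-related
  // (rate limits, contention, or shared mutable state)
  results.push(await queryEngine.query({ query: prompt }));
}
```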
Prevention
- Initialize LlamaIndex config once at startup.
- Treat query execution as request-scoped and add concurrency limits.
- Persist indexes and vector stores; don’t rebuild them in hot paths.
- Add structured logging (see the sketch after this list) around:
  - provider status codes
  - latency
  - retry counts
  - token usage
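As a sketch of that logging, with plain `console` standing in for a real logger (status extraction assumes the provider SDK attaches a numeric `status`, as the OpenAI Node SDK does):
```typescript
async function loggedQuery(
  engine: { query(p: { query: string }): Promise<unknown> },
  query: string
) {
  const start = Date.now();
  try {
    const result = await engine.query({ query });
    // One structured line per call: easy to aggregate latency percentiles
    console.log(JSON.stringify({ event: "llm_query", ok: true, latencyMs: Date.now() - start }));
    return result;
  } catch (err) {
    console.error(JSON.stringify({
      event: "llm_query",
      ok: false,
      latencyMs: Date.now() - start,
      status: (err as { status?: number }).status,
      message: String(err),
    }));
    throw err;
  }
}
```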
If you’re seeing intermittent 500s only after scaling traffic up, assume concurrency first. In LlamaIndex TypeScript apps, that’s usually where the bug lives.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.