How to Fix 'deployment crash when scaling' in LlamaIndex (TypeScript)
When you see a 'deployment crash when scaling' in a LlamaIndex TypeScript app, it usually means your process is fine on one instance but falls over once the platform starts running multiple replicas or restarts containers. In practice, this shows up when your index, retriever, or chat engine depends on local memory, ephemeral files, or startup-time network calls that don’t survive horizontal scaling.
The symptom is rarely “LlamaIndex is broken.” It’s usually a deployment shape problem: state lives in the wrong place, initialization happens at the wrong time, or each replica is loading something expensive and failing under pressure.
The Most Common Cause
The #1 cause is building your index in memory during app startup and expecting every scaled instance to share it.
That works locally. It breaks in Kubernetes, ECS, Cloud Run, or any setup where one pod/container can die and another starts with an empty heap.
Here’s the broken pattern:
| Broken | Fixed |
|---|---|
| Build `VectorStoreIndex` from scratch on boot | Persist storage and reload it |
| Keep `Document[]` only in process memory | Store documents in durable storage |
| Recreate embeddings on every replica start | Initialize once, reuse across instances |
```ts
// BROKEN: the index exists only in this process's memory
import { Document, VectorStoreIndex } from "llamaindex";

const docs = [
  new Document({ text: "Claims policy for premium customers..." }),
  new Document({ text: "Underwriting rules for SME accounts..." }),
];

// Rebuilt from scratch (including embedding calls) on every cold start,
// and never shared between replicas
const index = await VectorStoreIndex.fromDocuments(docs);

export async function handler(req: Request) {
  const queryEngine = index.asQueryEngine();
  const response = await queryEngine.query({
    query: "What is the claims policy?",
  });
  return Response.json({ answer: response.toString() });
}
```
```ts
// FIXED: persist the index once, then load it on startup instead of rebuilding
import { VectorStoreIndex, storageContextFromDefaults } from "llamaindex";

export async function buildOrLoadIndex() {
  // persistDir must point at storage every replica can read, e.g. a
  // directory baked into the image or a mounted volume
  const storageContext = await storageContextFromDefaults({
    persistDir: "./storage",
  });

  // Loads the persisted vector store, doc store, and index store
  return await VectorStoreIndex.init({ storageContext });
}

// Start loading once at module scope; every request awaits the same promise
const indexPromise = buildOrLoadIndex();

export async function handler(req: Request) {
  const index = await indexPromise;
  const queryEngine = index.asQueryEngine();
  const response = await queryEngine.query({
    query: "What is the claims policy?",
  });
  return Response.json({ answer: response.toString() });
}
```
If you’re using a real vector database like Pinecone, Qdrant, Weaviate, or Postgres/pgvector, the fix is the same idea: do not treat local process memory as your source of truth.
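With Qdrant, for example, the offline ingestion job writes into a named collection and every replica queries that same collection at startup. Here is a minimal sketch, assuming the `QdrantVectorStore` integration and the `url`/`collectionName` options exposed by your installed LlamaIndex.TS version; check your version’s docs for the exact import path.

```ts
// Sketch: replicas query a shared external vector store instead of local memory.
// Import path and constructor options depend on your llamaindex version.
import { QdrantVectorStore, VectorStoreIndex } from "llamaindex";

const vectorStore = new QdrantVectorStore({
  url: process.env.QDRANT_URL ?? "http://localhost:6333",
  collectionName: "policies",
});

// Build the index view over vectors the ingestion job already wrote
const index = await VectorStoreIndex.fromVectorStore(vectorStore);
const queryEngine = index.asQueryEngine();
```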
Other Possible Causes
1) You are creating too many clients per request
This often shows up as connection churn during scale-out. The process looks healthy until traffic increases, and then you get timeouts or crashes like `ECONNRESET`, `ETIMEDOUT`, or upstream failures from your embedding provider.
```ts
// BAD: new clients (and new connection pools) on every request
import { OpenAI, OpenAIEmbedding } from "llamaindex"; // or "@llamaindex/openai" in newer releases

export async function handler(req: Request) {
  const llm = new OpenAI({ model: "gpt-4o-mini" });
  const embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });
  // created on every request
}
```
```ts
// GOOD: create clients once at module scope and reuse them across requests
import { OpenAI, OpenAIEmbedding } from "llamaindex"; // or "@llamaindex/openai" in newer releases

const llm = new OpenAI({ model: "gpt-4o-mini" });
const embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });

export async function handler(req: Request) {
  // reuse the shared clients
}
```
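Newer LlamaIndex.TS releases also expose a global `Settings` object, so shared models can be registered once and picked up by every index and query engine. A minimal sketch, assuming a recent version that exports `Settings`:

```ts
// Register shared models once; indexes and query engines created later
// pick them up from Settings instead of constructing their own clients.
import { Settings, OpenAI, OpenAIEmbedding } from "llamaindex";

Settings.llm = new OpenAI({ model: "gpt-4o-mini" });
Settings.embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });
```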
2) Your runtime memory limit is too low
A common failure looks like:
- `FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory`
- container exit code 137
- pod restart loops during warmup
If you’re loading large documents into `VectorStoreIndex.fromDocuments(...)`, increase memory or move ingestion offline.
```yaml
resources:
  limits:
    memory: "2Gi"
```
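Note that the container limit and V8’s own heap ceiling are separate knobs. If you raise the pod’s memory, you may also need to raise Node’s heap limit; a sketch for the same Kubernetes spec, with 1536 MB as an illustrative value for a 2Gi container:

```yaml
env:
  - name: NODE_OPTIONS
    value: "--max-old-space-size=1536"
```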
3) You are doing ingestion inside the web request path
This kills scale because every replica tries to chunk documents, generate embeddings, and write vectors at request time.
```ts
// BAD: every replica chunks, embeds, and indexes documents inside the request
app.post("/chat", async (req, res) => {
  const docs = await loadDocsFromS3();
  const index = await VectorStoreIndex.fromDocuments(docs);
  // ...query and respond
});
```
Move this into a background job or deployment step:
```ts
// GOOD: run ingestion once, outside the request path (placeholder functions)
await ingestDocumentsOnce();
await persistIndexToVectorDB();
```
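Concretely, that can be a small script executed as a build or deploy step. Here is a minimal sketch, reusing the hypothetical `loadDocsFromS3` helper from the example above and assuming a `./storage` directory that ships with the image or sits on a mounted volume:

```ts
// ingest.ts - run during build/deploy, never inside a web request
import { VectorStoreIndex, storageContextFromDefaults } from "llamaindex";
import { loadDocsFromS3 } from "./loaders"; // hypothetical helper

async function main() {
  const docs = await loadDocsFromS3();

  // Persist to a location every replica can read; swap in a real
  // vector store (Pinecone, Qdrant, pgvector) for larger corpora.
  const storageContext = await storageContextFromDefaults({
    persistDir: "./storage",
  });

  // Chunks, embeds, and persists exactly once
  await VectorStoreIndex.fromDocuments(docs, { storageContext });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```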
4) Your environment variables differ between replicas
One pod has `OPENAI_API_KEY`, another doesn’t. Then you get errors like:
- `OpenAI API error: Incorrect API key provided`
- `Error fetching embeddings`
- `AuthenticationError`
Check that secrets are mounted consistently across all deployments.
```ts
if (!process.env.OPENAI_API_KEY) {
  throw new Error("OPENAI_API_KEY missing");
}
```
How to Debug It
- Check whether the crash happens on startup or on the first request. Startup crashes usually point to bad initialization or missing secrets. First-request crashes usually point to lazy loading, network calls, or per-request ingestion.
- Look at the exact error string in logs. If you see `heap out of memory`, it’s capacity. If you see `AuthenticationError` or `401`, it’s config. If you see repeated rebuilds via `VectorStoreIndex.fromDocuments`, it’s architecture.
- Log which replica is handling the request. Add pod/container identifiers so you can tell whether one instance fails while others work.
```ts
// Tag every log line with the replica that produced it
console.log({
  podName: process.env.HOSTNAME,
  nodeEnv: process.env.NODE_ENV,
});
```
- Temporarily disable ingestion and use a prebuilt index. If the crash disappears when you stop building indexes at runtime, you’ve found the problem class immediately (see the sketch below).
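One cheap way to do that is a temporary environment flag that short-circuits runtime ingestion. A minimal sketch, using a hypothetical `SKIP_INGESTION` variable together with the `buildOrLoadIndex` helper from earlier:

```ts
// Hypothetical debug switch: when SKIP_INGESTION is set, never build an
// index at runtime; only load what was persisted ahead of time.
const index = process.env.SKIP_INGESTION === "true"
  ? await buildOrLoadIndex()       // prebuilt index from the earlier example
  : await ingestAndBuildIndex();   // hypothetical runtime ingestion path
```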
Prevention
- Build indexes offline and persist them to a real vector store or durable storage.
- Reuse LLM and embedding clients at module scope instead of recreating them per request.
- Add startup checks for required env vars and fail fast before traffic hits the service (see the sketch below).
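As a concrete version of the last point, the single-variable check from earlier generalizes into a startup guard that runs before the server binds its port. A minimal sketch; the variable list is an assumption about what your deployment actually needs:

```ts
// startup-checks.ts - run before the server starts accepting traffic
const required = ["OPENAI_API_KEY", "VECTOR_DB_URL"]; // adjust to your stack

const missing = required.filter((name) => !process.env[name]);
if (missing.length > 0) {
  // Crash loudly at boot so the orchestrator surfaces the broken replica
  // instead of letting it fail on the first user request.
  throw new Error(`Missing required env vars: ${missing.join(", ")}`);
}
```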
If your app only works with one replica, assume state leakage until proven otherwise. In LlamaIndex TypeScript apps, scaling bugs are usually not indexing bugs — they’re deployment bugs wearing indexing clothes.
Keep learning
- The complete AI Agents Roadmap - my full 8-step breakdown
- Free: The AI Agent Starter Kit - PDF checklist + starter code
- Work with me - I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.