How to Fix 'OOM error during inference when scaling' in LlamaIndex (TypeScript)

By Cyprian Aarons · Updated 2026-04-21

When you see OOM error during inference when scaling in a LlamaIndex TypeScript app, it usually means your process ran out of memory while generating embeddings, calling an LLM, or building a large index. It tends to show up after you move from local testing to real traffic, larger documents, or concurrent requests.

In practice, this is rarely a “LlamaIndex bug.” It’s usually a batching, concurrency, or context-size problem in your own code.

The Most Common Cause

The #1 cause is loading too much data into memory at once during indexing or inference.

A common pattern is reading every file, creating every node, and embedding everything in one shot. That works for small datasets and then falls over as soon as the corpus grows.

Broken pattern: load all documents, split aggressively, embed everything concurrently.
Fixed pattern: process documents in chunks and cap concurrency.
// Broken: everything happens at once
import { SimpleDirectoryReader, VectorStoreIndex } from "llamaindex";

async function buildIndex() {
  const reader = new SimpleDirectoryReader();
  const docs = await reader.loadData({ directoryPath: "./docs" });

  // Large corpus + default concurrency can blow memory
  const index = await VectorStoreIndex.fromDocuments(docs);
  return index;
}
// Fixed: batch documents before indexing
import { SimpleDirectoryReader, VectorStoreIndex } from "llamaindex";

async function buildIndex() {
  const reader = new SimpleDirectoryReader();
  const docs = await reader.loadData({ directoryPath: "./docs" });

  const batchSize = 50;
  let index: VectorStoreIndex | undefined;

  for (let i = 0; i < docs.length; i += batchSize) {
    const batch = docs.slice(i, i + batchSize);

    if (!index) {
      // First batch creates the index
      index = await VectorStoreIndex.fromDocuments(batch);
    } else {
      // Later batches are inserted one document at a time
      for (const doc of batch) {
        await index.insert(doc);
      }
    }
  }

  return index!;
}

If you’re doing inference with a chat engine, the same issue appears when you send too much context into the prompt. You’ll often see errors like:

  • Error: OOM error during inference when scaling
  • RangeError: Invalid string length
  • Context length exceeded
  • Provider-side errors like 400 Bad Request: maximum context length exceeded
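
One blunt safeguard for the context-size variant is to cap how much retrieved text you allow into the prompt before calling the chat engine. The helper below is a hypothetical sketch in plain TypeScript, not a LlamaIndex API, and the characters-per-token ratio is only a heuristic; adjust both numbers to your model.

// Hypothetical guard: keep retrieved context under a rough token budget
const MAX_CONTEXT_TOKENS = 8000; // assumed model context window
const APPROX_CHARS_PER_TOKEN = 4; // heuristic, not a real tokenizer

function trimToBudget(chunks: string[], reservedTokens = 1500): string[] {
  const budgetChars =
    (MAX_CONTEXT_TOKENS - reservedTokens) * APPROX_CHARS_PER_TOKEN;
  const kept: string[] = [];
  let usedChars = 0;

  for (const chunk of chunks) {
    if (usedChars + chunk.length > budgetChars) break; // stop before overflow
    kept.push(chunk);
    usedChars += chunk.length;
  }
  return kept;
}

Dropping the lowest-ranked chunks is almost always better than an OOM or a provider-side context error.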

Other Possible Causes

1) Too many concurrent requests

If you fire off parallel queries with Promise.all, memory spikes fast.

// Bad
await Promise.all(queries.map((q) => queryEngine.query(q)));
// Better
for (const q of queries) {
  await queryEngine.query(q);
}

If you need concurrency, cap it:

import pLimit from "p-limit";

const limit = pLimit(2);
await Promise.all(queries.map((q) => limit(() => queryEngine.query(q))));

2) Chunk size is too large

Large chunks mean fewer nodes, but each node carries more text into embeddings and retrieval.

// Too large
const splitter = new SentenceSplitter({ chunkSize: 4096, chunkOverlap: 200 });
// Safer for most RAG workloads
const splitter = new SentenceSplitter({ chunkSize: 512, chunkOverlap: 50 });

This matters more if you use VectorStoreIndex with an embedding model that has strict token limits.
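
If you want the smaller splitter applied during ingestion rather than constructed ad hoc, recent llamaindex versions expose a global Settings object. Whether your installed version accepts a nodeParser there (or only chunkSize/chunkOverlap) is an assumption to verify against your package.

import { Settings, SentenceSplitter } from "llamaindex";

// Assumed: register the splitter globally so fromDocuments() uses it
Settings.nodeParser = new SentenceSplitter({ chunkSize: 512, chunkOverlap: 50 });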

3) Retrieving too many nodes

Pulling back similarityTopK: 20 or 50 can balloon the prompt before inference starts.

// Risky: 20 nodes can balloon the prompt
const retriever = index.asRetriever({ similarityTopK: 20 });
// Safer default for most chat-style RAG
const retriever = index.asRetriever({ similarityTopK: 3 });

For chat-style RAG, start small. Increase only if retrieval quality is clearly bad.
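
To see why, run the arithmetic on what a large similarityTopK actually sends to the model. The numbers below are illustrative assumptions; plug in your own chunk size and context window.

// Back-of-the-envelope prompt budget (illustrative numbers, not measured)
const chunkTokens = 512; // tokens per retrieved chunk, from your splitter
const similarityTopK = 20; // chunks retrieved per query
const contextTokens = chunkTokens * similarityTopK; // 10,240 tokens

// That alone overflows an 8K context window before the system prompt,
// question, or chat history is added. With similarityTopK = 3 the same
// chunks cost ~1,536 tokens and leave plenty of headroom.
console.log({ contextTokens });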

4) Model or provider mismatch

Sometimes the issue is not LlamaIndex itself. A smaller local model may not handle the same prompt size that worked on OpenAI or Anthropic.

const llm = new Ollama({
  model: "llama3",
});

If your local runtime has limited RAM/VRAM, reduce the following (sketched after the list):

  • context window
  • max output tokens
  • number of retrieved chunks
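
For example, with Ollama you can often shrink all three at once. Treat the snippet below as an assumed sketch: the options field and its keys (num_ctx, num_predict) are Ollama runtime parameters, and whether your installed llamaindex version forwards them this way, and exposes Ollama from this import path, should be verified against your package.

import { Ollama } from "llamaindex"; // import path may differ by version

// Sketch: smaller footprint for RAM/VRAM-constrained local inference
const llm = new Ollama({
  model: "llama3",
  options: {
    num_ctx: 2048, // smaller context window (assumed pass-through option)
    num_predict: 256, // cap output tokens (assumed pass-through option)
  },
});

// And retrieve fewer chunks per query on constrained hardware
const retriever = index.asRetriever({ similarityTopK: 2 });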

Also check whether your embedding model is running locally. Local embeddings plus local generation doubles memory pressure.

How to Debug It

  1. Isolate indexing vs inference

    • Run just document ingestion first.
    • Then run just one query against a tiny dataset.
    • If indexing fails, the issue is batching or document size.
    • If querying fails, it’s usually context size or concurrency.
  2. Log input sizes

    • Print document count.
    • Print average chunk length.
    • Print retrieved node count.
    • Print final prompt length before sending to the LLM.
console.log({
  docs: docs.length,
  retrievedNodes: nodes.length,
  promptChars: prompt.length,
});
  3. Reduce everything to a minimal case
    • One document.
    • One query.
    • One worker.
    • Small chunk size.

If that works, scale one variable at a time until it breaks.

  4. Watch process memory
    • Use Node heap logging or container metrics (see the snippet after this list).
    • If RSS climbs steadily during ingestion, you’re holding too much in memory.
    • If it spikes only on query time, look at retrieval and prompt construction.
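
A minimal way to watch memory from inside the process, using only Node's built-in process.memoryUsage():

// Log RSS and heap usage every 5 seconds while ingesting or serving queries
const memoryLogger = setInterval(() => {
  const { rss, heapUsed } = process.memoryUsage();
  console.log({
    rssMB: Math.round(rss / 1024 / 1024),
    heapUsedMB: Math.round(heapUsed / 1024 / 1024),
  });
}, 5000);

// Clear it when the job finishes:
// clearInterval(memoryLogger);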

Prevention

  • Batch ingestion and avoid Promise.all on large query sets unless you cap concurrency.
  • Keep chunk sizes modest and tune similarityTopK conservatively.
  • Prefer external vector stores for larger corpora instead of keeping everything in-process (see the sketch below).
  • Test with production-like document sizes early. Small-doc success does not predict real-world stability.
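
For the external vector store point, the shape in LlamaIndex.TS looks roughly like the sketch below. The Qdrant store, its constructor options, and the exact import paths vary between llamaindex versions (the Qdrant integration may live in a separate package), so treat this as an assumed outline to verify rather than drop-in code.

import {
  SimpleDirectoryReader,
  VectorStoreIndex,
  storageContextFromDefaults,
} from "llamaindex";
import { QdrantVectorStore } from "llamaindex"; // may be a separate @llamaindex package

async function buildExternalIndex() {
  const reader = new SimpleDirectoryReader();
  const docs = await reader.loadData({ directoryPath: "./docs" });

  // Embeddings live in Qdrant instead of the Node process heap
  const vectorStore = new QdrantVectorStore({ url: "http://localhost:6333" });
  const storageContext = await storageContextFromDefaults({ vectorStore });

  return VectorStoreIndex.fromDocuments(docs, { storageContext });
}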

If you’re building a TypeScript RAG service with LlamaIndex, treat memory as part of the API contract. The fix is usually not “more RAM,” it’s controlling how much data enters indexing and inference at each step.


By Cyprian Aarons, AI Consultant at Topiax.