How to Fix 'OOM error during inference during development' in LlamaIndex (TypeScript)

By Cyprian Aarons · Updated 2026-04-21
Tags: oom-error-during-inference-during-development, llamaindex, typescript

What the error means

An OOM ("out of memory") error during inference in development usually means your process ran out of memory while LlamaIndex was building embeddings, calling the LLM, or loading too much data into a single Node.js runtime. In TypeScript projects, this often shows up during local development with ts-node, hot reload, or when indexing a large folder without batching.

The usual symptom is a crash near an embedding call or query call, often with messages like:

  • FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
  • Error: OOM error during inference
  • Node.js heap out of memory

The Most Common Cause

The #1 cause is loading too much data into memory at once and then asking LlamaIndex to embed or infer over all of it in one shot.

This happens a lot when developers do something like:

  • read every file into an array
  • create one giant Document[]
  • call VectorStoreIndex.fromDocuments(...) on the whole batch
  • run development with the default Node heap

Broken vs fixed pattern

Broken pattern                    Fixed pattern
Load everything into memory       Chunk and batch documents
Build index from a huge array     Process incrementally
Use default Node heap             Increase heap only if needed

// ❌ Broken
import { VectorStoreIndex, Document } from "llamaindex";
import fs from "node:fs/promises";
import path from "node:path";

async function main() {
  const files = await fs.readdir("./data");
  const docs = await Promise.all(
    files.map(async (file) => {
      const text = await fs.readFile(path.join("./data", file), "utf8");
      return new Document({ text, metadata: { file } });
    })
  );

  // This can blow up memory during embedding/inference
  const index = await VectorStoreIndex.fromDocuments(docs);

  const engine = index.asQueryEngine();
  const response = await engine.query({ query: "Summarize the policy changes" });
  console.log(response.toString());
}

main();

// ✅ Fixed
import { VectorStoreIndex, Document } from "llamaindex";
import fs from "node:fs/promises";
import path from "node:path";

async function main() {
  const files = await fs.readdir("./data");
  const batchSize = 10;

  let index: VectorStoreIndex | undefined;

  for (let i = 0; i < files.length; i += batchSize) {
    const batch = files.slice(i, i + batchSize);

    const docs: Document[] = [];
    for (const file of batch) {
      const text = await fs.readFile(path.join("./data", file), "utf8");
      docs.push(new Document({ text, metadata: { file } }));
    }

    // Build smaller batches instead of one huge in-memory load
    index = index
      ? await VectorStoreIndex.fromDocuments(docs, { appendToIndex: true })
      : await VectorStoreIndex.fromDocuments(docs);
  }

  if (!index) throw new Error("No documents found");

  const engine = index.asQueryEngine();
  const response = await engine.query({ query: "Summarize the policy changes" });
  console.log(response.toString());
}

main();

If your version of LlamaIndex does not support appendToIndex, use a persistent vector store and insert batches manually instead of rebuilding everything in one pass.
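
As a rough sketch of that approach, assuming your llamaindex version exposes storageContextFromDefaults and an insert method on the index (both worth verifying against your installed version):

// Sketch: persist the index and insert later batches incrementally.
// storageContextFromDefaults and index.insert are assumptions about your
// installed llamaindex version; check its API before relying on them.
import { VectorStoreIndex, Document, storageContextFromDefaults } from "llamaindex";

async function buildIncrementally(batches: Document[][]): Promise<VectorStoreIndex> {
  // Persist vectors to disk instead of holding everything in memory
  const storageContext = await storageContextFromDefaults({ persistDir: "./storage" });

  // Build the index once from the first batch...
  const index = await VectorStoreIndex.fromDocuments(batches[0], { storageContext });

  // ...then insert the remaining batches one document at a time
  for (const batch of batches.slice(1)) {
    for (const doc of batch) {
      await index.insert(doc);
    }
  }

  return index;
}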

Other Possible Causes

1) Your chunk size is too large

Large chunks produce huge embeddings and bigger prompt contexts. That increases memory pressure fast.

// Too large
const splitterConfig = {
  chunkSize: 4000,
  chunkOverlap: 200,
};

Use smaller chunks:

// Better for dev
const splitterConfig = {
  chunkSize: 512,
  chunkOverlap: 64,
};
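
How you apply these values depends on your setup; one common pattern is to set them globally. A minimal sketch, assuming the Settings and SentenceSplitter exports in your llamaindex version:

// Sketch: apply the smaller chunking globally. The Settings and
// SentenceSplitter exports are assumptions; check your llamaindex version.
import { Settings, SentenceSplitter } from "llamaindex";

Settings.chunkSize = 512;
Settings.chunkOverlap = 64;

// Or configure the splitter explicitly
Settings.nodeParser = new SentenceSplitter({ chunkSize: 512, chunkOverlap: 64 });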

2) You are using an oversized model locally

If you run a local model through Ollama, LM Studio, or another runtime, the model itself may be consuming most of your RAM before LlamaIndex even starts inference.

// Heavy local model for dev box
const llmModel = "llama3.1:70b";

Try a smaller model first:

const llmModel = "llama3.1:8b";
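
To make LlamaIndex actually use the smaller model, the model name is typically passed when constructing the LLM. A rough sketch, assuming the Ollama integration (import paths vary by version; newer releases ship it as @llamaindex/ollama):

// Sketch: point LlamaIndex at a smaller local model via Ollama.
// The import path and constructor options are assumptions; adjust them
// for your installed version.
import { Ollama } from "@llamaindex/ollama";
import { Settings } from "llamaindex";

Settings.llm = new Ollama({ model: "llama3.1:8b" });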

3) You are creating repeated indexes inside a loop

A common mistake is rebuilding the whole index per request or per file change.

for (const request of requests) {
  const index = await VectorStoreIndex.fromDocuments(docs);
  await index.asQueryEngine().query({ query: request });
}

Build once and reuse:

const index = await VectorStoreIndex.fromDocuments(docs);
const engine = index.asQueryEngine();

for (const request of requests) {
  await engine.query({ query: request });
}

4) Your Node process heap is too small

Sometimes the code is fine, but Node’s default heap is not enough for local indexing.

node --max-old-space-size=8192 dist/index.js

When running with tsx or ts-node, set it via NODE_OPTIONS:

NODE_OPTIONS="--max-old-space-size=8192" npx tsx src/index.ts
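
If you want the larger heap applied automatically in development, one option is to bake it into your dev script (the script name and entry point below are just examples):

{
  "scripts": {
    "dev": "NODE_OPTIONS=--max-old-space-size=8192 tsx src/index.ts"
  }
}

Setting the variable inline like this works in Unix-style shells; on Windows cmd.exe you may need a helper such as cross-env.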

How to Debug It

  1. Check where the crash happens

    • If it dies during VectorStoreIndex.fromDocuments(...), it’s usually document batching or chunking.
    • If it dies during .query(...), it’s usually prompt size or local model memory.
  2. Log document counts and chunk sizes

    • Print how many documents you load.
    • Print average text length before indexing.
    • If you see thousands of docs or multi-megabyte chunks, that’s your issue (a logging sketch follows this list).
  3. Test with a tiny dataset

    • Run the same code on one file.
    • Then ten files.
    • Then your full corpus.
    • If it only fails at scale, you have a batching problem.
  4. Switch to a smaller model and lower chunk size

    • Drop to chunkSize: 512.
    • Use a smaller local model.
    • If memory stabilizes, you’ve confirmed the cause.
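
For step 2, a quick logging sketch (the directory path is just an example):

// Quick visibility into corpus size before indexing
import fs from "node:fs/promises";
import path from "node:path";

async function inspectCorpus(dir: string) {
  const files = await fs.readdir(dir);
  let totalChars = 0;
  for (const file of files) {
    const text = await fs.readFile(path.join(dir, file), "utf8");
    totalChars += text.length;
  }
  console.log(`documents: ${files.length}`);
  console.log(`average length: ${Math.round(totalChars / Math.max(files.length, 1))} chars`);
}

inspectCorpus("./data").catch(console.error);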

Prevention

  • Batch document ingestion instead of calling VectorStoreIndex.fromDocuments() on huge arrays.
  • Keep dev-time chunk sizes small unless you have a reason not to.
  • Set a realistic Node heap size for local indexing jobs:
NODE_OPTIONS="--max-old-space-size=8192"

If you’re building agents for production systems like banking or insurance workflows, treat indexing as a pipeline step, not an in-request operation. That one design choice avoids most OOM issues before they show up.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

