How to Fix 'OOM error during inference' in LlamaIndex (TypeScript)

By Cyprian Aarons · Updated 2026-04-21
Tags: oom-error-during-inference, llamaindex, typescript

When you see OOM error during inference in LlamaIndex TypeScript, it means your process ran out of memory while the model was generating a response or embedding data. In practice, this usually shows up when you send too much text at once, build a giant prompt, or load a model/context window that exceeds what your runtime can handle.

This is not a LlamaIndex bug in most cases. It’s usually a data-shape, batching, or model-configuration problem.

The Most Common Cause

The #1 cause is trying to stuff too much text into a single LLM call.

In LlamaIndex TypeScript, this often happens when you call VectorStoreIndex.fromDocuments() on large documents and then query without proper chunking, or when you pass an oversized context into a query engine. The model then tries to process a prompt that is too large for the available memory.

Broken vs fixed pattern

  • Broken: Load full documents and query directly. Fixed: Split into chunks before indexing.
  • Broken: Build one huge prompt. Fixed: Keep retrieved context bounded.
  • Broken: Let defaults handle everything. Fixed: Set chunk size and retrieval limits explicitly.
// ❌ Broken: huge documents go straight into indexing/querying
import { Document, VectorStoreIndex } from "llamaindex";

const docs = [
  new Document({ text: veryLargeContractText }),
  new Document({ text: anotherMassivePolicyText }),
];

const index = await VectorStoreIndex.fromDocuments(docs);
const queryEngine = index.asQueryEngine();

const response = await queryEngine.query({
  query: "Summarize the cancellation clause and exclusions.",
});

console.log(response.toString());

// ✅ Fixed: chunk the content and cap retrieval
import {
  Document,
  VectorStoreIndex,
  SentenceSplitter,
} from "llamaindex";

const docs = [
  new Document({ text: veryLargeContractText }),
  new Document({ text: anotherMassivePolicyText }),
];

const splitter = new SentenceSplitter({
  chunkSize: 512,
  chunkOverlap: 64,
});

const index = await VectorStoreIndex.fromDocuments(docs, {
  transformations: [splitter],
});

const queryEngine = index.asQueryEngine({
  similarityTopK: 3,
});

const response = await queryEngine.query({
  query: "Summarize the cancellation clause and exclusions.",
});

console.log(response.toString());

If you see errors like:

  • OOM error during inference
  • JavaScript heap out of memory
  • Failed to generate response due to memory pressure

this is the first place to look.

Other Possible Causes

1. Embedding too many nodes at once

If you ingest thousands of chunks in one shot, the embedding step can blow up memory before you even reach inference.

// Problematic
await VectorStoreIndex.fromDocuments(hugeDocs);

Fix it by batching ingestion:

// Better: share one storage context so every batch lands in the same index
import { storageContextFromDefaults } from "llamaindex";

const storageContext = await storageContextFromDefaults({});
for (const batch of batchedDocs) {
  await VectorStoreIndex.fromDocuments(batch, { storageContext });
}

If your pipeline supports it, keep batch sizes small and predictable.
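
The batchedDocs variable above is assumed rather than provided by LlamaIndex; here is a minimal sketch of one way to build it, using a plain array-slicing helper:

// Hypothetical helper: split documents into fixed-size batches
function toBatches<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const batchedDocs = toBatches(hugeDocs, 50); // 50 per batch is a starting point, not a rule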

2. Context window too large for the selected model

Some models have small context windows. If your retriever returns too many nodes, the prompt gets inflated until inference fails.

const queryEngine = index.asQueryEngine({
  similarityTopK: 10, // may be too high for long chunks
});

Reduce retrieval size:

const queryEngine = index.asQueryEngine({
  similarityTopK: 2,
});

Also shorten chunk size if each chunk is large. A chunkSize of 2048 with topK=10 is asking for trouble.
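
The arithmetic behind that warning: chunkSize is measured in tokens, so retrieved context alone costs roughly chunkSize × topK tokens before you add instructions, chat history, or the output budget. A quick sanity check makes the problem visible (a sketch; the context window value is an assumption about your model):

// Rough budget check: retrieved context vs. the model's context window
const chunkSize = 2048;
const topK = 10;
const contextWindow = 8192; // assumption: replace with your model's real limit

const retrievedTokens = chunkSize * topK; // 20,480 tokens of context alone
if (retrievedTokens > contextWindow) {
  console.warn(`Retrieval budget (${retrievedTokens}) exceeds the window (${contextWindow}).`);
}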

3. Running on a constrained Node.js heap

TypeScript apps running under Node can hit heap limits before the model does.

node app.js

Run with more heap if your host allows it:

node --max-old-space-size=4096 app.js

This helps when the crash is actually JavaScript memory pressure during document prep, serialization, or post-processing.
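
To confirm the crash is Node-side rather than model-side, log heap usage around the suspect steps. process.memoryUsage() is built into Node, so this costs nothing to add:

// Log current heap usage at a labeled checkpoint
function logHeap(label: string) {
  const { heapUsed, heapTotal } = process.memoryUsage();
  const mb = (n: number) => (n / 1024 / 1024).toFixed(0);
  console.log(`${label}: ${mb(heapUsed)} MB used of ${mb(heapTotal)} MB heap`);
}

logHeap("before ingestion");
// ...fromDocuments(), query(), etc.
logHeap("after ingestion");

If usage climbs steadily during document prep, the fix is batching, not a bigger model box.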

4. Using an oversized prompt template

A custom prompt that includes raw document text, chat history, and retrieved nodes can explode token count fast.

const prompt = `
Answer using all of this:
${fullPolicyText}
${chatHistory}
${retrievedContext}
`;

Trim it down:

const prompt = `
Use only the retrieved context below.
If the answer isn't present, say so.
Context:
${retrievedContext}
`;

Keep prompts surgical. Don’t paste full source documents into them.
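
If retrievedContext itself can grow without bound, cap it before it reaches the template. A minimal sketch (the character budget is an assumption; tune it to your model and tokenizer):

// Hypothetical guard: hard-cap retrieved context before templating
function capContext(text: string, maxChars = 8000): string {
  if (text.length <= maxChars) return text;
  return text.slice(0, maxChars) + "\n[context truncated]";
}

const prompt = `
Use only the retrieved context below.
If the answer isn't present, say so.
Context:
${capContext(retrievedContext)}
`;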

How to Debug It

  1. Check where the crash happens

    • During fromDocuments() or embedding? It’s probably batching/chunking.
    • During query() or .chat()? It’s probably context window or prompt size.
    • During serialization/rendering? It may be Node heap pressure.
  2. Log chunk counts and sizes

    console.log("docs:", docs.length);
    console.log("avg chars:", docs.reduce((n, d) => n + d.text.length, 0) / docs.length);
    

    If chunks are huge or unbounded, fix splitting first.

  3. Lower retrieval aggressively

    • Set similarityTopK to 1 or 2
    • Reduce chunkSize
    • Remove extra chat history from the request
  4. Test with one document

    Strip your pipeline down to one small doc and one short question. If that works, reintroduce complexity until it breaks. A minimal repro is sketched below.
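
Using the same calls as the examples above (the document text and question are placeholders):

// Strip the pipeline to one tiny doc and one short question
import { Document, VectorStoreIndex } from "llamaindex";

const index = await VectorStoreIndex.fromDocuments([
  new Document({ text: "The cancellation window is 30 days from signing." }),
]);

const engine = index.asQueryEngine({ similarityTopK: 1 });
const response = await engine.query({
  query: "How long is the cancellation window?",
});
console.log(response.toString());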

Prevention

  • Always chunk documents before indexing. In TypeScript LlamaIndex apps, default splitting is rarely enough for long legal, insurance, or banking content.
  • Keep retrieval bounded. Start with low similarityTopK, then increase only if recall is clearly poor.
  • Watch total prompt size end-to-end: document chunks + chat history + instructions + output budget all count toward memory pressure.

If you’re building agent workflows for regulated domains, treat memory limits as part of your design constraints, not as an ops issue after deployment. The fix is usually smaller chunks, fewer retrieved nodes, and tighter prompts.
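
One practical way to do that is to keep every memory-related knob in a single config object so the limits are reviewable in one place. A sketch (the names and values are illustrative defaults, not library APIs):

// Hypothetical budget config: memory limits as explicit design constraints
const MEMORY_BUDGET = {
  chunkSize: 512,        // tokens per chunk
  chunkOverlap: 64,
  similarityTopK: 2,     // retrieved nodes per query
  maxPromptChars: 12_000,
  ingestionBatchSize: 50,
} as const;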

