How to Fix 'OOM error during inference in production' in LlamaIndex (TypeScript)
OOM during inference means your process ran out of memory while LlamaIndex was building prompts, embedding chunks, or calling the LLM. In TypeScript projects, this usually shows up in production when document volume spikes, chunk sizes are too large, or you accidentally keep too much data in memory during ingestion.
The failure is often noisy but the root cause is usually simple: you’re asking the runtime to hold more text, embeddings, or response state than the container can fit.
The Most Common Cause
The #1 cause is loading and indexing too much data at once. In LlamaIndex TypeScript, this usually happens when you call SimpleDirectoryReader.loadData() on a large directory and immediately pass the full array into VectorStoreIndex.fromDocuments() without batching.
Typical symptoms:
- Node process memory climbs until it dies
- You see FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
- In container logs, it may surface as OOMKilled
- LlamaIndex calls may fail around VectorStoreIndex.fromDocuments, SentenceSplitter, or OpenAIEmbedding
Broken vs fixed pattern
| Broken | Fixed |
|---|---|
| Loads everything into memory at once | Streams or batches documents |
| Uses large chunk sizes | Uses smaller chunks |
| Builds one giant index in a single pass | Inserts incrementally |
// BROKEN
import { SimpleDirectoryReader, VectorStoreIndex } from "llamaindex";

async function buildIndex() {
  const reader = new SimpleDirectoryReader();
  const docs = await reader.loadData("./contracts"); // loads everything at once
  const index = await VectorStoreIndex.fromDocuments(docs); // big memory spike
  return index;
}
// FIXED
import { Settings, SimpleDirectoryReader, VectorStoreIndex } from "llamaindex";

Settings.chunkSize = 512;
Settings.chunkOverlap = 64;

async function buildIndex() {
  const reader = new SimpleDirectoryReader();
  const docs = await reader.loadData("./contracts");

  // Process smaller batches instead of one huge array
  const batchSize = 25;
  let index: VectorStoreIndex | null = null;

  for (let i = 0; i < docs.length; i += batchSize) {
    const batch = docs.slice(i, i + batchSize);
    if (!index) {
      index = await VectorStoreIndex.fromDocuments(batch);
    } else {
      // insert() runs each document through the configured splitter,
      // unlike insertNodes(), which expects already-split nodes
      for (const doc of batch) {
        await index.insert(doc);
      }
    }
  }
  return index!;
}
If your dataset is large, the fix is not “add more RAM” first. The fix is to stop forcing the whole pipeline into one allocation spike.
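If even the batched version above holds too many Document objects at once, you can go a step further and read files one at a time, so only a single document is in memory before it is chunked and embedded. The following is a minimal sketch, not the library's prescribed pattern: it assumes plain-text files in a flat ./contracts-style directory and the Document, Settings, and VectorStoreIndex exports used above.

import { promises as fs } from "node:fs";
import path from "node:path";
import { Document, Settings, VectorStoreIndex } from "llamaindex";

Settings.chunkSize = 512;
Settings.chunkOverlap = 64;

async function buildIndexIncrementally(dir: string) {
  let index: VectorStoreIndex | null = null;

  for (const file of await fs.readdir(dir)) {
    // Read and index one file at a time so the full corpus is never
    // held in memory as a single array of documents.
    const text = await fs.readFile(path.join(dir, file), "utf8");
    const doc = new Document({ text, id_: file });

    if (!index) {
      index = await VectorStoreIndex.fromDocuments([doc]);
    } else {
      await index.insert(doc);
    }
  }
  return index;
}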
Other Possible Causes
1) Chunk size is too large
Large chunks create huge embeddings and oversized prompt contexts. That increases both embedding memory and inference memory.
import { Settings } from "llamaindex";
Settings.chunkSize = 2048; // risky for long documents
Settings.chunkOverlap = 200;
Use something like this instead:
Settings.chunkSize = 512;
Settings.chunkOverlap = 64;
2) Too many retrieved nodes are stuffed into the prompt
If you set similarityTopK too high, the retriever sends too much context to the LLM. That inflates token count and can blow up memory during prompt assembly.
const queryEngine = index.asQueryEngine({
  similarityTopK: 20, // often too high in production
});
Safer:
const queryEngine = index.asQueryEngine({
  similarityTopK: 3,
});
If you need broader recall, use reranking or multi-step retrieval instead of dumping everything into one prompt.
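One way to do that without a dedicated reranking model is a simple two-stage pass: retrieve a wider candidate set, then keep only the strongest matches before they reach the LLM. Here is a rough sketch, assuming your llamaindex version exposes asRetriever({ similarityTopK }) and a retrieve() call that returns scored nodes (check the exact retrieve() signature in your release); the query string is just a placeholder.

const retriever = index.asRetriever({ similarityTopK: 10 });

// Pull a wider candidate set, then keep only the best-scoring nodes so the
// prompt stays small. Slicing by score is a stand-in for a real reranker.
const candidates = await retriever.retrieve("What is the termination clause?");
const topNodes = [...candidates]
  .sort((a, b) => (b.score ?? 0) - (a.score ?? 0))
  .slice(0, 3);

From there, feed only topNodes into your synthesis step instead of everything the retriever found.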
3) Response synthesis keeps long outputs in memory
Some response modes accumulate all source text before generating a final answer. On large corpora, that becomes expensive fast.
const queryEngine = index.asQueryEngine({
  responseMode: "compact", // can still be heavy with large retrieval sets
});
Prefer tighter retrieval plus a mode that reduces prompt growth. If your version supports it, test a more incremental synthesis strategy rather than one that concatenates everything.
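For example, combining a low top-k with the compact mode shown above keeps both the retrieval set and the intermediate synthesis state small. This reuses the same asQueryEngine option names as the snippets above; confirm your llamaindex version accepts them.

const queryEngine = index.asQueryEngine({
  similarityTopK: 3, // small retrieval set
  responseMode: "compact", // avoid stacking large intermediate prompts
});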
4) Embedding model or LLM context window mismatch
A small container plus a large-context model can still OOM if you feed it huge prompts. This gets worse when using local models or self-hosted inference endpoints.
import { OpenAIEmbedding } from "llamaindex";

Settings.embedModel = new OpenAIEmbedding({
  model: "text-embedding-3-large",
});
That model is fine, but if your chunking and retrieval are sloppy, the embedding payload grows fast. Reduce chunk size and top-k first.
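If chunking and retrieval are already tight and memory is still an issue, a smaller embedding model also shrinks the vector held for every chunk. This sketch uses the same OpenAIEmbedding constructor shown above; text-embedding-3-small is one example of a lighter model (1536 dimensions versus 3072 for the large variant).

import { OpenAIEmbedding, Settings } from "llamaindex";

Settings.chunkSize = 512;
Settings.chunkOverlap = 64;

// Smaller vectors roughly halve the memory held per indexed chunk.
Settings.embedModel = new OpenAIEmbedding({
  model: "text-embedding-3-small",
});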
How to Debug It
- Check whether the crash happens during ingestion or query time. If it dies on SimpleDirectoryReader.loadData() or VectorStoreIndex.fromDocuments(), it’s ingestion pressure. If it dies on .query() or .chat(), it’s prompt/context pressure.
- Log document counts and chunk counts. Print how many docs you load and how many nodes get created before indexing, for example console.log({ docCount: docs.length }). If doc count is small but node count explodes, your splitter settings are the problem (see the node-count sketch after this list).
- Lower chunk size and top-k temporarily. Set Settings.chunkSize = 256 and similarityTopK = 2. If the error disappears, you’ve confirmed a context bloat issue.
- Watch real process memory. In Node.js, run setInterval(() => { console.log(process.memoryUsage()); }, 5000). If heap usage climbs steadily during indexing, you’re retaining arrays or building too much in one pass.
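To get the node count mentioned in the second step, you can run the splitter yourself before building the index. A minimal sketch, assuming SentenceSplitter and its getNodesFromDocuments() method are available in your llamaindex version:

import { SentenceSplitter } from "llamaindex";

const splitter = new SentenceSplitter({ chunkSize: 512, chunkOverlap: 64 });

// Split without embedding anything, so you can see how many nodes your
// settings produce before committing to a full indexing pass.
const nodes = await splitter.getNodesFromDocuments(docs);
console.log({ docCount: docs.length, nodeCount: nodes.length });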
Prevention
- Keep chunk sizes conservative: start with 256–512 tokens unless you have a reason not to.
- Batch ingestion for anything beyond a few hundred documents.
- Keep retrieval tight: low similarityTopK, then add reranking if needed.
- Test with production-like data volumes locally before shipping.
- If you run in Docker/Kubernetes, set explicit memory limits and monitor OOMKilled events (see the heap-limit check below).
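A quick way to correlate OOMKilled events with heap pressure is to log the heap ceiling Node actually sees inside the container next to current usage, using Node's built-in v8 module:

import v8 from "node:v8";

// heap_size_limit is the ceiling Node enforces before the
// "Reached heap limit" crash; compare it to the container's memory limit.
const { heap_size_limit } = v8.getHeapStatistics();
console.log({
  heapLimitMB: Math.round(heap_size_limit / 1024 / 1024),
  heapUsedMB: Math.round(process.memoryUsage().heapUsed / 1024 / 1024),
});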
If you’re seeing FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory alongside LlamaIndex calls like VectorStoreIndex.fromDocuments() or .query(), treat it as a pipeline design issue first. In TypeScript apps, OOMs are usually caused by how data flows through LlamaIndex, not by LlamaIndex itself.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.