# How to Fix 'OOM error during inference when scaling' in LlamaIndex (TypeScript)
When you see `OOM error during inference when scaling` in a LlamaIndex TypeScript app, it usually means your process ran out of memory while generating embeddings, calling an LLM, or building a large index. It tends to show up after you move from local testing to real traffic, larger documents, or concurrent requests.
In practice, this is rarely a “LlamaIndex bug.” It’s usually a batching, concurrency, or context-size problem in your own code.
## The Most Common Cause
The #1 cause is loading too much data into memory at once during indexing or inference.
A common pattern is reading every file, creating every node, and embedding everything in one shot. That works for small datasets and then falls over as soon as the corpus grows.
| Broken pattern | Fixed pattern |
|---|---|
| Load all documents, split aggressively, embed everything concurrently | Process documents in chunks and cap concurrency |
```typescript
// Broken: everything happens at once
import { SimpleDirectoryReader, VectorStoreIndex } from "llamaindex";

async function buildIndex() {
  const reader = new SimpleDirectoryReader("./docs");
  const docs = await reader.loadData();
  // Large corpus + default concurrency can blow memory
  const index = await VectorStoreIndex.fromDocuments(docs);
  return index;
}
```
```typescript
// Fixed: batch documents before indexing
import { SimpleDirectoryReader, VectorStoreIndex } from "llamaindex";

async function buildIndex() {
  const reader = new SimpleDirectoryReader("./docs");
  const docs = await reader.loadData();

  const batchSize = 50;
  let index;

  for (let i = 0; i < docs.length; i += batchSize) {
    const batch = docs.slice(i, i + batchSize);
    if (!index) {
      // First batch creates the index
      index = await VectorStoreIndex.fromDocuments(batch);
    } else {
      // Later batches are inserted incrementally
      await index.insertDocuments(batch);
    }
  }
  return index!;
}
```
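The batching loop above generalizes into a small helper. This is a plain TypeScript sketch with no LlamaIndex dependency, and `batches` is an illustrative name, not a library function:

```typescript
// Split `items` into consecutive chunks of at most `size` elements.
function batches<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}
```

The loop body then becomes `for (const batch of batches(docs, batchSize)) { ... }`, which keeps the chunking logic out of your indexing code.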
If you’re doing inference with a chat engine, the same issue appears when you send too much context into the prompt. You’ll often see errors like:

- `Error: OOM error during inference when scaling`
- `RangeError: Invalid string length`
- `Context length exceeded`
- Provider-side errors like `400 Bad Request: maximum context length exceeded`
## Other Possible Causes
### 1) Too many concurrent requests

If you fire off parallel queries with `Promise.all`, memory spikes fast.

```typescript
// Bad: every query runs at once
await Promise.all(queries.map((q) => queryEngine.query(q)));

// Better: run queries sequentially
for (const q of queries) {
  await queryEngine.query(q);
}
```
If you need concurrency, cap it:

```typescript
import pLimit from "p-limit";

// Allow at most 2 queries in flight at a time
const limit = pLimit(2);
await Promise.all(queries.map((q) => limit(() => queryEngine.query(q))));
```
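If you would rather not add a dependency, the same cap can be hand-rolled with a small worker pool. This is a generic sketch; the `runLimited` helper is illustrative, not part of LlamaIndex:

```typescript
// Run async task factories with at most `limit` of them in flight at once.
async function runLimited<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;

  // Each worker repeatedly claims the next unclaimed task index.
  const worker = async (): Promise<void> => {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  };

  const workers = Array.from({ length: Math.min(limit, tasks.length) }, worker);
  await Promise.all(workers);
  return results;
}
```

Because JavaScript is single-threaded, the shared `next` counter needs no locking; each worker claims an index synchronously before awaiting its task.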
### 2) Chunk size is too large

Large chunks mean fewer nodes, but each node carries more text into embeddings and retrieval.

```typescript
// Too large: each chunk drags thousands of tokens through the pipeline
const splitter = new SentenceSplitter({ chunkSize: 4096, chunkOverlap: 200 });
```

```typescript
// Safer for most RAG workloads
const splitter = new SentenceSplitter({ chunkSize: 512, chunkOverlap: 50 });
```

This matters more if you use `VectorStoreIndex` with an embedding model that has strict token limits.
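To sanity-check whether chunks fit such limits before embedding, a rough character-based estimate is often enough. The ~4 characters-per-token ratio is a heuristic for English text, not a real tokenizer, and both helpers here are illustrative names rather than LlamaIndex APIs:

```typescript
// Rough token estimate using the ~4 characters-per-token heuristic.
// An approximation only; a real tokenizer will differ per model.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Flag chunks that would likely exceed an embedding model's token limit.
function oversizedChunks(chunks: string[], maxTokens: number): string[] {
  return chunks.filter((chunk) => estimateTokens(chunk) > maxTokens);
}
```

Logging `oversizedChunks(allChunks, 512).length` after splitting is a cheap early warning before you pay for a failed embedding run.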
### 3) Retrieving too many nodes

Pulling back `similarityTopK: 20` or 50 can balloon the prompt before inference starts.

```typescript
// Too many: 20 chunks of context flow into every prompt
const retriever = index.asRetriever({ similarityTopK: 20 });
```

```typescript
// Start small and tune upward
const retriever = index.asRetriever({ similarityTopK: 3 });
```

For chat-style RAG, start small. Increase only if retrieval quality is clearly bad.
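A defensive complement to a low `similarityTopK` is trimming retrieved chunks to a character budget before the prompt is built. This is a generic sketch, not a LlamaIndex API; it assumes chunks arrive already sorted by relevance:

```typescript
// Keep retrieved chunks, in ranked order, until a character budget is hit.
function fitToBudget(chunks: string[], maxChars: number): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    if (used + chunk.length > maxChars) break; // stop before overflowing
    kept.push(chunk);
    used += chunk.length;
  }
  return kept;
}
```

Dropping the lowest-ranked chunks first costs little retrieval quality and puts a hard ceiling on prompt size regardless of how `similarityTopK` is tuned.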
### 4) Model or provider mismatch

Sometimes the issue is not LlamaIndex itself. A smaller local model may not handle the same prompt size that worked on OpenAI or Anthropic.

```typescript
// A local model: memory limits now depend on your own hardware
const llm = new Ollama({
  model: "llama3",
});
```

If your local runtime has limited RAM/VRAM, reduce:

- context window
- max output tokens
- number of retrieved chunks

Also check whether your embedding model is running locally. Local embeddings plus local generation doubles memory pressure.
## How to Debug It

1. **Isolate indexing vs inference**
   - Run just document ingestion first.
   - Then run just one query against a tiny dataset.
   - If indexing fails, the issue is batching or document size.
   - If querying fails, it's usually context size or concurrency.

2. **Log input sizes**
   - Print document count.
   - Print average chunk length.
   - Print retrieved node count.
   - Print final prompt length before sending to the LLM.

   ```typescript
   console.log({
     docs: docs.length,
     retrievedNodes: nodes.length,
     promptChars: prompt.length,
   });
   ```

3. **Reduce everything to a minimal case**
   - One document.
   - One query.
   - One worker.
   - Small chunk size.

   If that works, scale one variable at a time until it breaks.

4. **Watch process memory**
   - Use Node heap logging or container metrics.
   - If RSS climbs steadily during ingestion, you're holding too much in memory.
   - If it spikes only at query time, look at retrieval and prompt construction.
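A minimal way to watch memory from inside the process is Node's built-in `process.memoryUsage()`; the `logMemory` wrapper below is an illustrative helper, not a library function:

```typescript
// Log resident set size and heap usage so you can see where memory climbs.
// Uses only the built-in process.memoryUsage() API.
function logMemory(label: string): void {
  const { rss, heapUsed } = process.memoryUsage();
  const toMb = (bytes: number) => (bytes / 1024 / 1024).toFixed(1);
  console.log(`[${label}] rss=${toMb(rss)}MB heapUsed=${toMb(heapUsed)}MB`);
}
```

Call it before and after each ingestion batch or query; a steadily climbing `rss` between calls points at data you are holding onto.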
## Prevention

- Batch ingestion, and avoid `Promise.all` on large query sets unless you cap concurrency.
- Keep chunk sizes modest and tune `similarityTopK` conservatively.
- Prefer external vector stores for larger corpora instead of keeping everything in-process.
- Test with production-like document sizes early. Small-doc success does not predict real-world stability.
If you’re building a TypeScript RAG service with LlamaIndex, treat memory as part of the API contract. The fix is usually not “more RAM,” it’s controlling how much data enters indexing and inference at each step.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.