# How to Fix 'OOM error during inference' in LlamaIndex (TypeScript)
When you see OOM error during inference in LlamaIndex TypeScript, it means your process ran out of memory while the model was generating a response or embedding data. In practice, this usually shows up when you send too much text at once, build a giant prompt, or load a model/context window that exceeds what your runtime can handle.
This is not a LlamaIndex bug in most cases. It’s usually a data-shape problem, batching problem, or model configuration problem.
## The Most Common Cause
The #1 cause is trying to stuff too much text into a single LLM call.
In LlamaIndex TypeScript, this often happens when you use `VectorStoreIndex.fromDocuments()` on large documents and then query without chunking properly, or when you pass an oversized context into a query engine. The model then tries to process a prompt that is too large for memory.
### Broken vs fixed pattern
| Broken | Fixed |
|---|---|
| Load full documents and query directly | Split into chunks before indexing |
| Build one huge prompt | Keep retrieved context bounded |
| Let defaults handle everything | Set chunk size and retrieval limits explicitly |
```ts
// ❌ Broken: huge documents go straight into indexing/querying
import { Document, VectorStoreIndex } from "llamaindex";

const docs = [
  new Document({ text: veryLargeContractText }),
  new Document({ text: anotherMassivePolicyText }),
];

const index = await VectorStoreIndex.fromDocuments(docs);
const queryEngine = index.asQueryEngine();
const response = await queryEngine.query({
  query: "Summarize the cancellation clause and exclusions.",
});
console.log(response.toString());
```
```ts
// ✅ Fixed: chunk the content and cap retrieval
import { Document, VectorStoreIndex, SentenceSplitter } from "llamaindex";

const docs = [
  new Document({ text: veryLargeContractText }),
  new Document({ text: anotherMassivePolicyText }),
];

const splitter = new SentenceSplitter({
  chunkSize: 512,
  chunkOverlap: 64,
});

const index = await VectorStoreIndex.fromDocuments(docs, {
  transformations: [splitter],
});

const queryEngine = index.asQueryEngine({
  similarityTopK: 3,
});

const response = await queryEngine.query({
  query: "Summarize the cancellation clause and exclusions.",
});
console.log(response.toString());
```
If you see errors like:

- `OOM error during inference`
- `JavaScript heap out of memory`
- `Failed to generate response due to memory pressure`

this is the first place to look.
## Other Possible Causes
### 1. Embedding too many nodes at once
If you ingest thousands of chunks in one shot, the embedding step can blow up memory before you even reach inference.
```ts
// Problematic: embeds every chunk in a single pass
await VectorStoreIndex.fromDocuments(hugeDocs);
```
Fix it by batching ingestion:
```ts
// Better: build the index once, then add the remaining documents in batches
// (the original loop created and discarded a fresh index per batch)
const index = await VectorStoreIndex.fromDocuments(batchedDocs[0]);
for (const batch of batchedDocs.slice(1)) {
  for (const doc of batch) {
    await index.insert(doc);
  }
}
```
If your pipeline supports it, keep batch sizes small and predictable.
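If your ingestion layer doesn't provide batching, a tiny helper is enough. This is a minimal sketch; `chunkArray` and the batch size are illustrative names and values, not llamaindex API:

```typescript
// Hypothetical helper: split an array into fixed-size batches so each
// embedding pass stays small and predictable.
function chunkArray<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

Then ingest with `for (const batch of chunkArray(hugeDocs, 50)) { /* ingest batch */ }` and tune the batch size to your memory budget.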
### 2. Context window too large for the selected model
Some models have small context windows. If your retriever returns too many nodes, the prompt gets inflated until inference fails.
```ts
const queryEngine = index.asQueryEngine({
  similarityTopK: 10, // may be too high for long chunks
});
```
Reduce retrieval size:
```ts
const queryEngine = index.asQueryEngine({
  similarityTopK: 2,
});
```
Also shorten chunk size if each chunk is large. A `chunkSize` of 2048 with `similarityTopK: 10` is asking for trouble.
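You can sanity-check that combination before calling the model. The sketch below assumes `chunkSize` is measured in tokens (as splitter settings typically are) and uses an illustrative fixed overhead for instructions and the question; `estimatePromptTokens` is a hypothetical helper, not part of llamaindex:

```typescript
// Rough estimate of prompt size: retrieved chunks plus fixed overhead.
// The 500-token overhead for instructions/question is an assumption.
function estimatePromptTokens(
  chunkSize: number,
  similarityTopK: number,
  overheadTokens = 500
): number {
  return chunkSize * similarityTopK + overheadTokens;
}
```

A `chunkSize` of 2048 with topK 10 implies roughly 20,980 tokens, beyond many context windows, while 512 with topK 3 is roughly 2,036.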
### 3. Running on a constrained Node.js heap
TypeScript apps running under Node can hit heap limits before the model does.
```bash
node app.js
```

Run with more heap if your host allows it:

```bash
node --max-old-space-size=4096 app.js
```
This helps when the crash is actually JavaScript memory pressure during document prep, serialization, or post-processing.
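To confirm the crash really is Node heap pressure rather than the model, log heap usage at pipeline checkpoints. This uses only Node's built-in `process.memoryUsage()`; the helper name is illustrative:

```typescript
// Log current V8 heap usage in MB and return it, so you can bracket
// suspect stages (ingestion, embedding, querying) with checkpoints.
function logHeap(label: string): number {
  const usedMb = process.memoryUsage().heapUsed / (1024 * 1024);
  console.log(`${label}: heap used ${usedMb.toFixed(1)} MB`);
  return usedMb;
}
```

Call it before and after each stage (`logHeap("before fromDocuments")`, `logHeap("after embedding")`); a steep climb during document prep points at JS memory, not inference.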
### 4. Using an oversized prompt template
A custom prompt that includes raw document text, chat history, and retrieved nodes can explode token count fast.
```ts
const prompt = `
Answer using all of this:
${fullPolicyText}
${chatHistory}
${retrievedContext}
`;
```
Trim it down:
```ts
const prompt = `
Use only the retrieved context below.
If the answer isn't present, say so.

Context:
${retrievedContext}
`;
```
Keep prompts surgical. Don’t paste full source documents into them.
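If you can't control how much context the retriever hands back, cap it defensively before it enters the template. `capContext` below is a hypothetical guard, not a llamaindex utility; it truncates at a rough character budget, preferring a sentence boundary:

```typescript
// Trim retrieved context to a character budget before prompt assembly.
// Cutting at the last period keeps the truncated text readable.
function capContext(context: string, maxChars: number): string {
  if (context.length <= maxChars) return context;
  const cut = context.slice(0, maxChars);
  const lastStop = cut.lastIndexOf(".");
  return lastStop > 0 ? cut.slice(0, lastStop + 1) : cut;
}
```

Then build the prompt from `capContext(retrievedContext, 8000)` (or whatever budget fits your model) instead of the raw string.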
## How to Debug It
1. Check where the crash happens
   - During `fromDocuments()` or embedding? It's probably batching/chunking.
   - During `query()` or `.chat()`? It's probably context window or prompt size.
   - During serialization/rendering? It may be Node heap pressure.

2. Log chunk counts and sizes

   ```ts
   console.log("docs:", docs.length);
   console.log(
     "avg chars:",
     docs.reduce((n, d) => n + d.text.length, 0) / docs.length
   );
   ```

   If chunks are huge or unbounded, fix splitting first.

3. Lower retrieval aggressively
   - Set `similarityTopK` to `1` or `2`
   - Reduce `chunkSize`
   - Remove extra chat history from the request

4. Test with one document

   Strip your pipeline down to one small doc and one short question. If that works, reintroduce complexity until it breaks.
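For the chunk-logging step, a small summary helper makes oversized documents obvious before indexing. This is pure TypeScript with no llamaindex calls; the `{ text: string }` shape simply mirrors how the examples above construct `Document`s:

```typescript
// Summarize document sizes so outliers stand out during debugging.
interface DocLike {
  text: string;
}

function summarizeDocs(docs: DocLike[]): {
  count: number;
  avgChars: number;
  maxChars: number;
} {
  const lengths = docs.map((d) => d.text.length);
  const total = lengths.reduce((sum, len) => sum + len, 0);
  return {
    count: docs.length,
    avgChars: docs.length ? Math.round(total / docs.length) : 0,
    maxChars: lengths.length ? Math.max(...lengths) : 0,
  };
}
```

A `maxChars` far above `avgChars` usually means one unsplit document is inflating your prompts.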
## Prevention
- Always chunk documents before indexing. In TypeScript LlamaIndex apps, default splitting is rarely enough for long legal, insurance, or banking content.
- Keep retrieval bounded. Start with a low `similarityTopK`, then increase only if recall is clearly poor.
- Watch total prompt size end-to-end: document chunks + chat history + instructions + output budget all count toward memory pressure.
If you’re building agent workflows for regulated domains, treat memory limits as part of your design constraints, not as an ops issue after deployment. The fix is usually smaller chunks, fewer retrieved nodes, and tighter prompts.
## Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit, a PDF checklist + starter code
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.