How to Fix 'OOM error during inference' in LangChain (TypeScript)

By Cyprian Aarons · Updated 2026-04-21
Tags: oom-error-during-inference, langchain, typescript

When you see OOM error during inference in a LangChain TypeScript app, it means your process ran out of memory while building prompts, loading documents, or calling the model. In practice, this usually shows up when you stuff too much text into a single prompt, keep too many documents in memory, or stream large outputs through a Node process with a small heap.

The failure is usually not “LangChain is broken.” It’s almost always a data-shaping problem: too much context, too many tokens, or an accidental full-document load before the LLM call.

The Most Common Cause

The #1 cause is passing far too much text into the prompt chain at once.

In LangChain JS/TS, this often happens with StuffDocumentsChain, createStuffDocumentsChain, or manual prompt concatenation. You load a bunch of documents, then stuff all of them into one prompt and send that to the model.

Broken pattern: Load everything and stuff it into one call
Fixed pattern: Split docs and use retrieval / map-reduce / smaller chunks
// BROKEN: everything gets stuffed into one prompt
import { ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";

const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

const docs = [
  new Document({ pageContent: hugeText1 }),
  new Document({ pageContent: hugeText2 }),
  new Document({ pageContent: hugeText3 }),
];

const prompt = ChatPromptTemplate.fromMessages([
  ["system", "Answer using the documents."],
  ["human", "{context}\n\nQuestion: {question}"],
]);

const chain = await createStuffDocumentsChain({
  llm,
  prompt,
});

await chain.invoke({
  context: docs,
  question: "Summarize the policy",
});

// FIXED: reduce the amount of text sent to the model
import { ChatOpenAI } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";

const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1200,
  chunkOverlap: 150,
});

const splitDocs = await splitter.splitDocuments([
  new Document({ pageContent: hugeText1 }),
  new Document({ pageContent: hugeText2 }),
]);

const prompt = ChatPromptTemplate.fromMessages([
  ["system", "Answer using only the provided context."],
  ["human", "{context}\n\nQuestion: {question}"],
]);

const chain = await createStuffDocumentsChain({
  llm,
  prompt,
});

// pass only relevant chunks after retrieval/filtering
await chain.invoke({
  context: splitDocs.slice(0, 4),
  question: "Summarize the policy",
});

If you’re using retrieval, don’t skip the retriever. A vector store should narrow context before inference. If you bypass that and pass raw documents directly, you’ll hit memory pressure fast.
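
For a concrete picture of what "narrow context before inference" can look like, here is a minimal sketch using an in-memory vector store. It assumes the splitDocs and chain from the fixed example above; MemoryVectorStore and OpenAIEmbeddings stand in for whichever store and embedding model you actually run.

// SKETCH: retrieve only the top-k relevant chunks instead of passing everything
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

// index the split chunks once (splitDocs comes from the fixed example above)
const vectorStore = await MemoryVectorStore.fromDocuments(
  splitDocs,
  new OpenAIEmbeddings()
);

// only the retrieved chunks ever reach the prompt
const retriever = vectorStore.asRetriever({ k: 4 });
const question = "Summarize the policy";
const relevantDocs = await retriever.invoke(question);

await chain.invoke({
  context: relevantDocs,
  question,
});

On older LangChain versions, retriever.getRelevantDocuments(question) is the equivalent of retriever.invoke(question).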

Other Possible Causes

1) Your Node heap is too small

If you’re running local inference, large embeddings, or heavy document processing in Node, the default heap can be too tight.

# Increase heap size for Node
NODE_OPTIONS="--max-old-space-size=4096" npm run dev

This helps when the crash is happening in your app process, not inside a remote API call.
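
If you are not sure whether you are actually near the heap ceiling, Node's built-in v8 module reports the configured limit alongside current usage. This is plain Node, not LangChain; a quick check around your heaviest processing step shows whether the flag above is the right lever:

// SKETCH: compare current heap usage against the configured heap limit
import v8 from "node:v8";

const { used_heap_size, heap_size_limit } = v8.getHeapStatistics();
console.log(
  `heap: ${(used_heap_size / 1e6).toFixed(0)} MB used of ${(heap_size_limit / 1e6).toFixed(0)} MB limit`
);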

2) You’re loading entire files into memory

A common mistake is reading PDFs, HTML dumps, or JSON exports all at once before chunking.

// BROKEN: reads the entire file into memory at once
import fs from "node:fs";

const raw = await fs.promises.readFile("large-policy.pdf");

// BETTER: use loaders + chunking pipeline
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const loader = new PDFLoader("large-policy.pdf");
const pages = await loader.load();

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 100,
});

const chunks = await splitter.splitDocuments(pages);

3) Your model context window is smaller than you think

Some models will fail hard when prompt + output exceeds their context window. With ChatOpenAI, ChatAnthropic, or AzureChatOpenAI, token limits matter.

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
  maxTokens: 200,
});

If your input is already huge, lowering maxTokens won’t fix it by itself. It only reduces output size; it does not shrink your prompt.
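
What does help on the input side is estimating prompt size before invoking the model. The sketch below uses a rough 4-characters-per-token heuristic (an approximation, not an exact tokenizer), a hypothetical MAX_INPUT_TOKENS budget, and the relevantDocs and question variables from the retrieval sketch above:

// SKETCH: rough input-size guard before inference (4 chars/token is only an estimate)
const MAX_INPUT_TOKENS = 100_000; // pick a budget below your model's context window

const estimateTokens = (text: string) => Math.ceil(text.length / 4);

const contextText = relevantDocs.map((d) => d.pageContent).join("\n\n");
const estimated = estimateTokens(contextText) + estimateTokens(question);

if (estimated > MAX_INPUT_TOKENS) {
  throw new Error(`Prompt too large: ~${estimated} tokens estimated before the call`);
}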

4) You are accidentally retaining large arrays between requests

This happens in server code where you cache every document batch or chain result in a module-level variable.

// BROKEN
const historyCache: any[] = [];

export async function handler(input: string) {
  historyCache.push(input); // grows forever
}

Keep request-scoped data request-scoped. If you need persistence, store summaries or IDs, not full transcripts.
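
A fixed version of the handler keeps the large data request-scoped and persists only a small, bounded record. In this sketch, runChain stands in for your actual chain invocation, and the 500-character summary and 100-entry cap are arbitrary placeholders:

// BETTER: nothing large outlives the request; persist a small, bounded record
import { randomUUID } from "node:crypto";

// placeholder for your actual chain invocation
declare function runChain(input: string): Promise<string>;

type SummaryRecord = { id: string; summary: string };

const recentSummaries: SummaryRecord[] = [];
const MAX_CACHE_ENTRIES = 100;

export async function handler(input: string) {
  // the full prompt and response stay local to this call
  const answer = await runChain(input);

  // keep only a short summary and cap the cache size
  recentSummaries.push({ id: randomUUID(), summary: answer.slice(0, 500) });
  if (recentSummaries.length > MAX_CACHE_ENTRIES) {
    recentSummaries.shift();
  }

  return answer;
}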

How to Debug It

  1. Log token estimates before invocation

    • Measure input size before calling .invoke().
    • If your chain uses large docs, print character counts and chunk counts.
    • Example:
    console.log("chunks:", docs.length);
    console.log("chars:", docs.reduce((sum, d) => sum + d.pageContent.length, 0));
    
  2. Isolate whether the crash happens before or during model call

    • Wrap these stages separately (a sketch follows this list):
      • loading documents
      • splitting documents
      • retrieving top-k chunks
      • invoking the LLM
    • If it dies before llm.invoke(), it’s probably preprocessing memory.
    • If it dies during invoke(), it’s likely prompt size or output size.
  3. Reduce input by half until it stops failing

    • Cut chunkSize, lower retriever k, and shorten system prompts.
    • If the error disappears when k=3 but returns at k=8, your retrieval fan-in is too high.
  4. Check runtime memory metrics

    • In Node:
    console.log(process.memoryUsage());
    
    • Watch heapUsed climb across requests.
    • A steady climb points to a leak or retained arrays; a sudden spike points to one oversized inference call.
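
To make the before/during split concrete, the sketch below logs heap usage after each stage. It reuses loader, splitter, retriever, chain, and question from the earlier examples, and the logStage helper is purely illustrative; whatever prints last before the crash is the stage to shrink:

// SKETCH: log heap usage after each stage to see where memory climbs
const logStage = (label: string) => {
  const mb = process.memoryUsage().heapUsed / 1e6;
  console.log(`${label}: heapUsed=${mb.toFixed(0)} MB`);
};

const pages = await loader.load();
logStage("after load");

const chunks = await splitter.splitDocuments(pages);
logStage("after split");

const relevant = await retriever.invoke(question);
logStage("after retrieval");

const answer = await chain.invoke({ context: relevant, question });
logStage("after invoke");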

Prevention

  • Use chunking as a default.
    • For document workflows, split first and retrieve second.
  • Cap retrieval results.
    • Start with k=3 or k=4, not k=20.
  • Keep prompts short.
    • Put instructions in the system message once.
    • Don’t repeat policies and examples across every request unless necessary.
  • Set operational limits.
const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxTokens: 500,     // cap output size per call
  maxConcurrency: 2,  // limit how many model calls are in flight at once
});
  • Add guards before inference:
if (docs.length > 8) {
  throw new Error("Too many documents for single-pass inference");
}

If you hit OOM error during inference in LangChain TypeScript, start with the input shape. In most cases, fixing document chunking and retrieval removes the problem without touching infrastructure.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
