
Can dmn be run in parts? #7

Open
marcinschmidt opened this issue Feb 15, 2023 · 2 comments

Comments


marcinschmidt commented Feb 15, 2023

I've got quite a large dataset I want to analyse with dmn. Running it with
fit <- mclapply(1:20, dmn, count=count, verbose=TRUE)
on my desktop did not complete within 30 days (using all 4 cores). A power outage probably cancelled the calculations, as the system was rebooted. I divided the dataset into parts and also ran it on a server. Some parts finished, but the server has a 7-day limit and some parts needed more time. I would prefer to run the full dataset.

Can I replace
fit <- mclapply(1:20, dmn, count=count, verbose=TRUE)
with

fit1 <- mclapply(1:7, dmn, count=count, verbose=TRUE)
fit2 <- mclapply(8:14, dmn, count=count, verbose=TRUE)
fit3 <- mclapply(15:20, dmn, count=count, verbose=TRUE)

How do I combine fit1 (1:7), fit2 (8:14), and fit3 (15:20) into fit (1:20)?

@mtmorgan
Owner

mclapply() just returns a list, so combining them is simply c(fit1, fit2, fit3). The vignette outlines additional steps to extract and work with individual components of the objects returned by dmn().
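For concreteness, a minimal sketch of the combine-and-select step, following the laplace() model-selection approach described in the vignette (fit1, fit2, fit3 are the lists from the three mclapply() calls above; variable names are illustrative):

library(DirichletMultinomial)

fit <- c(fit1, fit2, fit3)            # single list of 20 DMN fits, k = 1..20
lplc <- sapply(fit, laplace)          # Laplace approximation; lower is better
plot(seq_along(fit), lplc, type = "b",
     xlab = "Number of Dirichlet components", ylab = "Model fit (Laplace)")
best <- fit[[which.min(lplc)]]        # fit with the best-supported number of components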

I wonder how big your data is? I also wonder whether the long running time is due to the size of the data or to some other limitation, e.g., memory use.

Also, is there something you could do upstream to make the data smaller, e.g., some kind of dimensionality reduction before doing the 'full' analysis? I have not worked in this space for a while, so I don't know whether that is a good idea or not.
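As one illustration of shrinking the data upstream, here is a sketch that simply drops rare taxa before fitting; it assumes count is the samples x taxa matrix passed to dmn(), and the 10% prevalence threshold is arbitrary, not a recommendation:

prevalence <- colMeans(count > 0)                        # fraction of samples in which each taxon occurs
count_small <- count[, prevalence >= 0.10, drop = FALSE] # keep taxa seen in at least 10% of samples
dim(count_small)                                         # fewer columns than the full matrix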


marcinschmidt commented Feb 17, 2023

Hi! I ran my data in chunks of dimensions [189, 8693], [191, 8693], and [197, 8693]. On the server I used most recently, benchmarkme::get_ram() returns 201 GB and parallel::detectCores() returns 48, while plot(benchmarkme::benchmark_std()) gives:

[benchmark plot] You are ranked 192 out of 749 machines.
[benchmark plot] You are ranked 419 out of 747 machines.
[benchmark plot] You are ranked 392 out of 747 machines.

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Does that mean anything specific to you? I'm a biologist... and that is the most powerful machine I can use. Dimensionality reduction might be a solution; I will give it a try. When I submit my job to the queue (SLURM) I use:

#SBATCH --nodes=1                # node count
#SBATCH --ntasks-per-node=4
#SBATCH --mem=38gb
#SBATCH --time=6-23:59:00          # total run time limit (D-HH:MM:SS)

I might try increasing the number of nodes and the memory to 128 or even 256 GB, but the time limit is 7 days in any case.
Let me know if you have any ideas.
Best regards, Marcin
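One possible way around the 7-day limit, sketched here as an untested idea rather than a recommendation: fit each number of components as a separate SLURM array task (e.g. adding #SBATCH --array=1-20 to the script above), save each fit with saveRDS(), and combine the saved fits afterwards with c() as described earlier. The file names and the count.rds object below are assumptions.

library(DirichletMultinomial)

count <- readRDS("count.rds")                           # assumed: the samples x taxa matrix saved beforehand
k <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", "1")) # one k value per array task
fit_k <- dmn(count = count, k = k, verbose = TRUE)
saveRDS(fit_k, sprintf("dmn_fit_k%02d.rds", k))

# later, interactively, collect all 20 fits into one list:
# fit <- lapply(1:20, function(k) readRDS(sprintf("dmn_fit_k%02d.rds", k)))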
