
Can dmn be run in parts? #7

Open
marcinschmidt opened this issue Feb 15, 2023 · 2 comments

Comments


marcinschmidt commented Feb 15, 2023

I've got quite a large dataset I want to analyse with dmn. Running it with
fit <- mclapply(1:20, dmn, count=count, verbose=TRUE)
on my desktop did not complete within 30 days (using all 4 cores). A power outage probably cancelled the calculations, as the system was rebooted. I divided the dataset into parts and also ran it on a server. Some parts finished, but the server has a 7-day limit and some parts needed more time. I would prefer to run the full dataset.

Can I replace
fit <- mclapply(1:20, dmn, count=count, verbose=TRUE)
with

fit1 <- mclapply(1:7, dmn, count=count, verbose=TRUE)
fit2 <- mclapply(8:14, dmn, count=count, verbose=TRUE)
fit3 <- mclapply(15:20, dmn, count=count, verbose=TRUE)

How do I combine fit1 (1:7), fit2 (8:14), and fit3 (15:20) into fit (1:20)?

@mtmorgan
Owner

mclapply() just returns a list, so combining them is simply c(fit1, fit2, fit3). The vignette outlines additional steps to extract and work with individual components of the objects returned by dmn().
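For concreteness, a minimal sketch of the combine-and-select step, following the laplace() model-selection approach described in the vignette (fit1, fit2, fit3 are the lists from the three mclapply() calls above; variable names are illustrative):

library(DirichletMultinomial)

fit <- c(fit1, fit2, fit3)            # single list of 20 DMN fits, k = 1..20
lplc <- sapply(fit, laplace)          # Laplace approximation; lower is better
plot(seq_along(fit), lplc, type = "b",
     xlab = "Number of Dirichlet components", ylab = "Model fit (Laplace)")
best <- fit[[which.min(lplc)]]        # fit with the best-supported number of components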

I wonder how big your data is? I also wonder whether the long running time is due to the size of the data or to some other limitation, e.g., memory use.

Also, is there something you could do upstream to make the data smaller, e.g., some kind of dimensionality reduction before doing the 'full' analysis? I have not worked in this space for a while, so I don't know whether that is a good idea or not.
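As one illustration of shrinking the data upstream, here is a sketch that simply drops rare taxa before fitting; it assumes count is the samples x taxa matrix passed to dmn(), and the 10% prevalence threshold is arbitrary, not a recommendation:

prevalence <- colMeans(count > 0)                        # fraction of samples in which each taxon occurs
count_small <- count[, prevalence >= 0.10, drop = FALSE] # keep taxa seen in at least 10% of samples
dim(count_small)                                         # fewer columns than the full matrix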


marcinschmidt commented Feb 17, 2023

Hi! I ran my data in chunks of dimensions [189, 8693], [191, 8693], and [197, 8693]. On the server I used most recently, benchmarkme::get_ram() returns 201 GB and parallel::detectCores() returns 48, while plot(benchmarkme::benchmark_std()) gives:

[benchmark plot] You are ranked 192 out of 749 machines.
[benchmark plot] You are ranked 419 out of 747 machines.
[benchmark plot] You are ranked 392 out of 747 machines.

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Does that mean anything specific to you? I'm a biologist... and that is the most powerful machine I can use. Dimensionality reduction might be a solution; I will give it a try. When I submit my job to the queue (SLURM) I use:

#SBATCH --nodes=1                # node count
#SBATCH --ntasks-per-node=4
#SBATCH --mem=38gb
#SBATCH --time=6-23:59:00          # total run time limit (D-HH:MM:SS)

I might try increasing the number of nodes and the memory to 128 or even 256 GB, but the time limit is 7 days in any case.
Let me know if you have any ideas.
Best regards, Marcin
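One possible way around the 7-day limit, sketched here as an untested idea rather than a recommendation: fit each number of components as a separate SLURM array task (e.g. adding #SBATCH --array=1-20 to the script above), save each fit with saveRDS(), and combine the saved fits afterwards with c() as described earlier. The file names and the count.rds object below are assumptions.

library(DirichletMultinomial)

count <- readRDS("count.rds")                           # assumed: the samples x taxa matrix saved beforehand
k <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", "1")) # one k value per array task
fit_k <- dmn(count = count, k = k, verbose = TRUE)
saveRDS(fit_k, sprintf("dmn_fit_k%02d.rds", k))

# later, interactively, collect all 20 fits into one list:
# fit <- lapply(1:20, function(k) readRDS(sprintf("dmn_fit_k%02d.rds", k)))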
