Investigate persisting input dataset in cluster memory on GWAS performance #449

tomwhite · 2021-01-28T17:49:40Z

In #390 it might help performance to persist the dosage data in cluster memory, and so avoid loading it from Zarr for each phenotype. This could be tested with the benchmark that uses simulated data.

tomwhite · 2021-02-01T17:17:54Z

I did a quick test for this (notebook), by calling the following before calling the gwas function:

from dask.distributed import wait
XL = XL.persist()
wait(XL)

This has the effect of pinning the biggest matrix in the cluster memory.
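For anyone wanting to try the pattern without a cluster, here is a minimal sketch of the same idea on Dask's local scheduler. The array shapes, the random data and the names `Y` and `projections` are assumptions for illustration only; the real `XL` is read lazily from Zarr and passed to the gwas function rather than used in a plain matrix product.

```python
import dask.array as da

# Hypothetical stand-in for the dosage matrix (variants x samples). In the
# issue this array is loaded lazily from Zarr, not generated randomly.
XL = da.random.random((2_000, 500), chunks=(500, 500))

# persist() materialises the chunks and returns an array of the same shape
# whose graph points at the in-memory results. On the local scheduler this
# call blocks; on a distributed cluster it returns immediately, which is why
# the snippet above follows it with wait(XL).
XL = XL.persist()

# Hypothetical phenotypes, one column each. Every pass reuses the persisted
# chunks instead of recomputing (or re-reading) the input.
n_phenotypes = 3
Y = da.random.random((500, n_phenotypes), chunks=(500, 1))
projections = [XL.dot(Y[:, i]).compute() for i in range(n_phenotypes)]
```

The trade-off is that the persisted array has to fit in (cluster) memory alongside the intermediate results of the computation.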

The time went from 103s (load from Zarr + gwas) to 66s (gwas only). Since we run a computation for each phenotype on the same input, there are potential savings here of ~35%.
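As a rough check of how that saving builds up over phenotypes, here is a back-of-the-envelope sketch. It assumes (this is not measured above) that the load and gwas costs simply add, that the load costs 103s - 66s = 37s, and that with persistence the load is paid once rather than once per phenotype. The function names are made up for illustration.

```python
LOAD_PLUS_GWAS_S = 103  # measured: load from Zarr + gwas
GWAS_ONLY_S = 66        # measured: gwas on persisted input
LOAD_S = LOAD_PLUS_GWAS_S - GWAS_ONLY_S  # assumed cost of the load alone


def total_runtime(n_phenotypes, persist):
    """Estimated wall-clock seconds for n_phenotypes sequential runs."""
    if persist:
        # Load once, then run gwas per phenotype.
        return LOAD_S + n_phenotypes * GWAS_ONLY_S
    # Reload the input for every phenotype.
    return n_phenotypes * LOAD_PLUS_GWAS_S


def saving(n_phenotypes):
    """Fraction of the non-persisted runtime saved by persisting."""
    baseline = total_runtime(n_phenotypes, persist=False)
    return (baseline - total_runtime(n_phenotypes, persist=True)) / baseline


for n in (1, 10, 100):
    print(n, round(saving(n), 3))
```

Under these assumptions there is no saving for a single phenotype, and the saving approaches 37/103 (about 36%) as the number of phenotypes grows.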

Performance reports:

(BTW I noticed that the actual computation times shown in the task stream were ~75s and ~40s respectively, so the saving may actually be more like 50%. I wonder if the missing time goes to serializing the task graph - if so then hopefully this will be improved by some of the work going on in Dask and Dask Distributed to move to HighLevelGraph. There's some discussion of the problem in dask/distributed#3872. We might also look at increasing chunk sizes to alleviate the problem.)
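To illustrate why chunk size matters for graph overhead, here is a small sketch comparing the number of chunks and tasks for the same array shape at two chunk sizes. The shape and chunk sizes are arbitrary assumptions, not the values used in the benchmark, and the column sum is just a placeholder for the real computation.

```python
import dask.array as da

shape = (10_000, 1_000)  # hypothetical variants x samples

# Same data, two chunkings: many small chunks vs a few large ones.
fine = da.zeros(shape, chunks=(100, 100))
coarse = da.zeros(shape, chunks=(1_000, 1_000))

# Number of chunks in each array.
print(fine.npartitions, coarse.npartitions)

# Number of tasks in the graph for a simple reduction over each array.
# Fewer, larger chunks give a smaller graph for the scheduler to
# serialize and track, at the cost of more memory per task.
fine_tasks = len(fine.sum(axis=0).__dask_graph__())
coarse_tasks = len(coarse.sum(axis=0).__dask_graph__())
print(fine_tasks, coarse_tasks)
```

Whether larger chunks help in practice depends on worker memory and on how the Zarr store itself is chunked, so this would need to be measured on the actual benchmark.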

tomwhite added the performance label Jan 28, 2021

This was referenced Feb 3, 2021

Investigate use of preemptible GCP instances for GWAS #453

Closed

Identify lack of scalability in gwas_linear_regression #390

Open

tomwhite closed this as completed Jan 4, 2023
