Fb matrix divider #328

aliabdolali · 2021-03-09T00:10:48Z

Hi @thesser1 @ukmo-ccbunney @mickaelaccensi
Following our discussion on the slow speed of our regression suite, I modified the script I used to divide the main matrix into a given number of subsets (set by defining the number of tests in each subset) and tested it with the most recent code. The entire regtest took 3-4 hr in 7 subsets (100 tests each) on 24 cores. It would definitely be faster if we changed n=24 to n=8 cores.
There is still room for improvement, but I wanted to provide an engineering solution without touching any part of the code. This way, we just duplicate the model directory and let each individual matrix do its simulation separately from the same parent and store all the solutions in one place.

The extra step after matrix file generation is defining the max number of tests in each subset and executing ./bin/matrix_divider.sh
@JessicaMeixner-NOAA, what do you think?

@thesser1 @ukmo-ccbunney @mickaelaccensi could you check it at your end and provide suggestions?
One thing that I'll add in the future is dividing the parallel vs serial tests, so we can optimize the number of tests in each subset better. It will be an easy implementation.
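
For readers following along, the splitting step described above can be sketched roughly as follows. This is a minimal illustration, not the code in matrix_divider.sh: it assumes that test lines are the ones containing `run_test` and that subset files are named by appending a counter to a prefix. The function name `split_tests` is hypothetical.

```shell
#!/bin/sh
# Sketch only: split the test lines of a matrix file into numbered subsets.
# Assumptions (not taken from the PR): test lines contain "run_test";
# subset files are written as <prefix>1, <prefix>2, ...

# split_tests MATRIX MAXLIST PREFIX
# Writes at most MAXLIST test lines per subset file and prints the
# number of subset files created.
split_tests() (
  matrix=$1; maxlist=$2; prefix=$3
  grep 'run_test' "$matrix" | awk -v max="$maxlist" -v prefix="$prefix" '
    {
      # Start a new subset file every "max" test lines.
      if ((NR - 1) % max == 0) { if (out) close(out); n++; out = prefix n }
      print > out
    }
    END { print n + 0 }'
)
```

Calling `split_tests matrix 100 matrix` from the directory holding the matrix file would, under these assumptions, write matrix1, matrix2, ... with up to 100 test lines each. The header and footer lines that a real subset needs are left out of this sketch.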

JessicaMeixner-NOAA · 2021-03-09T13:58:08Z

@aliabdolali thanks for providing an engineering solution. I'd be happy to check this out. Can you provide usage instructions?

aliabdolali · 2021-03-09T14:16:06Z

> @aliabdolali thanks for providing an engineering solution. I'd be happy to check this out. Can you provide usage instructions?

In line 37, define maxlist. The default is 100, which means 7 subsets for the whole ~700 tests.
Once you have generated the matrix, execute this script; it splits the matrix and submits the jobs.
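
As a worked check of the numbers above, the subset count is a ceiling division of the test count by maxlist. The helper name `num_subsets` is hypothetical and only illustrates the arithmetic.

```shell
#!/bin/sh
# Sketch only: number of subsets needed for NTESTS tests with at most
# MAXLIST tests per subset (ceiling division in integer arithmetic).
# num_subsets NTESTS MAXLIST
num_subsets() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

num_subsets 700 100   # prints 7
num_subsets 701 100   # prints 8
```

A smaller maxlist gives more, shorter subsets; one extra test beyond a multiple of maxlist adds a whole subset.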

JessicaMeixner-NOAA · 2021-03-09T14:25:28Z

Thanks @aliabdolali, and sorry: you originally included the instructions and I missed them.

JessicaMeixner-NOAA · 2021-03-09T14:26:25Z

regtests/bin/matrix_divider.sh

+  echo "  echo '     *  end of WAVEWATCH III matrix$i of regression tests         *'"   >> matrix$i
+  echo "  echo '     **************************************************************'"   >> matrix$i
+  echo "  echo ' '"                                                                     >> matrix$i
+sbatch matrix$i


I would remove this, as it is machine dependent: the script could then not be used by others who are not using slurm, or if we port to another machine that does not use sbatch.

Sure, I'll remove it.
One other thing that I'll do for ourselves is add another script that clones the FB and Dev, downloads the tar from ftp, prepares the matrix, divides it and submits. So I will include this in that.
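
One way to keep submission optional and scheduler-neutral, sketched here purely as an illustration: the variable `SUBMIT_CMD` and the function `submit_subsets` are hypothetical and are not part of the PR. The user sets the command (for example `sbatch` on a slurm machine), and nothing is submitted when it is left empty.

```shell
#!/bin/sh
# Sketch only: submit subset files with a user-chosen command.
# SUBMIT_CMD is a hypothetical variable, e.g. "sbatch" or "qsub";
# when it is empty or unset the files are only listed.
# submit_subsets FILE...
submit_subsets() {
  for f in "$@"; do
    if [ -n "${SUBMIT_CMD:-}" ]; then
      # Deliberately unquoted so the command may carry its own options.
      $SUBMIT_CMD "$f"
    else
      echo "generated $f (not submitted)"
    fi
  done
}
```

With this shape the divider itself contains no scheduler call, so it stays usable on machines without sbatch.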

JessicaMeixner-NOAA · 2021-03-09T14:27:10Z

regtests/bin/matrix_divider.sh

+# --------------------------------------------------------------------------- #
+# 1.  clean up and definitions                                                #
+# --------------------------------------------------------------------------- #
+rm before


Perhaps also add this to the end for cleaning up? Do we want to add these files to .gitignore?

Sure,
Yes, we should add them to .gitignore

JessicaMeixner-NOAA · 2021-03-09T14:33:52Z

This is a great engineering fix and it's really nice that it's automated. We should see if this also works for others. I like the idea of committing this first, which would test that this split produces the same as the non-split matrix, but I still think changing from 24->8 is needed.

Something that I don't like about this is that it copies the model folder, so you lose git capabilities because these files aren't then tracked. That aspect of this is not my favorite, but without a major build change this definitely simplifies the running of the regression tests in multiple parts and that is really nice. I also like the fact that all the regression tests are already in the regtests folder. When I do this manually I usually just use multiple clones.

A possible future improvement would be to look at how we are splitting them instead of just equally into parts. For example, serial jobs could be separated and appropriate resources could be requested. Or the split could take into account how long each test takes.

aliabdolali · 2021-03-09T14:38:31Z

> This is a great engineering fix and it's really nice that it's automated. We should see if this also works for others. I like the idea of committing this first, which would test that this split produces the same as the non-split matrix, but I still think changing from 24->8 is needed.
>
> Something that I don't like about this is that it copies the model folder which then you lose git capabilities because these files aren't then tracked. That aspect of this is not my favorite, but without a major build change this definitely simplifies the running of the regression tests in multiple parts and that is really nice. I also like the fact that all the regression tests are already in the regtests folder. When I do this manually I usually just use multiple clones.
>
> A possible future improvement would be to look at how we are splitting them instead of just equally into parts. For example, serial jobs could be separated and appropriate resources could be requested. Or other splitting for knowing which tests take what amount of time.

Later, I will add something to check whether the job is done, so that all the extra model? directories can be deleted. I'll modify it now to reflect your suggestions and ask you to review; after the merge, we will ask others to check at their ends.
Your last suggestion is something I thought about, but it requires further work. I am working on it ...

JessicaMeixner-NOAA · 2021-03-09T14:39:57Z

There's no need to merge before others check, they can check before it is merged.

aliabdolali · 2021-03-09T14:56:00Z

@JessicaMeixner-NOAA Do you want me to change 24 to 8 in this FB?

JessicaMeixner-NOAA · 2021-03-09T14:59:45Z

No, because the regression test results will change and this should be a clean comparison for this commit.

aliabdolali · 2021-03-09T20:26:35Z

@JessicaMeixner-NOAA @ukmo-ccbunney @thesser1 @mickaelaccensi
I made some further progress.
I added another script that separates serial and parallel tests, so we can define the number of tests in each subset for the parallel (maxlist1) and serial (maxlist2) matrices.
I also made it compatible with the ifremer, ukmet and erdc infrastructures, but it should be tested on these platforms to make sure it works, and then we need to optimize maxlist1 and maxlist2.
Please see
regtests/bin/matrix_divider_p.sh
We can also allocate one core for serial jobs.
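
The serial/parallel separation described above could look roughly like the sketch below. It is not the logic of matrix_divider_p.sh: it assumes, for illustration only, that a parallel test line can be recognized by a ` -p ` option on its `run_test` line. The function name `classify_tests` is hypothetical.

```shell
#!/bin/sh
# Sketch only: separate parallel from serial test lines.
# Assumption (not confirmed by the PR): a test line is parallel when it
# passes a " -p " option to run_test; every other test line is serial.
# classify_tests MATRIX PARALLEL_OUT SERIAL_OUT
classify_tests() {
  # "|| true" keeps the function usable when one class is empty,
  # since grep exits non-zero if nothing matches.
  grep 'run_test' "$1" | grep -e ' -p ' > "$2" || true
  grep 'run_test' "$1" | grep -v -e ' -p ' > "$3" || true
}
```

Each output list can then be divided with its own limit, matching the maxlist1 (parallel) and maxlist2 (serial) settings mentioned above.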

ukmo-ccbunney · 2021-03-11T10:15:51Z

Hi @aliabdolali
This looks like a good idea. I will give it a test.

I have one small issue when running the script though - the path to ../model is not correct if I run the script in the regtests/bin directory. Where do you intend the script to be run? I assumed it would be run in matrix/bin as that is where the matrix files are...

To get it to run I had to copy my matrix file from regtests/bin to regtests.

Perhaps you always have your matrix files in the regtests dir?

ukmo-ccbunney · 2021-03-11T10:20:02Z

> Something that I don't like about this is that it copies the model folder which then you lose git capabilities because these files aren't then tracked. That aspect of this is not my favorite, but without a major build change this definitely simplifies the running of the regression tests in multiple parts and that is really nice. I also like the fact that all the regression tests are already in the regtests folder. When I do this manually I usually just use multiple clones.

@JessicaMeixner-NOAA - something that I have been working on (but is not working properly yet) is having a "build" directory defined in WW3 that contains the obj, mod and scratch directories. That way, each invocation of w3_make can specify its own unique build directory, if required. It's a work in progress at the moment as I need to update the make_makefile.sh part to ensure the paths are set correctly in there. I will also need to figure out a way of handling changes to the switch file.

Perhaps this might tie in with Ali's changes at a later date (to avoid copying the whole model directory?)

ukmo-ccbunney

Hi @aliabdolali - I hope you don't mind me hijacking the review.
I've spotted a few "rm" commands that need a "-r" flag to remove the model* directories.

ukmo-ccbunney · 2021-03-11T13:46:21Z

regtests/bin/matrix_divider_p.sh

+  echo "  echo '     *  end of WAVEWATCH III matrix$count of regression tests     *'"   >> matrix$count
+  echo "  echo '     **************************************************************'"   >> matrix$count
+  echo "  echo ' '"                                                                     >> matrix$count
+  echo "rm ../model$count"                                                              >> matrix$count


needs to be "rm -r" to remove directory
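
A small self-contained demonstration of the point, using a throwaway directory. The names `workdir` and `model1` are made up for the example: plain `rm` refuses to remove a directory, while `rm -r` removes it together with its contents.

```shell
#!/bin/sh
# Sketch only: plain rm versus rm -r on a directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/model1/obj"

# Plain rm does not remove directories and exits non-zero.
if rm "$workdir/model1" 2>/dev/null; then
  plain_rm=removed
else
  plain_rm=refused
fi

# rm -r removes the directory and everything below it.
rm -r "$workdir/model1"
if [ -d "$workdir/model1" ]; then
  recursive_rm=still_there
else
  recursive_rm=removed
fi

echo "plain rm: $plain_rm, rm -r: $recursive_rm"
rmdir "$workdir"
```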

aliabdolali · 2021-03-11T14:59:52Z

> Hi @aliabdolali
> This looks like a good idea. I will give it a test.
>
> I have one small issue when running the script though - the path to ../model is not correct if I run the script in the regtests/bin directory. Where do you intend the script to be run? I assumed it would be run in matrix/bin as that is where the matrix files are...
>
> To get it to run I had to copy my matrix file from regtests/bin to regtests.
>
> Perhaps you always have your matrix files in the regtests dir?

@ukmo-ccbunney Where do you run matrix_ukmet? I assume you do it in regtests. Then, do the same for matrix_divider_p.sh as matrix is located there.
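
A guard along these lines could make the expected run location explicit. This is a sketch under the layout discussed above (a matrix file in the run directory and the model directory reachable as ../model), and the function name `check_run_dir` is hypothetical.

```shell
#!/bin/sh
# Sketch only: check that a directory looks like the place the divider
# expects to run from, i.e. it holds a matrix file and ../model exists.
# check_run_dir DIR
check_run_dir() {
  if [ ! -f "$1/matrix" ]; then
    echo "no matrix file in $1" >&2
    return 1
  fi
  if [ ! -d "$1/../model" ]; then
    echo "no ../model directory relative to $1" >&2
    return 1
  fi
  return 0
}
```

Running `check_run_dir .` at the top of a script would stop early with a clear message when it is started from regtests/bin instead of regtests.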

aliabdolali · 2021-03-11T15:02:21Z

> Something that I don't like about this is that it copies the model folder which then you lose git capabilities because these files aren't then tracked. That aspect of this is not my favorite, but without a major build change this definitely simplifies the running of the regression tests in multiple parts and that is really nice. I also like the fact that all the regression tests are already in the regtests folder. When I do this manually I usually just use multiple clones.
>
> @JessicaMeixner-NOAA - something that I have been working on (but is not working properly yet) is having a "build" directory defined in WW3 that contains the obj, mod and scratch directories. That way, each invocation of w3_make can specify its own unique build directory, if required. It's a work in progress at the moment as I need to update the make_makefile.sh part to ensure the paths are set correctly in there. I will also need to figure out a way of handling changes to the switch file.
>
> Perhaps this might tie in with Ali's changes at a later date (to avoid copying the whole model directory?)

@ukmo-ccbunney this would be great. We can tackle it later.
BTW, I made a few other changes, please try it and if you like it, we can do the final retouch and merge it.
I would remove matrix_divider.sh and will keep matrix_divider_p.sh. It is well organized as it separates serial and parallel tests.

regtests/bin/matrix_divider_p.sh

JessicaMeixner-NOAA · 2021-03-23T13:11:27Z

@aliabdolali thanks for addressing all my comments!

aliabdolali · 2021-03-24T14:08:07Z

@JessicaMeixner-NOAA
I ran Develop without breaking it into subsets and I used this PR in 11 subsets. There is no missing case when we subset it (as expected)
Here are the outcomes:
MatrixDiff.zip
I also reran Develop twice to make sure the non-identical cases are not because of dividing the matrix.
It seems it is ready to be merged.
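
The "no missing case" check can be approximated with standard tools. The sketch below is not how the comparison in MatrixDiff.zip was produced: it assumes test lines contain `run_test` and simply lists test lines of the full matrix that appear in none of the subsets. The function name `missing_tests` is hypothetical.

```shell
#!/bin/sh
# Sketch only: list test lines present in the full matrix but absent
# from every subset. Assumes test lines contain "run_test".
# missing_tests FULL_MATRIX SUBSET...
missing_tests() (
  full=$1; shift
  tmp_full=$(mktemp); tmp_sub=$(mktemp)
  grep 'run_test' "$full" | sort -u > "$tmp_full"
  cat "$@" | grep 'run_test' | sort -u > "$tmp_sub"
  # comm -23 keeps only the lines unique to the first (full) list.
  comm -23 "$tmp_full" "$tmp_sub"
  rm -f "$tmp_full" "$tmp_sub"
)
```

Empty output means every test line of the full matrix is covered by some subset.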

JessicaMeixner-NOAA

Thanks @aliabdolali

UKMO-lsampson and others added 12 commits July 22, 2020 11:44

Added boundary checks to the SMC grid input files for ww3_grid,

98c5702

to ensure they comply with the limits of the nameslist.

Fb 360 calendar (#8)

71b9ba9

Changes to add support to 360-day and 365-day (no leap year) calendar - see ticket #209 * Additional CALTYPE namelist parameter in MISC section * New ww3_tc1 regtest.

RTD support for ww3_boun[dc] (#10)

6c42d17

* Updated ww3_bound and ww3_bounc to handle model grids formulated on a rotated pole. * Manual and nml/inp files to updated clarify that ww3_bound/ww3_bounc only accept input spectra formulated on a standard pole grid.

Fb coupling time (#9)

5c361aa

Updates to allow a coupling time step that is different from the model time step. * Includes new regtest (in ww3_tp2.14) for non-default oasis time step. * ww3_tp2.14 regtest added to matrix.base.

Merge remote-tracking branch 'upstream/staging' into develop

cc34122

bug fix for ukmet development

33a7af3

Merge remote-tracking branch 'upstream/develop' into develop

68c3c96

Merge remote-tracking branch 'upstream/develop' into develop

663d983

Merge remote-tracking branch 'upstream/develop' into develop

d4887f5

Merge remote-tracking branch 'upstream/develop' into develop

dc0cea9

add the matrix subsetter

832a6ea

clean-up

115ceb1

JessicaMeixner-NOAA reviewed Mar 9, 2021

clean up

4427883

add another script which separate serial and parallel jobs and divide…

8b498e8

… them

modify the script to remove ../model? after test completion.

e1cb650

ukmo-ccbunney suggested changes Mar 11, 2021

bug fixes and adding ww3_tp2.17 to list_heavy

88ee4f0

aliabdolali added the enhancement New feature or request label Mar 18, 2021

JessicaMeixner-NOAA reviewed Mar 19, 2021

regtests/bin/matrix_divider_p.sh (outdated)

Ali and others added 5 commits March 19, 2021 20:36

add if statement to remove matrix? and model?

3c74387

Update matrix_divider_p.sh

073e47d

Merge remote-tracking branch 'upstream/develop' into fb_matrix_divider

b07cfbb

Merge remote-tracking branch 'upstream/develop' into fb_matrix_divider

cf02516

Merge remote-tracking branch 'upstream/develop' into fb_matrix_divider

6578878

JessicaMeixner-NOAA reviewed Mar 22, 2021

regtests/bin/matrix_divider_p.sh (outdated)

Ali added 4 commits March 22, 2021 16:25

Merge remote-tracking branch 'upstream/develop' into fb_matrix_divider

799a8c4

put if check for ../model? inside matrix? loop

5c5bbff

fix the bug for sed for model?

62b756b

final fix for extra model? removal

8d7107a

aliabdolali closed this Mar 22, 2021

aliabdolali deleted the fb_matrix_divider branch March 22, 2021 23:27

aliabdolali reopened this Mar 22, 2021

add 2.21 to the list_heavy

2e5216e

JessicaMeixner-NOAA approved these changes Mar 24, 2021

aliabdolali merged commit 78b0148 into NOAA-EMC:develop Mar 24, 2021
