Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fb matrix divider #328

Merged
merged 26 commits into from
Mar 24, 2021
Merged

Fb matrix divider #328

merged 26 commits into from
Mar 24, 2021

Conversation

aliabdolali
Copy link
Contributor

@aliabdolali aliabdolali commented Mar 9, 2021

Hi @thesser1 @ukmo-ccbunney @mickaelaccensi
Following our discussion on the slow speed of our regression suite, I modified the script I used to divide the main matrix into a given number of subsets (it is defined by defining the number of tests in each subset), and tested it with the most recent code. The entire regtest took 3-4 hr in 7 subsets (each 100 test) on 24 cores. It would definitely faster if we change n=24 to n=8 cores.
There is still room for improvement, but I wanted to provide an engineering solution without touching any part of the code. In this way, we just duplicate the model directory and let each individual matrix does its simulation separately from the same parent and store all the solution in one place.

The extra step after matrix file generation is defining the max number of tests in each subset and executing ./bin/matrix_divider.sh
@JessicaMeixner-NOAA, what do you think?

@thesser1 @ukmo-ccbunney @mickaelaccensi could you check it at your end and provide suggestions?
One thing that I'll add in the future is dividing the parallel vs serial tests, so we can optimize the number of tests in each subset better. It will be an easy implementation.

UKMO-lsampson and others added 12 commits July 22, 2020 11:44
to ensure they comply with the limits of the nameslist.
Changes to add support to 360-day and 365-day (no leap year) calendar - see ticket #209
  * Additional CALTYPE namelist parameter in MISC section
  * New ww3_tc1 regtest.
* Updated ww3_bound and ww3_bounc to handle model grids formulated on a rotated pole.
* Manual and nml/inp files to updated clarify that ww3_bound/ww3_bounc only accept input spectra formulated on a standard pole grid.
Updates to allow a coupling time step that is different from the model time step. 
* Includes new regtest (in ww3_tp2.14) for non-default oasis time step.
* ww3_tp2.14 regtest added to matrix.base.
@JessicaMeixner-NOAA
Copy link
Collaborator

@aliabdolali thanks for providing an engineering solution. I'd be happy to check this out. Can you provide usage instructions?

@aliabdolali
Copy link
Contributor Author

@aliabdolali thanks for providing an engineering solution. I'd be happy to check this out. Can you provide usage instructions?

In line 37, define maxlist. The default is 100 which means 7 seubsets for the whole ~700 tests.
Once you generated the matrix, execute this script and it splits the matrix, and submit the jobs.

@JessicaMeixner-NOAA
Copy link
Collaborator

Thanks @aliabdolali and sorry you originally included the instructions and I missed it.

echo " echo ' * end of WAVEWATCH III matrix$i of regression tests *'" >> matrix$i
echo " echo ' **************************************************************'" >> matrix$i
echo " echo ' '" >> matrix$i
sbatch matrix$i
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove this as this would be machine dependent and this could then not be used by others who are not using slurm or if we port to another machine that does not use sbatch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll remove it.
Once other thing that I'll do for ourself is adding another script which does cloning the FB and Dev, download the tar from ftp, prepare matrix, divide it and submit. So I will include this to that.

# --------------------------------------------------------------------------- #
# 1. clean up and definitions #
# --------------------------------------------------------------------------- #
rm before
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps also add this to the end for cleaning up? Do we want to add these files to .gitignore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure,
Yes, we should add them to .gitignore

@JessicaMeixner-NOAA
Copy link
Collaborator

This is a great engineering fix and it's really nice that it's automated. We should see if this also works for others. I like the idea of committing this first, which would test that this split produces the same as the non-split matrix, but I still think changing from 24->8 is needed.

Something that I don't like about this is that it copies the model folder which then you lose git capabilities because these files aren't then tracked. That aspect of this is not my favorite, but without a major build change this definitely simplifies the running of the regression tests in multiple parts and that is really nice. I also like the fact that all the regression tests are already in the regrets folder. When I do this manually I usually just use multiple clones.

A possible future improvement would be to look at how we are splitting them instead of just equally into parts. For example, serial jobs could be separated and appropriate resources could be requested. Or other splitting for knowing which tests take what amount of time.

@aliabdolali
Copy link
Contributor Author

This is a great engineering fix and it's really nice that it's automated. We should see if this also works for others. I like the idea of committing this first, which would test that this split produces the same as the non-split matrix, but I still think changing from 24->8 is needed.

Something that I don't like about this is that it copies the model folder which then you lose git capabilities because these files aren't then tracked. That aspect of this is not my favorite, but without a major build change this definitely simplifies the running of the regression tests in multiple parts and that is really nice. I also like the fact that all the regression tests are already in the regrets folder. When I do this manually I usually just use multiple clones.

A possible future improvement would be to look at how we are splitting them instead of just equally into parts. For example, serial jobs could be separated and appropriate resources could be requested. Or other splitting for knowing which tests take what amount of time.

Later, I will add something to check if the job is done, then all extra model? should be deleted. I'll modify it now to reflect your suggestions and ask you to review, after the mege, we will ask others to check at their ends.
Your last suggestion was sth I thought about it, but that requires further work. I am working on it ...

@JessicaMeixner-NOAA
Copy link
Collaborator

There's no need to merge before others check, they can check before it is merged.

@aliabdolali
Copy link
Contributor Author

@JessicaMeixner-NOAA Do you want me to change 24 to 8 in this FB?

@JessicaMeixner-NOAA
Copy link
Collaborator

No, because the regression tests results will change and this should be a clean comparison for this commit.

@aliabdolali
Copy link
Contributor Author

aliabdolali commented Mar 9, 2021

@JessicaMeixner-NOAA @ukmo-ccbunney @thesser1 @mickaelaccensi
I made some further progress
I added another script that separates serial and parallel tests, then we can define the number of tests in each subset for parallel (maxlist1) and serial(maxlist2) matrix.
I also made it compatible with ifremer, ukmet and erdc infrastructures, but it should be tested on these platforms to make sure it is working and then we need to optimize maxlist1 and maxlist2.
please see
regtests/bin/matrix_divider_p.sh
We also can allocate one core for serial jobs.

@ukmo-ccbunney
Copy link
Collaborator

Hi @aliabdolali
This looks like a good idea. I will give it a test.

I have one small issue when running the script though - the path to ../model is not correct if I run the script in the regtests/bin directory. Where do you intend the script to be run? I assumed it would be run in matrix/bin as that is where the matrix files are...

To get it to run I had to copy my matrix file from regtests/bin to regtests.

Perhaps you always have your matrix files in the regtests dir?

@ukmo-ccbunney
Copy link
Collaborator

Something that I don't like about this is that it copies the model folder which then you lose git capabilities because these files aren't then tracked. That aspect of this is not my favorite, but without a major build change this definitely simplifies the running of the regression tests in multiple parts and that is really nice. I also like the fact that all the regression tests are already in the regrets folder. When I do this manually I usually just use multiple clones.

@JessicaMeixner-NOAA - something that I have been working on (but is not working properly yet) is having a "build" directory defined in WW3 that contains the obj, mod and scratch directories. That way, each invocation of w3_make can specify its own unique build directory, if required. It's a work in progress at the moment as I need to update the make_makefile.sh part to ensure the paths are set correctly in there. I will also need to figure out a way of handling changes to the switch file.

Perhaps this might tie in with Ali's changes at a later date (to avoid copying the whole model directory?)

Copy link
Collaborator

@ukmo-ccbunney ukmo-ccbunney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HI @aliabdolali - I hope you don't mind me hijacking the review.
I've spotted a few "rm" commands that need a "-r" flag to remove the model* directories.

echo " echo ' * end of WAVEWATCH III matrix$count of regression tests *'" >> matrix$count
echo " echo ' **************************************************************'" >> matrix$count
echo " echo ' '" >> matrix$count
echo "rm ../model$count" >> matrix$count
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs to be "rm -r" to remove directory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

echo " echo ' * end of WAVEWATCH III matrix$count of regression tests *'" >> matrix$count
echo " echo ' **************************************************************'" >> matrix$count
echo " echo ' '" >> matrix$count
echo "rm ../model$count" >> matrix$count
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs to be "rm -r" to remove directory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

echo " echo ' * end of WAVEWATCH III matrix$count of regression tests *'" >> matrix$count
echo " echo ' **************************************************************'" >> matrix$count
echo " echo ' '" >> matrix$count
echo "rm ../model$count" >> matrix$count
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs to be "rm -r" to remove directory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@aliabdolali
Copy link
Contributor Author

Hi @aliabdolali
This looks like a good idea. I will give it a test.

I have one small issue when running the script though - the path to ../model is not correct if I run the script in the regtests/bin directory. Where do you intend the script to be run? I assumed it would be run in matrix/bin as that is where the matrix files are...

To get it to run I had to copy my matrix file from regtests/bin to regtests.

Perhaps you always have your matrix files in the regtests dir?

Hi @aliabdolali
This looks like a good idea. I will give it a test.

I have one small issue when running the script though - the path to ../model is not correct if I run the script in the regtests/bin directory. Where do you intend the script to be run? I assumed it would be run in matrix/bin as that is where the matrix files are...

To get it to run I had to copy my matrix file from regtests/bin to regtests.

Perhaps you always have your matrix files in the regtests dir?

@ukmo-ccbunney Where do you run matrix_ukmet? I assume you do it in regtests. Then, do the same for matrix_divider_p.sh as matrix is located there.

@aliabdolali
Copy link
Contributor Author

Something that I don't like about this is that it copies the model folder which then you lose git capabilities because these files aren't then tracked. That aspect of this is not my favorite, but without a major build change this definitely simplifies the running of the regression tests in multiple parts and that is really nice. I also like the fact that all the regression tests are already in the regrets folder. When I do this manually I usually just use multiple clones.

@JessicaMeixner-NOAA - something that I have been working on (but is not working properly yet) is having a "build" directory defined in WW3 that contains the obj, mod and scratch directories. That way, each invocation of w3_make can specify its own unique build directory, if required. It's a work in progress at the moment as I need to update the make_makefile.sh part to ensure the paths are set correctly in there. I will also need to figure out a way of handling changes to the switch file.

Perhaps this might tie in with Ali's changes at a later date (to avoid copying the whole model directory?)

@ukmo-ccbunney this would be great. We can tackle it later.
BTW, I made a few other changes, please try it and if you like it, we can do the final retouch and merge it.
I would remove matrix_divider.sh and will keep matrix_divider_p.sh. It is well organized as it separates serial and parallel tests.

@aliabdolali aliabdolali added the enhancement New feature or request label Mar 18, 2021
@aliabdolali aliabdolali deleted the fb_matrix_divider branch March 22, 2021 23:27
@aliabdolali aliabdolali reopened this Mar 22, 2021
@JessicaMeixner-NOAA
Copy link
Collaborator

@aliabdolali thanks for addressing all my comments!

@aliabdolali
Copy link
Contributor Author

@JessicaMeixner-NOAA
I ran Develop without breaking it into subsets and I used this PR in 11 subsets. There is no missing case when we subset it (as expected)
Here are the outcomes:
MatrixDiff.zip
I also reran Develop twice to make sure the non-identical cases are not because of dividing the matrix.
It seems it is ready to be merged.

Copy link
Collaborator

@JessicaMeixner-NOAA JessicaMeixner-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aliabdolali

@aliabdolali aliabdolali merged commit 78b0148 into NOAA-EMC:develop Mar 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants