ww3_ufs1.3 reproducibility issue #452

Open
JessicaMeixner-NOAA opened this issue Aug 25, 2021 · 13 comments
Labels
bug Something isn't working

Comments

@JessicaMeixner-NOAA
Collaborator

Describe the bug
When the ww3_ufs1.3 test is run twice with the develop branch, the 20190830.030000.restart.glo_15m files from the two runs are not identical.

To Reproduce
Clone two copies of the develop branch (8/23/21) and run the regression test in each:
./bin/run_test -w work_a -m grdset_a -f -p mpirun -n 140 -t 4 -o all ../model ww3_ufs1.3
and then use matrix.comp to compare the output.
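
For a quick check outside matrix.comp, the two work directories can also be compared byte for byte. The following is a minimal Python sketch, assuming both runs wrote their output into directories with the same layout. The function name `differing_files` is hypothetical and not part of WW3.

```python
import filecmp
from pathlib import Path


def differing_files(dir_a, dir_b):
    """Return sorted relative paths of files that differ or exist on one side only."""
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    files_a = {p.relative_to(dir_a) for p in dir_a.rglob("*") if p.is_file()}
    files_b = {p.relative_to(dir_b) for p in dir_b.rglob("*") if p.is_file()}
    # Files present in only one directory count as differences.
    different = files_a ^ files_b
    for rel in files_a & files_b:
        # shallow=False forces a comparison of the file contents,
        # not just of size and modification time.
        if not filecmp.cmp(dir_a / rel, dir_b / rel, shallow=False):
            different.add(rel)
    return sorted(str(p) for p in different)
```

Called on the two `work_a` directories, it returns an empty list when every file matches. Files that embed timestamps or absolute paths (logs, for instance) may differ legitimately, so the restart files are the entries to look for.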

Expected behavior
All files should be the same.

FYI @ricampos @aliabdolali

JessicaMeixner-NOAA added the bug label Aug 25, 2021
@JessicaMeixner-NOAA
Collaborator Author

Should this be solved before @ricampos starts his tests? @aliabdolali or @ricampos have either of you looked at this yet?

@ricampos
Collaborator

This sounds more critical/important than the other issue about netcdf (where in the worst scenario we could just pick netcdf/4.7.2), since it is generating different initial conditions. Let me try to check here...

@JessicaMeixner-NOAA
Collaborator Author

@aliabdolali?

@aliabdolali
Contributor

I agree that we should fix both issues, but the partition issue seems more important to me than restart reproducibility, since Ricardo's work does not need to meet operational requirements.
Given that, and given our list of priorities, restart reproducibility can be tackled once restarts are written in NetCDF.

@JessicaMeixner-NOAA
Collaborator Author

JessicaMeixner-NOAA commented Nov 16, 2021

He's running cycled tests which rely on restart files... so actually I would say restarts are important.

@ricampos
Collaborator

Hi. I ran two independent tests of the same regtest (ww3_ufs1.3) in two separate WW3 installations, using the same number of cores (120), and the restart files are slightly different:
/work/noaa/marine/ricardo.campos/models/testbugrestart/01/WW3/regtests/ww3_ufs1.3/work_a/teste1mpi120c
/work/noaa/marine/ricardo.campos/models/testbugrestart/02/WW3/regtests/ww3_ufs1.3/work_a/teste1mpi120c
vbindiff is a useful command to visualize the differences.
I'm running another test in serial.

@JessicaMeixner-NOAA
Collaborator Author

@ricampos using vbindiff, can you see which variables are different?

@ricampos
Collaborator

Not sure, it is not clear to me.

[Screenshot: vbindiff]

@MatthewMasarik-NOAA
Collaborator

Hi @ricampos, if this is the only difference, I believe you're in the clear. I think this just shows that the two ASCII header strings are different. As a check, the first red hex number in the top file, 6C, is the ASCII code for 'l', and the one in the bottom file, 45, is 'E'. (PS: I didn't know about vbindiff, very cool!)
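
The hex-to-ASCII check can be reproduced in a couple of lines. This is a small Python sketch using only the standard library. The helper name `hex_to_ascii` is illustrative.

```python
def hex_to_ascii(hex_bytes):
    """Decode hex digits as shown by a hex viewer (e.g. "6C") into ASCII text."""
    return bytes.fromhex(hex_bytes).decode("ascii")


print(hex_to_ascii("6C"))  # l
print(hex_to_ascii("45"))  # E
```

Passing a longer run of hex digits copied from the viewer decodes the whole header string at once, as long as every byte in it is valid ASCII.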

@aliabdolali
Contributor

@ricampos Is vbindiff available on Orion, or do you need to compile it? This is very useful, thanks.

As another way to debug, we should look at the differences between ww3_ufs1.3 (not identical) and a similar case, ww3_ufs1.2 (b4b identical). Both use the same switch on three grids.
ufs1.3 has a single ice field at the initial time, with hourly wind.
ufs1.2 has daily ice, hourly wind, and three-hourly currents.

We should compare the ww3_grid_glo_15m.inp with ww3_grid_gnh_10m.inp to see the differences.
Also, we can run ww3_ufs1.2 with the forcing of ww3_ufs1.3 to make sure it leads to the same problem.
I'll conduct some experiments this week.

@ricampos
Collaborator

Thank you for letting us know, Matthew. I use vbindiff on my personal laptop only, not on Orion yet. I sent an email to Helpdesk to see if it would be possible to use it there too.
Ali, I can also check the Hs results of the nowcast (right after reading the restart) to confirm they are really the same. Something like np.sum(hs1-hs2) giving zero would suggest everything is OK.
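
One caveat on the sum-based check: positive and negative differences can cancel, so a zero sum alone does not guarantee that the fields match. Here is a sketch of a stricter comparison, assuming `hs1` and `hs2` are arrays of Hs on the same grid that have already been read from the two runs' output. The function name `compare_fields` is hypothetical.

```python
import numpy as np


def compare_fields(hs1, hs2):
    """Summarize the difference between two arrays of the same shape."""
    hs1 = np.asarray(hs1, dtype=float)
    hs2 = np.asarray(hs2, dtype=float)
    diff = hs1 - hs2
    # Points that are NaN in both fields (e.g. masked land) are treated as equal.
    both_nan = np.isnan(hs1) & np.isnan(hs2)
    return {
        "identical": bool(np.array_equal(hs1, hs2, equal_nan=True)),
        "sum_diff": float(np.nansum(diff)),
        "max_abs_diff": float(np.nanmax(np.abs(diff))),
        "n_different": int(np.count_nonzero((hs1 != hs2) & ~both_nan)),
    }
```

A bit-for-bit match requires `identical` to be true. `max_abs_diff` and `n_different` give a sense of how large and how widespread any mismatch is.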

@ricampos
Collaborator

Another update: running in serial produces different restart files too:
[Screenshot: vbindiff_serial]

@MatthewMasarik-NOAA
Collaborator

I looked at this briefly last week and have a small comment to add. Looking at the hex dumps of the aoc grid restarts: 1) there are multiple regions of differing values, and 2) they occur near the ends of the files (the screenshot shows the first difference at the 98% point of the restart files).
[Screenshot: ww3_rst_bindiff]
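
The same kind of summary (where the differing regions sit and how far into the file the first one starts) can be produced without a hex viewer. This is a rough Python sketch, assuming the restart files are small enough to read into memory. The function name `diff_regions` is hypothetical, and the plain loop will be slow on very large files.

```python
def diff_regions(path_a, path_b):
    """Return (size, regions) for two binary files.

    size is the length of the longer file; regions is a list of
    (start, end) byte offsets, end exclusive, where the files differ.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        data_a, data_b = fa.read(), fb.read()
    common = min(len(data_a), len(data_b))
    regions = []
    start = None
    for i in range(common):
        if data_a[i] != data_b[i]:
            if start is None:
                start = i  # a new differing region begins here
        elif start is not None:
            regions.append((start, i))  # the region ended at the previous byte
            start = None
    if start is not None:
        regions.append((start, common))
    # Extra bytes in the longer file are reported as one more region.
    if len(data_a) != len(data_b):
        regions.append((common, max(len(data_a), len(data_b))))
    return max(len(data_a), len(data_b)), regions
```

With `size, regions = diff_regions(file_a, file_b)`, the position of the first difference as a percentage of the file is `100 * regions[0][0] / size` whenever `regions` is non-empty.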


4 participants