Fixing openmp issues with FMS2020 cpu affinity #1148

nikizadehgfdl
Contributor

  • This update fixes the mom6-solo test crashes under openmp, which showed up as:
    FATAL: input domain does not have an io_domain.

  • With this update, openmp runs with 1 and 2 threads give the same answers as non-openmp
    runs for all 3 compilers.

  • FMS2020 has newer cpu affinity work. This is mostly to fix the
    issues with thread placing and hyperthreading under slurm on gaea,
    but it also works on Orion.

  • The new affinity module simplifies the thread-placing calls in the
    component models.

  • NOTE: I don't remember why we put the thread placing calls in MOM_domains.F90.
    They look unnecessary and the whole #ifndef NOT_SET_AFFINITY block
    can probably be removed. ocean_nthreads is set in either the coupler or solo_driver.
    The only piece I am not sure about is how to set hyperthreading to false for Ocean and true for ATM in coupled runs.

- FMS2020 has newer cpu affinity work. This is mostly to fix the
  issues with thread placing and hyperthreading under slurm on gaea,
  but it also works on Orion.
- The new affinity module simplifies the thread-placing calls in the
  component models.
- The names of some functions have changed, which is the reason for crashes
  like:
      FATAL: input domain does not have an io_domain.
- This update fixes those issues.
- openmp runs with 1 and 2 threads give the same answers as non-openmp
- NOTE: I don't remember why we put the thread placing calls in MOM_domains.F90.
        They look unnecessary and the whole #ifndef NOT_SET_AFFINITY block
        can probably be removed. ocean_nthreads is either set in coupler
        or solo_driver.
@codecov-commenter

codecov-commenter commented Jun 28, 2020

Codecov Report

Merging #1148 into dev/gfdl will decrease coverage by 0.30%.
The diff coverage is 33.09%.

@@             Coverage Diff              @@
##           dev/gfdl    #1148      +/-   ##
============================================
- Coverage     46.08%   45.78%   -0.31%     
============================================
  Files           214      223       +9     
  Lines         69399    69835     +436     
============================================
- Hits          31984    31972      -12     
- Misses        37415    37863     +448     
| Impacted Files | Coverage Δ |
|---|---|
| ...g_src/external/GFDL_ocean_BGC/FMS_coupler_util.F90 | 0.00% <0.00%> (ø) |
| ...fig_src/external/GFDL_ocean_BGC/generic_tracer.F90 | 0.00% <0.00%> (ø) |
| ...c/external/GFDL_ocean_BGC/generic_tracer_utils.F90 | 0.00% <0.00%> (ø) |
| config_src/external/ODA_hooks/kdtree.f90 | 0.00% <0.00%> (ø) |
| config_src/external/ODA_hooks/ocean_da_core.F90 | 0.00% <0.00%> (ø) |
| config_src/external/ODA_hooks/ocean_da_types.F90 | 0.00% <0.00%> (ø) |
| config_src/external/ODA_hooks/write_ocean_obs.F90 | 0.00% <0.00%> (ø) |
| config_src/solo_driver/MESO_surface_forcing.F90 | 0.00% <0.00%> (ø) |
| config_src/solo_driver/MOM_driver.F90 | 68.72% <ø> (ø) |
| config_src/solo_driver/user_surface_forcing.F90 | 0.00% <0.00%> (ø) |
| ... and 116 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9aed664...4647870.

@marshallward
Collaborator

marshallward commented Jul 1, 2020

(Relaying what we discussed with Rusty)

It seems that the fms_affinity_* functions require explicit CPU affinities and will fail if more CPUs are available than requested.

For now, I think we may be able to resolve this by adding an environment variable to our single-thread OpenMP tests:

GOMP_CPU_AFFINITY="0" ../../../build/openmp/MOM6

I will make this update and then will re-run these tests to see if it resolves the problem.
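
As a rough sketch of how a test script might apply this, the snippet below builds the affinity string and prints the resulting command line instead of running it. The helper names `affinity_for_threads` and `mom6_command` are hypothetical, `OMP_NUM_THREADS` is the standard OpenMP variable and is not mentioned above, and extending the pinning from one thread to CPUs 0..N-1 is an assumption: only the single-thread case with `GOMP_CPU_AFFINITY="0"` has been discussed here.

```shell
#!/bin/sh
# Hypothetical helper: build a GOMP_CPU_AFFINITY value that pins
# N threads to CPUs 0..N-1 (assumed layout, not taken from the tests).
affinity_for_threads() {
    nthreads="$1"
    if [ "$nthreads" -le 1 ]; then
        echo "0"
    else
        echo "0-$((nthreads - 1))"
    fi
}

# Hypothetical helper: print the command line that would launch MOM6
# with an explicit thread count and CPU affinity. It only prints the
# command; it does not execute the model.
mom6_command() {
    nthreads="$1"
    printf 'OMP_NUM_THREADS=%s GOMP_CPU_AFFINITY="%s" ../../../build/openmp/MOM6\n' \
        "$nthreads" "$(affinity_for_threads "$nthreads")"
}

mom6_command 1
mom6_command 2
```

For one thread this reproduces the command above with `OMP_NUM_THREADS=1` prepended; whether the two-thread tests need the same treatment is untested.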

@marshallward
Collaborator

Using GOMP_CPU_AFFINITY seems to work in this case:

https://travis-ci.org/github/marshallward/MOM6/jobs/704030161

Not sure if this is really what we want or the way we expect it to be run, but it will keep the tests passing.

@adcroft adcroft self-assigned this Jul 7, 2020
Collaborator

@adcroft adcroft left a comment

Handling manually to fix white space issues.

@adcroft adcroft merged commit 10afb0b into mom-ocean:dev/gfdl Jul 7, 2020