Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coriolis: Improved coradcalc vectorization #1365

Merged
merged 2 commits into from
Apr 7, 2021

Conversation

marshallward
Copy link
Collaborator

This patch restructures the CorAdCalc function so that the loops are
more easily vectorized on a broader range of systems. The number of
memory access has also been slightly reduced.

We observed a 1.75x speedup on a modern consumer AMD processor (Ryzen 5
2600) and a 1.24x speedup on Gaea's Intel Xeons (E5-2697 v4).
Description

There are two major changes:

  • An if-block testing for Area_q was removed, and the h_neglect * Area_q term was replaced with a new vol_neglect term.

    This term is intended to prevent division by zero when the hArea_q is
    zero. Otherwise, it is meant to be below roundoff and have no impact
    on the calculation.

    Previously, a zero value of Area_q would force a division by zero.
    Using vol_neglect ensures that the denominator is always nonzero.

    The value is set to use H_subroundoff times an area of 0.1 mm2,
    suggested by Robert Hallberg as a hypothetical Kolmogorov scale.
    Numerical results are intended to be independent of this choice.

  • Two separated loops associated with the bounded Coriolis term were
    combined into a single loop, which reduced both the number of internal
    if-blocks and avoided redundant memory load/stores.

Other if-blocks inside of do-loops were moved outside of the loops.

I can provide two potential explanations for the difference in Intel and
AMD performance:

  • Masking instructions have a lower latency on Intel CPUs, which permit
    limited vectorization of if-blocks. Similar vectorization is not
    possible on AMD CPUs. So Intel is less likely to benefit from if-block
    re-ordering.

  • Our Intel nodes on Gaea have a lower RAM bandwidth, and see a smaller
    benefit from vectorization, which must required greater bandwidth.
    This speedup may be greater on a more modern Intel platform.

Although the code has been vectorized on both Intel and AMD platforms,
there are still many memory accesses per operation, which is limiting
performance.

The changes below are not expected to change any answers, and none were
detected. But since we are changing a core component, I'd suggest
reviewing this carefully.

Sample timings are provided below.

Runtime measurements

AMD Before:

(Ocean Coriolis & mom advection) 1.091571
(Ocean Coriolis & mom advection) 1.086183
(Ocean Coriolis & mom advection) 1.091197
(Ocean Coriolis & mom advection) 1.087709
(Ocean Coriolis & mom advection) 1.086990

AMD After:

(Ocean Coriolis & mom advection) 0.619346
(Ocean Coriolis & mom advection) 0.624106
(Ocean Coriolis & mom advection) 0.625438
(Ocean Coriolis & mom advection) 0.630169
(Ocean Coriolis & mom advection) 0.621736


Intel Before:

(Ocean Coriolis & mom advection) 0.981367
(Ocean Coriolis & mom advection) 0.982316
(Ocean Coriolis & mom advection) 0.986633
(Ocean Coriolis & mom advection) 0.981260
(Ocean Coriolis & mom advection) 0.982810

Intel After:

(Ocean Coriolis & mom advection) 0.788747
(Ocean Coriolis & mom advection) 0.797684
(Ocean Coriolis & mom advection) 0.786874
(Ocean Coriolis & mom advection) 0.792120
(Ocean Coriolis & mom advection) 0.795373

marshallward and others added 2 commits April 1, 2021 17:36
This patch restructures the CorAdCalc function so that the loops are
more easily vectorized on a broader range of systems. The number of
memory access has also been slightly reduced.

We observed a 1.75x speedup on a modern consumer AMD processor (Ryzen 5
2600) and a 1.24x speedup on Gaea's Intel Xeons (E5-2697 v4).
Description

There are two major changes:

- An if-block testing for `Area_q` was removed, and the `h_neglect *
  Area_q` term was replaced with a new `vol_neglect` term.

  This term is intended to prevent division by zero when the hArea_q is
  zero.  Otherwise, it is meant to be below roundoff and have no impact
  on the calculation.

  Previously, a zero value of Area_q would force a division by zero.
  Using vol_neglect ensures that the denominator is always nonzero.

  The value is set to use `H_subroundoff` times an area of 0.1 mm2,
  suggested by Robert Hallberg as a hypothetical Kolmogorov scale.
  Numerical results are intended to be independent of this choice.

- Two separated loops associated with the bounded Coriolis term were
  combined into a single loop, which reduced both the number of internal
  if-blocks and avoided redundant memory load/stores.

Other if-blocks inside of do-loops were moved outside of the loops.

I can provide two potential explanations for the difference in Intel and
AMD performance:

* Masking instructions have a lower latency on Intel CPUs, which permit
  limited vectorization of if-blocks. Similar vectorization is not
  possible on AMD CPUs. So Intel is less likely to benefit from if-block
  re-ordering.

* Our Intel nodes on Gaea have a lower RAM bandwidth, and see a smaller
  benefit from vectorization, which must required greater bandwidth.
  This speedup may be greater on a more modern Intel platform.

Although the code has been vectorized on both Intel and AMD platforms,
there are still many memory accesses per operation, which is limiting
performance.

The changes below are not expected to change any answers, and none were
detected. But since we are changing a core component, I'd suggest
reviewing this carefully.

Sample timings are provided below.

Runtime measurements
--------------------

AMD Before:

(Ocean Coriolis & mom advection)      1.091571
(Ocean Coriolis & mom advection)      1.086183
(Ocean Coriolis & mom advection)      1.091197
(Ocean Coriolis & mom advection)      1.087709
(Ocean Coriolis & mom advection)      1.086990

AMD After:

(Ocean Coriolis & mom advection)      0.619346
(Ocean Coriolis & mom advection)      0.624106
(Ocean Coriolis & mom advection)      0.625438
(Ocean Coriolis & mom advection)      0.630169
(Ocean Coriolis & mom advection)      0.621736

----

Intel Before:

(Ocean Coriolis & mom advection)      0.981367
(Ocean Coriolis & mom advection)      0.982316
(Ocean Coriolis & mom advection)      0.986633
(Ocean Coriolis & mom advection)      0.981260
(Ocean Coriolis & mom advection)      0.982810

Intel After:

(Ocean Coriolis & mom advection)      0.788747
(Ocean Coriolis & mom advection)      0.797684
(Ocean Coriolis & mom advection)      0.786874
(Ocean Coriolis & mom advection)      0.792120
(Ocean Coriolis & mom advection)      0.795373
@Hallberg-NOAA
Copy link
Collaborator

This PR has passed pipeline testing at https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/pipelines/12422.

Copy link
Collaborator

@Hallberg-NOAA Hallberg-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that these changes are algorithmically correct, and the speedup that is documented in the notes is impressive!

@Hallberg-NOAA Hallberg-NOAA merged commit 7ec08cd into mom-ocean:dev/gfdl Apr 7, 2021
@marshallward marshallward deleted the corad_vec_v2 branch May 7, 2021 02:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants