
Gpu test #3 (Open)

wants to merge 5 commits into base: dev/gfdl
Conversation

nikizadehgfdl (Owner)
No description provided.

- Tackling this per routine may not be a good idea (as in the OMP case);
  nevertheless, here is an attempt to do just that with ACC to see how the timings and answers change.
- Here are the results of this update on the gpubox lscgpu50-d (Tesla V100).
  Although the timings look promising, the answers might change by a lot.
  dsum1 and dsum2 are physical quantities (answers) that indicate how much the answers
  might change. It does not look to be just round-off!
  The timings show how much faster this particular calculation might get on GPUs.
  The speed-up kicks in at the second iteration because some of the data has already been
  moved to the device (via present_or_copy directives).

 DEV          dsum1                     dsum2              timings (sec)
 cpu          16862846.89627802         362957721989943.6  0.1062018871307373
 gpu          16862849.65544315         362957721989943.6  0.2714490890502930
 cpu          88276966.79257554         362957721989943.6  0.1057448387145996
 gpu          88276963.53479740         362957721989943.6  6.8866014480590820E-002
 cpu         -1223928.224251170         362957721989943.6  0.1059348583221436
 gpu         -1223928.228475637         362957721989943.6  6.8870782852172852E-002

- Note that the speed-up depends on how powerful the GPU device is; in this case it is an
  idle Tesla V100. In the case of my busy workstation NVIDIA GPU, there is actually a slowdown.
- Maybe a 1% speed-up for 1 MPI rank.
- The PGI profiler shows MOM_hor_visc.F90:horizontal_viscosity
  as one of the most heavily sampled routines in OM4, hence the choice to
  use OpenACC here.
- At this update there is almost no gain (nor loss) in timings
  for this module when using a single GPU in addition to a single CPU (mpirun -np 1).
  This shows that unless we can delegate more loops to the GPU and/or get rid of the
  "!$ACC update self" directive, there is no point in running with GPUs!
nikizadehgfdl pushed a commit that referenced this pull request Apr 30, 2021
MOM_domain_infra: Document FMS passthroughs
nikizadehgfdl pushed a commit that referenced this pull request Jan 7, 2022
* reads in porous topography parameters from CHANNEL_LIST_FILE

* new module to compute a curve fit for porous topography

* porous constraints used to modify continuity_PPM, CoriolisAdv, and Rayleigh bottom channel drag
nikizadehgfdl pushed a commit that referenced this pull request Jan 7, 2022
(+) porous topography implementation
nikizadehgfdl pushed a commit that referenced this pull request Feb 10, 2022
  Use the por_face_area[UV] in the effective thickness calculations in
zonal_face_thickness and merid_face_thickness, so that they are more consistent
with their use elsewhere in the code for the relative weights in calculating the
barotropic accelerations.  Because these por_face_area arrays are still 1 in all
test cases, the answers are unchanged in any test cases from before a few weeks
ago, but there could be answer changes in cases that are using the very recently
added capability (in PR #3) to set fractional face areas.  This change was
discussed with Sam Ditkovsky, and it was agreed that there is no reason to keep
the ability to recover the previous answers in any cases that use the recently
added partial face width option.

  This commit also expanded the comments describing the h_u and h_v arguments to
btcalc(), zonal_face_thickness(), and merid_face_thickness() routines, the
diag_h[uv] elements of the accel_diag_ptrs type and the h_u and h_v elements of
the BT_cont_type.

  All answers and output are bitwise identical in the MOM6-examples test suite
and TC tests, but answer changes are possible in cases using a very recently
added code option.