
Gpu test #3 (Open)

wants to merge 5 commits into base: dev/gfdl
Conversation

nikizadehgfdl (Owner)
No description provided.

- Tackling this per routine may not be a good idea (as in the OMP case);
  nevertheless, here is an attempt to do just that with ACC to see how the timings and answers change.
- Here are the results of this update on the gpubox lscgpu50-d (Tesla V100).
  Although the timings look promising, the answers might change by a lot.
  dsum1 and dsum2 are physical quantities (answers) that indicate how much the answers
  might change. It does not look to be just round-off!
  The timings show how much faster this particular calculation might get on GPUs.
  The speed-up kicks in at the second iteration because some of the data has already been
  moved to the device (via present_or_copy directives).

 DEV          dsum1                     dsum2              timings (sec)
 cpu          16862846.89627802         362957721989943.6  0.1062018871307373
 gpu          16862849.65544315         362957721989943.6  0.2714490890502930
 cpu          88276966.79257554         362957721989943.6  0.1057448387145996
 gpu          88276963.53479740         362957721989943.6  6.8866014480590820E-002
 cpu         -1223928.224251170         362957721989943.6  0.1059348583221436
 gpu         -1223928.228475637         362957721989943.6  6.8870782852172852E-002

- Note that the speed-up depends on how powerful the GPU device is; in this case it is an
  idle Tesla V100. In the case of my busy workstation NVIDIA GPU, there is actually a slowdown.
- Maybe a 1% speed-up for 1 MPI rank.
- The PGI profiler shows MOM_hor_visc.F90:horizontal_viscosity
  as one of the most heavily sampled routines in OM4, hence the choice to
  use OpenACC here.
- At this update there is almost no gain (nor loss) in timings
  for this module when using a single GPU in addition to a single CPU (mpirun -np 1).
  This shows that unless we can delegate more loops to the GPU and/or get rid of the
  "!$ACC update self" directive, there is no point in running with GPUs!
nikizadehgfdl pushed a commit that referenced this pull request Apr 30, 2021
MOM_domain_infra: Document FMS passthroughs
nikizadehgfdl pushed a commit that referenced this pull request Jan 7, 2022
* reads in porous topography parameters from CHANNEL_LIST_FILE

* new module to compute a curve fit for porous topography

* porous constraints used to modify continuity_PPM, CoriolisAdv, and Rayleigh bottom channel drag
nikizadehgfdl pushed a commit that referenced this pull request Jan 7, 2022
(+) porous topography implementation
nikizadehgfdl pushed a commit that referenced this pull request Feb 10, 2022
  Use the por_face_area[UV] in the effective thickness calculations in
zonal_face_thickness and merid_face_thickness, so that they are more consistent
with their use elsewhere in the code for the relative weights in calculating the
barotropic accelerations.  Because these por_face_area arrays are still 1 in all
test cases, the answers are unchanged in any test cases from before a few weeks
ago, but there could be answer changes in cases that are using the very recently
added capability (in PR #3) to set fractional face areas.  This change was
discussed with Sam Ditkovsky, and it was agreed that there is no reason to keep
the ability to recover the previous answers in any cases that use the recently
added partial face width option.

  This commit also expanded the comments describing the h_u and h_v arguments to
btcalc(), zonal_face_thickness(), and merid_face_thickness() routines, the
diag_h[uv] elements of the accel_diag_ptrs type and the h_u and h_v elements of
the BT_cont_type.

  All answers and output are bitwise identical in the MOM6-examples test suite
and TC tests, but answer changes are possible in cases using a very recently
added code option.