Bartek's BGQ Scalasca Runs
It is difficult to estimate the effect of not OpenMP-parallelizing a given function. This is especially true when trying to compare what happens on Intel machines with at most 4 threads and on BG/Q with 64 threads. For instance, simple functions like deriv_Sb or update_backward_gauge take a clear performance hit from OpenMP on Intel. It is unclear, however, what happens when a single thread suddenly has to compute deriv_Sb for a local volume sized for 64 threads! This is even more true for sw_all.
For this reason I will be doing a systematic set of runs of the various hmc sample files with ompnumthreads=64. I want to identify further avenues for optimization, especially in light of the amount of work required to get OpenMP parallelization working for functions with reductions and for those where threads could potentially write into the same memory location.
In each of these runs the global volume is 48x24^3, bg_size=32 and 4 trajectories are computed.
deriv_Sb
In this run it is striking that more than 20 percent of the total wallclock time is spent in deriv_Sb. This is because deriv_Sb is currently not OpenMP-parallelized, as the threads would write competitively into the same memory locations. More than 6% of thread idling is also due to deriv_Sb, and another 22% of thread idling comes from deriv_Sb in a different location.
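A minimal sketch of the kind of accumulation pattern that causes the conflict (hypothetical and heavily simplified, not the actual tmLQCD code): the contribution computed at site x is also added to the derivative belonging to a neighbouring site, which may be handled by a different thread.
/* Hypothetical illustration of the deriv_Sb write conflict: the backward
 * contribution updates the derivative of a neighbouring site's link,
 * which may be owned by another thread's chunk of the site loop. */
void deriv_accumulate_sketch(double deriv[][4], const int (*idn)[4], int volume) {
  #pragma omp parallel for
  for (int x = 0; x < volume; ++x) {
    for (int mu = 0; mu < 4; ++mu) {
      deriv[x][mu]          += 1.0;  /* site x is private to this thread: safe */
      deriv[idn[x][mu]][mu] += 1.0;  /* neighbour may belong to another thread: data race */
    }
  }
}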
update_gauge
More than 10% of the time is spent in update_gauge because there is no OpenMP there. Here, 12% of thread idling is localized! Between the two of them, update_gauge and deriv_Sb account for over 40% of thread idling!
deriv_Sb threaded
Implementing _trace_lambda_mul_add_assign with "tm_atomic" pragmas gives a factor 15 improvement for deriv_Sb and a total runtime saving of about 20 percent! (from 17 to 14 minutes)
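For reference, a sketch of the protected accumulation (hypothetical, reduced to a single su3adj component; the real _trace_lambda_mul_add_assign macro updates all eight components). The "tm_atomic" pragma is the IBM XL transactional-memory construct on BG/Q and needs -qtm at compile time; other compilers will simply ignore the pragma.
typedef struct { double d1, d2, d3, d4, d5, d6, d7, d8; } su3adj_sketch;

/* Several threads may hit the same site's derivative, so the
 * read-modify-write is wrapped in a transaction (one component shown). */
static void trace_lambda_add_one_component(su3adj_sketch *d, double factor, double contrib) {
  #pragma tm_atomic
  {
    d->d1 += factor * contrib;
  }
}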
square_norm
The non-OpenMP routines square_norm, scalar_prod, assign_add_mul_r and assign_mul_add_r account for a total of a little more than 10%. These need to be parallelized, but with lower priority than, say, update_gauge.
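The pattern needed for these routines is a straightforward OpenMP reduction; a minimal sketch (not the tmLQCD implementation, which works on spinor fields rather than plain arrays):
/* Sum of squares over a local array: OpenMP gives each thread a private
 * partial sum and combines them at the end of the loop. Any MPI reduction
 * over the per-rank results is unchanged by this. */
double square_norm_sketch(const double *s, int n) {
  double res = 0.0;
  #pragma omp parallel for reduction(+ : res)
  for (int i = 0; i < n; ++i) {
    res += s[i] * s[i];
  }
  return res;
}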
square_norm threaded
The improvement for square_norm is excellent with a speedup of close to 40 and almost no management or idling overhead! Total runtime decreased to 11 minutes.
scalar_prod threaded
The improvement here is similar to that above.
update_gauge threaded
The next target of opportunity is update_gauge, which now takes up almost 15 percent of the total wallclock time. The problem here is that we hit a bug [1], more or less confirmed by Dirk Pleiter, whereby this kind of local multithreaded update seemed to fail for some reason... and it still fails!!!
[1] https://github.com/etmc/tmLQCD/issues/121
Adding OpenMP parallelism to update_gauge gives a factor 25 speedup... but it does not produce correct results, so this is a roadblock. I've tried adding ALIGN qualifiers to no avail. The next step will be to take it apart bit by bit to see what exactly is causing the problem!
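For context, the kind of loop being threaded here, in heavily simplified form (hypothetical sketch, not the actual update_gauge code, which exponentiates the momenta and multiplies su3 matrices):
/* Each gauge link U(x, mu) is updated from its own momentum only, so in
 * principle a plain parallel for over sites should be safe. */
void update_gauge_sketch(double *u, const double *mom, double step, int volume) {
  #pragma omp parallel for
  for (int x = 0; x < volume; ++x) {
    for (int mu = 0; mu < 4; ++mu) {
      int i = 4 * x + mu;
      u[i] += step * mom[i];  /* stand-in for U <- exp(step * P) * U */
    }
  }
}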
After a great number of tests it seems that between 32_64 and 512_4, 512_4 is the clear performance winner for HMC3. It is of course possible that 128_16 or 256_8 could perform better, but this has not yet been tested!
This scalasca run, with all the important functions OpenMP-parallelized except for update_gauge, shows that in this configuration thread management seems to be a non-issue. Thread idling occurs only during MPI waits and in the non-parallelized update_gauge, adding up to only 10 percent of the total time. Thread management and MPI management account for only about 5% of the total time each! (By contrast, with 64 threads thread management takes up close to 40% of our total time, leading to severe thread idling!)
update_gauge with OpenMP HMC3, 32_64/512_4
Having resolved the OpenMP problem in update_gauge, this final roadblock has been removed. Running tmcloverdet on 32_64 and 512_4, the total runtime of the 32_64 run is two thirds of the 512_4 run. With LEMON turned off, both runs spent 20 seconds writing the gauge field at the end (no checkpoints were written). Right from the start, though, it seems that the 32_64 version has trouble with the solver, with bicg taking up a factor of 5 more time in det_acc and elsewhere. Part of the imbalance arises from the lack of OpenMP in the BGQ versions of scalar_prod and co. Mainly, however, it seems that the functions called along with the hopping matrix are less efficient in 32_64. For some reason the for loops in the hopping matrix have not been instrumented by scalasca, so no detailed analysis is possible yet. This is because the hopping body is pulled in via an include.
Having copied the contents of halfspinor_body into Hopping_Matrix.c, I can now get a detailed look at what happens inside the hopping matrix. In the 512_4 version the for loops take longer but the barrier is much less expensive. On the other hand, however, there seems to be a huge load on the MPI subsystem: with 16 threads calling MPI functions, the communication block takes a factor of 50 (!!) longer to complete. This could be alleviated by using one of the ASYNC and MULTIPLE environment variables...
The 32_64 version spends much more time in functions involving the random number generator because there is no OpenMP there. Moving on, it seems that calling Qtm_plus_psi is a factor of 2 less efficient in 32_64 despite the good benchmark result. Although the hopping matrix itself is faster in 32_64, the surrounding functions from tm_operators take a factor of 4 more time. The volume loops are in fact more efficient (taking about half the time), but the barriers and thread management at the entry to the parallel section really hit hard, being at least a factor of 4 slower.
The gauge derivative and update_gauge functions are much more efficient in 32_64 (by an order of magnitude!) while the det_derivative is only marginally more efficient. The assign_mul_add... functions are more efficient by a factor of almost 2 in 512_4. This is again due to thread-management and barriers. Perhaps different scheduling could help here!
The hopping matrix, as called from Qtm_pm_psi, is more efficient in 32_64 by a factor of 1.8 or so. The reason for this is chiefly that calling xchange_halffield seems to waste a lot of time in the 512_4 case, while it takes only a minimal fraction of the total time spent in the function in the 32_64 case. The major time-eater here is, for some reason, the Waitall. (It takes a very short time in 32_64, while in 512_4 it takes almost three orders of magnitude longer!)
In 512_4 thread idling is almost a non-issue; strangely enough, though, it does affect the "nocomm" hopping matrix. This could be a side-effect of the "omp single" section, which in the nocomm version of the hopping matrix serves only as a barrier, with no threads launched inside. By contrast, in the version with communication, MPI will be operating threads. This particular measurement should thus be taken with a grain of salt.
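The structure being discussed, in rough outline (hypothetical sketch of how a communication step can sit inside the parallel region; not the actual Hopping_Matrix code):
/* One thread drives the boundary exchange while the others wait at the
 * implicit barrier of the single construct; in the "nocomm" variant the
 * body is empty and the construct degenerates into a pure barrier. */
void hopping_sketch(int do_comm) {
  #pragma omp parallel
  {
    /* ... first loop over the local volume ... */

    #pragma omp single
    {
      if (do_comm) {
        /* xchange_halffield(); MPI_Waitall(...); */
      }
    }

    /* ... second loop using the received boundary data ... */
  }
}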
In the 512_4 version OpenMP management overhead is in the 4% region while MPI accounts for about 6%. In contrast, in the 32_64 version OpenMP management accounts for about 20% of total time! MPI on the other hand, costs less than one percent of total time here.
It must be noted that the scalasca overhead affects the readings here because without scalasca, the 32_64 benchmark is substantially faster than the 512_4 one. (401 vs 360 Mflops per thread)
This sample traditionally has a very strong contribution from sw_all. Let's see whether we get errors from the potential write conflicts with 64 threads and whether the tm_atomic implementation fixes that and is sufficiently fast.
In terms of pure performance without scalasca 512_4 is faster than 32_64 (about 10%).
For 512_4 we have very little thread idling (total 10%), originating mostly from MPI communications. The OpenMP and MPI overheads account for a total of just under 5%! More than 50% of total time is spent in sw_all (from cloverdet_derivative). This is really problematic because in terms of visits, sw_all accounts for only 0.02 percent!
For 32_64 the computational balance between sw_all and cg_her becomes better, with both using about 20% of the total time. Overall performance is lower, but this is mainly because of update_gauge lacking OpenMP parallelization, as over 35% of total time is spent there! Total OpenMP overhead is manageable at about 10% of total time. One has to keep in mind that for 32_64 the MPI overhead is nonexistent, so this performs much better than for HMC3!
tm_atomic -> omp atomic in trace_lambda_mul_add_assign
Changing from the "tm_atomic" to the "omp atomic" pragma in the solution of the non-locality problem of trace_lambda_mul_add_assign results in a significant performance increase in sw_all. The total time spent there for 512_4 goes down from 51% to 24%. Dropping both "omp atomic" and "tm_atomic" altogether reduces the impact of sw_all on total time to 19%. Although no loss of acceptance is observed, this could be down to the thermalisation and the progress thereof. In the 32_64 version the effect is even more pronounced and the time spent in sw_all is reduced to 8 and 5 percent respectively, much more in line with the very low number of total calls there.
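The change itself is tiny; a sketch of the "omp atomic" variant for one component, reusing the hypothetical su3adj_sketch type from the earlier sketch (same simplification as above):
/* Portable alternative to the transactional-memory version: protect just
 * the single read-modify-write instead of a whole transaction. */
static void trace_lambda_add_one_component_atomic(su3adj_sketch *d, double factor, double contrib) {
  #pragma omp atomic
  d->d1 += factor * contrib;
}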
update_gauge threaded, 32_64, 512_4
The main finding is that many of the functions like sw_term, sw_invert etc. work really well and the parallelization gives the same performance with 512_4 and 32_64. However, the functions called from Qsw_pm_psi seem to be a factor of 2 slower with 32_64 for some reason, despite the fact that the 32_64 hopping matrix is faster when used in the benchmark application. It is again clear that there is a balance between MPI and OMP overhead which is difficult to understand with just scalasca.
There is a further worrying observation in that the linear algebra functions, assign_mul_add_r and others, take a factor of 5 more time with 32_64 for some reason. From the scalasca data it seems that the loop body is very lean and that the thread management overhead is the culprit for the bad scaling: the relevant initial parallel sections and barriers take anywhere from 2 to 10 times longer to complete.
The same is true for the hopping matrix itself, clover_inv and clover_gamma5. This becomes clear when attempting guided scheduling there. Suddenly the loop is fat and the barrier is very lean. Again, there is probably an optimal point which can be investigated by using guided scheduling with varying minimum chunk sizes.
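As a concrete illustration of what could be tried (hypothetical; the minimum chunk size of 32 sites is just a placeholder to be tuned):
/* Guided scheduling hands out large chunks first and progressively smaller
 * ones, trading a fatter loop for a cheaper final barrier; the minimum
 * chunk size is the tuning knob mentioned above. */
void volume_loop_sketch(double *out, const double *in, int volume) {
  #pragma omp parallel for schedule(guided, 32)
  for (int x = 0; x < volume; ++x) {
    out[x] = 2.0 * in[x];  /* stand-in for the real per-site work */
  }
}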
For functions which exit quickly, node-local OpenMP is slightly less efficient than the 512_4 hybrid approach, with barriers and thread management being much less efficient (by huge factors). For larger, more complex chunks of code, OpenMP becomes much more efficient than the hybrid approach, sometimes by as much as a factor of 10, mainly due to large MPI_Waitall overheads. On the other hand, the OpenMP overhead in the hybrid approach is much lower (by a similar factor)!
There seems to be a very careful balance between the cost of MPI and the cost of OpenMP which hit at opposite ends of the parallelization spectrum, suggesting that there should be an optimal middle ground where total performance is maximized.
To compile an instrumented executable:
module load UNITE scalasca
cd builddir
export SKIN_MODE=none #otherwise the compiler doesn't work during configure
../configure [...] CC="skin $XLCDIR/mpixlc_r"
unset SKIN_MODE #when compiling we want skinning to work, of course
make
To run an instrumented executable and generate an epik experiment:
# @ job_name = BGQ_hmc3_hybrid_32_64_hs_scalasca
# @ error = $(job_name).$(jobid).out
# @ output = $(job_name).$(jobid).out
# @ environment = COPY_ALL;
# @ wall_clock_limit = 00:15:00
# @ notification = always
# @ notify_user = [email protected]
# @ job_type = bluegene
# @ bg_connectivity = TORUS
# @ bg_size = 32
# @ queue
module load UNITE scalasca
export NAME=BGQ_hmc3_hybrid_32_64_hs_scalasca
export OMP_NUM_THREADS=64
export ESD_BUFFER_SIZE=10000000 # this overflows with many MPI processes, so set to large number
if [[ ! -d ${WORK}/${NAME} ]]
then
mkdir -p ${WORK}/${NAME}
fi
cd ${WORK}/${NAME}
cp /homea/hch02/hch028/code/tmLQCD.urbach/build_bgq_hybrid/hmc3_32_64.input ${WORK}/${NAME}/hmc.input
scan runjob --np 32 "--ranks-per-node 1" "--cwd ${WORK}/${NAME}" -- ${HOME}/code/tmLQCD.urbach/build_bgq_hybrid_hs_scalasca/hmc_tm
To analyze an epik experiment:
module load UNITE scalasca
cd $WORK/$NAME
square epik*
Since the BGQ is so full, I'm approximating the effect on BG/Q by using the Zeuthen wgs and running a pure OpenMP version of the code. Early in the morning and after work hours one can get good measurements here. The machine supports 16 concurrent threads, so that's nice.
Given the modest cache size of 8MB per CPU I will be using a local volume of 8^4 only. The raw benchmark result is about 17500 Mflops for the halfspinor version. The CG for HMC3 gives around 11500 Mflops.
The local volume being smaller and the number of threads lower, the effects of deriv_Sb and update_gauge are not nearly as pronounced as on BG/Q. update_gauge and deriv_Sb contribute about 10 and 6 percent, respectively, to total time spent and to thread idling.
There are contributions of about 3-4 percent each from square_norm and scalar_prod.
The effect here is very good, with a speedup of about 14 in the total time spent in update_gauge.
OpenMP in deriv_Sb brings a factor 10 improvement. Clearly, the atomic statements in _trace_lambda_mul_add_assign have some overhead, but it is manageable!