Tuning CG on BGQ
This page is meant as a record of a few measurements I've been conducting with Scalasca on 512 nodes of BG/Q. The measurements refer to 5 volume source inversions on a 48^3x96 configuration. Note, however, that this is for a very heavy quark mass: the CG reaches a residual of O(1e-19) in only 808 iterations.
Ignoring I/O and source preparation (which amount to more than 25% of the total time! 12% alone is spent writing propagators!), all percentages below are normalized to the time spent in cg_her, which accounts for 74.33% of the total run time. Of this, 56% is spent applying Qtm_pm_psi, with the usual overheads of the hopping matrix. The remaining 44% is spent on linear algebra; a schematic sketch of where these routines enter a CG iteration is given after the list below.
Starting situation
- 56% Qtm_pm_psi
- 11.8% scalar_prod_r
- 9.8% assign_add_mul_r
- 12.2% assign_mul_add_r_and_square
- 9.8% assign_add_mul_r (second call)
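To see where these calls come from, here is a minimal, single-threaded sketch of one CG iteration on the normal operator. This is a hypothetical plain-C version using real-valued arrays, not the actual cg_her, which works on spinor fields; the comments indicate which of the profiled routines each sweep presumably corresponds to (in particular, the norm computation and the search-direction update are presumably what the fused assign_mul_add_r_and_square covers).

```c
#include <stddef.h>

/* One CG iteration on the normal operator, schematically. apply_Qsq stands in
 * for Qtm_pm_psi; x is the solution, r the residual, p the search direction,
 * w a work vector, and *normsq holds ||r||^2 from the previous iteration. */
void cg_iteration_sketch(double *x, double *r, double *p, double *w,
                         double *normsq, size_t N,
                         void (*apply_Qsq)(double *out, const double *in))
{
  double pro = 0.0, err = 0.0, alpha, beta;

  apply_Qsq(w, p);                 /* w = Q^+ Q p: Qtm_pm_psi, 56% */

  for (size_t i = 0; i < N; i++)   /* scalar_prod_r: pro = <p, w>, */
    pro += p[i] * w[i];            /* followed by an MPI_Allreduce */

  alpha = *normsq / pro;

  for (size_t i = 0; i < N; i++)   /* assign_add_mul_r: x += alpha * p */
    x[i] += alpha * p[i];

  for (size_t i = 0; i < N; i++)   /* assign_add_mul_r (2nd call): r -= alpha * w */
    r[i] -= alpha * w[i];

  for (size_t i = 0; i < N; i++)   /* new residual norm, again needing a global sum */
    err += r[i] * r[i];

  beta = err / *normsq;

  for (size_t i = 0; i < N; i++)   /* search-direction update: p = beta * p + r;  */
    p[i] = beta * p[i] + r[i];     /* these last two sweeps presumably correspond */
                                   /* to the fused assign_mul_add_r_and_square    */
  *normsq = err;
}
```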
In the linear algebra routines that involve a collective (scalar_prod_r and assign_mul_add_r_and_square, which both require a global sum), the time breaks down approximately as follows (a sketch of the typical structure of such a routine is given after this list):
- 20% waiting in MPI_Allreduce
- 50% inside the OpenMP parallel section doing the actual linear algebra
- 30% outside of the parallel section, which, as far as I understand, is usually a good measure of OpenMP overhead
The routines without collectives show a similar breakdown, with slightly different percentages due to the absence of the MPI_Allreduce.
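As referenced above, here is a hedged sketch of how such a routine is typically structured in a hybrid MPI+OpenMP code (this is not the actual tmLQCD scalar_prod_r, which operates on spinor fields): the body of the OpenMP parallel region is the linear algebra part, the MPI_Allreduce is the collective, and the work before entering and after leaving the parallel region (thread wake-up, combining thread-private partial sums, and so on) shows up as time outside the parallel section.

```c
#include <mpi.h>

/* Hypothetical hybrid scalar product over the node-local volume N. */
double scalar_prod_sketch(const double *s, const double *r, const int N)
{
  double local = 0.0, global = 0.0;

  /* ~50%: node-local reduction inside the OpenMP parallel section */
  #pragma omp parallel for reduction(+: local)
  for (int i = 0; i < N; i++)
    local += s[i] * r[i];

  /* ~20%: global sum across all ranks; load imbalance between ranks
   * also shows up as waiting time here */
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  /* the remaining ~30% is fork/join and reduction overhead around the
   * parallel region, i.e. time spent outside of it */
  return global;
}
```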
A comparison with a pure MPI run should help to clarify these points. In the following tests, aspects of the linear algebra routines will be modified and the effect quantified here.
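As one purely illustrative example of the kind of modification that could be tried (not necessarily one of the changes actually measured below), the residual update r -= alpha*w could be fused with the computation of the new ||r||^2: this saves one pass over memory and one OpenMP fork/join while still requiring only a single MPI_Allreduce.

```c
#include <mpi.h>

/* Hypothetical fused kernel: r -= alpha * w and return the new ||r||^2. */
double assign_add_mul_r_and_square_sketch(double *r, const double *w,
                                          const double alpha, const int N)
{
  double local = 0.0, global = 0.0;

  #pragma omp parallel for reduction(+: local)
  for (int i = 0; i < N; i++) {
    r[i] -= alpha * w[i];     /* residual update */
    local += r[i] * r[i];     /* accumulate the new squared norm in the same pass */
  }

  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return global;
}
```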