Information on how to run SWIFT with Scotch mapping and the test environment used on Cosma 8. Code has been tested with Scotch version 7.0.2.
Last update 18th August 2023.
Obtaining Scotch (as per the GitLab repo)
Scotch is publicly available under the CeCILL-C free software license, as described here. The license itself is available here.
To use the latest version of Scotch, please clone the master branch:
```bash
git clone [email protected]:scotch/scotch.git
```
Tarballs of the Scotch releases are available here.
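Since this page was tested against Scotch 7.0.2, you may want to build that release rather than the tip of master. Assuming the upstream tags follow the usual v-prefixed naming, this can be done with:
```bash
cd scotch
git tag -l            # list the available release tags
git checkout v7.0.2   # tag name assumed; pick the release you want
```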
Environment
```bash
module load cosma/2018 python/3.6.5 intel_comp/2022.1.2 compiler openmpi/4.1.4 fftw/3.3.9 parallel_hdf5/1.12.0 parmetis/4.0.3-64bit metis/5.1.0-64bit gsl/2.5
module load cmake
module load bison
```
Navigate to the Scotch directory and carry out the following commands:
```bash
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/path-to-install-dir ..
make -j5
make install
```
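If the build succeeds, the install prefix should contain the headers and libraries that SWIFT's configure step will look for; a quick check (assuming the standard Scotch install layout) is:
```bash
ls /path-to-install-dir/include   # expect scotch.h (and related headers)
ls /path-to-install-dir/lib       # expect the libscotch and libscotcherr libraries
```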
Follow the usual SWIFT installation instructions, but if Scotch is installed locally the additional `--with-scotch=/full/scotch/install/dir/path/` flag will need to be passed to `./configure`.
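For example, a minimal configure invocation might look like the following sketch (the install path is a placeholder, and any other options you normally build SWIFT with should be added as usual):
```bash
./autogen.sh   # only needed once, for a fresh git checkout of SWIFT
./configure --with-scotch=/full/scotch/install/dir/path/   # plus your usual SWIFT configure options
make
```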
There are also two beta-testing Scotch modules available at the time of this writing (beta status being indicated by the dot before the version number, `.7.0.4`):
You can set up the modules environment as
```bash
module load cosma/2018 python/3.6.5 intel_comp/2022.1.2 compiler openmpi/4.1.4 fftw/3.3.9 parallel_hdf5/1.12.0 parmetis/4.0.3-64bit metis/5.1.0-64bit gsl/2.5
```
Then,
```bash
module load scotch/.7.0.4-32bit
```
or
```bash
module load scotch/.7.0.4-64bit
```
depending on whether you want the 32-bit or the 64-bit version of Scotch.
Warning on compiler include ordering
You can use the modules environment to load a suitable Scotch module. On Cosma 8 you can use `module load scotch/.7.0.4-32bit` or `module load scotch/.7.0.4-64bit` to get the 32-bit or the 64-bit build of Scotch. At the time of this writing, both the 32-bit and 64-bit versions can be loaded together, i.e. there is no exclusion check in the module files. There could be cases where both are needed, for testing or development purposes for example. Then, care must be taken to avoid unintentional compile-time behaviour: the order of the `-I/.../...` command-line switches that find their way into the Makefile(s) follows the order in which the modules were loaded, and the compiler searches the include paths in that order -- so if you load the 32-bit module first, its corresponding `-I` switch will be placed first, and if the 64-bit module is loaded afterwards, its switch will be placed after it, and so on.
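If you need to check which Scotch include directories the build will actually see first, one quick sanity check after running `./configure` is to inspect the `-I` flags that ended up in the generated Makefiles. This assumes the usual autotools layout with a top-level `Makefile` and one in `src/`, and that the Scotch install paths contain "scotch" in the directory name; adjust the pattern otherwise:
```bash
# List the Scotch-related include switches in order; the first match wins at compile time
grep -n -o -- '-I[^ ]*scotch[^ ]*' Makefile src/Makefile
```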
Also, when using a locally built Scotch package, it is advised that you do not load any Scotch module into the environment beforehand, and instead run `./configure` with the `--with-scotch=/full/scotch/install/dir/path/` flag, in order to make sure that the Makefile picks up the correct include and library files.
Scotch carries out a mapping of a source (or process) graph onto a target (or architecture) graph. The weighted source graph is generated by SWIFT and captures the computation and communication cost across the computational domain. The target graph defines the communication cost across the available computing architecture. Therefore, to make use of the Scotch mapping algorithms, a target architecture file (`target.tgt`) must be generated, and it should mirror the setup of the cluster being used. Scotch provides optimised architecture files which capture most HPC setups. For Cosma 8 runs we will be targeting NUMA regions, therefore we have modelled the architecture as a `tleaf` structure.
In the following examples it is assumed that one MPI rank is mapped to each Cosma 8 NUMA region. This requires that `cpus-per-task=16` is set in the SLURM submission script. The Cosma 8 nodes consist of 8 NUMA regions per node, with 4 NUMA regions per socket. Example `tleaf` files for various setups are given below, where the intra-socket communication cost between NUMA regions is set to 5, the intra-node but cross-socket cost is set to 10, and the inter-node cost is set to 1000. These weightings are estimated values but have been shown to give satisfactory results in the test cases explored. An estimate of the relative latency between NUMA regions on a node can be obtained with hwloc, specifically by using `hwloc-distances`.
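For example, the NUMA layout and distance matrix of a node can be inspected with the hwloc command-line tools (the exact output format depends on the hwloc version installed):
```bash
lstopo-no-graphics --no-io   # summary of sockets, NUMA nodes and cores
hwloc-distances              # relative latency matrix between NUMA nodes
```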
Example tleaf graphs to represent various Cosma8 configurations
Number of nodes | Number of MPI ranks | tleaf |
---|---|---|
1 | 2 | tleaf 1 2 5 |
1 | 8 | tleaf 2 2 10 4 5 |
2 | 16 | tleaf 3 2 1000 2 10 4 5 |
4 | 32 | tleaf 3 4 1000 2 10 4 5 |
8 | 64 | tleaf 3 8 1000 2 10 4 5 |
The first index denotes the number of layers of the `tleaf` structure and the subsequent index pairs are the number of vertices in a layer and the cost to communicate between them. For example, the numbers in the 64 MPI rank case (`tleaf 3 8 1000 2 10 4 5`) read, from left to right, as follows: `3` denotes the number of layers in the tleaf structure; `8` denotes the number of vertices in the first layer (which corresponds to 8 compute nodes); `1000` is the relative cost of inter-node communication; `2` denotes the number of vertices in the second layer (the number of sockets on each node); `10` is the relative cost of inter-socket communication; `4` denotes the number of vertices in the final layer (the number of NUMA regions per socket on Cosma 8); and finally `5` is the relative cost of intra-socket communication.
The user needs to define this `tleaf` structure and save it as `target.tgt` in the directory they will run SWIFT from. Ongoing work focuses on automatically generating this target architecture at run time.
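For example, for an 8-node (64 MPI rank) run, the target file can be written directly in the run directory:
```bash
# Target architecture for 8 Cosma 8 nodes (2 sockets per node, 4 NUMA regions per socket)
cat > target.tgt << 'EOF'
tleaf 3 8 1000 2 10 4 5
EOF
```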
With OpenMPI, the `mpirun` option `--map-by numa` has been found to be optimal. This is in contrast to the `--bind-to none` option previously suggested on the Cosma 8 site.
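A minimal SLURM submission sketch for the 8-node case is given below; the binary name `swift_mpi`, the partition name, the parameter file and the run-time flags are illustrative and should be adapted to your own build and simulation:
```bash
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8   # one MPI rank per NUMA region
#SBATCH --cpus-per-task=16    # 16 cores per NUMA region on Cosma 8
#SBATCH --partition=cosma8    # partition name assumed; use your own allocation

# Load the environment described above, then run from the directory
# containing target.tgt (here: tleaf 3 8 1000 2 10 4 5).
mpirun --map-by numa ./swift_mpi --threads=16 parameter_file.yml   # add your usual SWIFT run-time flags
```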
Scotch carries out the mapping using various strategies which are outlined in the documentation; the strategies trialled include `SCOTCH_STRATDEFAULT`, `SCOTCH_STRATBALANCE`, `SCOTCH_STRATQUALITY` and `SCOTCH_STRATSPEED`. The Scotch strategy string is passed to the mapping functions. For the runs carried out here it was found that the global flag `SCOTCH_STRATBALANCE` and an imbalance ratio of `0.05` worked best. These values are passed to `SCOTCH_stratGraphMapBuild`.
One issue with Scotch is that when the number of MPI ranks is comparable to the dimensionality of the modelled SWIFT system, the optimal mapping strategy does not necessarily map to all available NUMA regions. At present this is not handled explicitly in the code, and the partition reverts to a vectorised or previous partitioning.
The SWIFT edge and vertex weights are estimated in the code; however, the edge weights are not symmetric, and this causes an issue with SWIFT. Therefore, in the Scotch graph the edge weights are updated to be the average value (sum/2) of the two associated edge weights as calculated by SWIFT.
The following results were obtained on Cosma 8 running with the `SCOTCH_STRATBALANCE` strategy string and an imbalance ratio of `0.05`.
Testcase | Resources | flags | Node type | Scotch32 (s) | Scotch64 (s) | Scotch local (s) | ParMetis (s) | Metis (s) |
---|---|---|---|---|---|---|---|---|
EAGLE_006 | nodes = 1 (8 NUMA regions) | --map-by numa | Milan | 1191.8 | 1198.2 | 1173.6 | 1167.4 | 1176.4 |
-//- | -//- | -//- | Milan | 1176.7 | 1184.4 | 1193.8 | 1212.5 | 1182.1 |
-//- | -//- | -//- | Milan | 1174.5 | 1183.6 | 1175.2 | 1229.4 | 1180.7 |
-//- | -//- | -//- | Rome | 1368.8 | 1322.9 | 1351.5 | 1332.8 | 1334.9 |
-//- | -//- | -//- | Rome | 1378.3 | 1373.8 | 1353.4 | 1332.3 | 1346.8 |
-//- | -//- | -//- | Rome | 1367.3 | 1395.0 | 1361.0 | 1331.2 | 1330.8 |
-//- | nodes = 2 (16 NUMA regions) | -//- | Milan | 1191.2 | 1225.8 | 1167.6 | 1154.0 | 1159.7 |
-//- | -//- | -//- | Milan | 1147.9 | 1185.0 | 1158.2 | 1168.3 | 1163.4 |
-//- | -//- | -//- | Milan | 1162.0 | 1180.4 | 1149.3 | 1147.7 | 1157.3 |
-//- | -//- | -//- | Rome | 1358.3 | 1538.3 | 1325.5 | 1338.8 | 1344.8 |
-//- | -//- | -//- | Rome | 1355.8 | 1519.3 | 1338.2 | 1390.1 | 1336.8 |
-//- | -//- | -//- | Rome | 1347.1 | 1395.0 | 1336.0 | 1345.4 | 1338.7 |
EAGLE_025 | nodes = 2 (16 NUMA regions) | --map-by numa | Milan | 7546.8 | 7450.0 | 7564.7 | 7202.0 | 7302.3 |
-//- | -//- | -//- | Milan | 7508.8 | 7490.4 | 7506.6 | 7416.3 | 7291.3 |
-//- | -//- | -//- | Milan | 7447.6 | 7516.5 | 7548.1 | 7093.1 | 7293.0 |
-//- | -//- | -//- | Rome | 8616.0 | 8660.5 | 8524.5 | 8810.5 | 8309.2 |
-//- | -//- | -//- | Rome | 8652.4 | 8492.7 | 8594.7 | 7955.9 | 8312.3 |
-//- | -//- | -//- | Rome | 8664.1 | 8565.2 | 8621.9 | 7946.7 | 8235.6 |
EAGLE_050 | nodes = 4 (32 NUMA regions) | --map-by numa | Milan | 45437.0 | 45714.8 | 45287.9 | 45110.4 | |
-//- | -//- | -//- | Milan | 45817.5 | 45128.4 | 45047.3 | 42131.7 | |
-//- | -//- | -//- | Milan | 45483.4 | 45213.9 | 45219.3 | 43263.8 | |
-//- | -//- | -//- | Rome | 51754.3 | 54724.4 | 51907.1 | 51315.1 | |
-//- | -//- | -//- | Rome | 51669.3 | 54213.8 | 51320.5 | 48338.0 | |
-//- | -//- | -//- | Rome | 51689.4 | 53387.0 | 51563.9 | 49702.8 | |
EAGLE_050 | nodes = 8 (64 NUMA regions) | --map-by numa | Milan | 36202.3 | 37158.3 | 36111.7 | 37006.0 | |
-//- | -//- | -//- | Milan | 36097.3 | 36503.1 | 36228.0 | 37344.7 | |
-//- | -//- | -//- | Milan | 36113.3 | 36155.7 | 36222.0 | 35488.4 | |
-//- | -//- | -//- | Rome | 42012.6 | 41790.7 | 41864.3 | 41723.6 | |
-//- | -//- | -//- | Rome | 41866.8 | 41517.0 | 41772.9 | 40533.6 | |
-//- | -//- | -//- | Rome | 43630.5 | 41419.9 | 41752.4 | 40628.3 | |
Note:
The Cosma 8 cluster currently comprises:
* 360 compute nodes with 1 TB RAM and dual 64-core AMD EPYC 7H12 water-cooled processors at 2.6GHz ("Rome"-type nodes)
* Upgraded with an additional 168 nodes with dual Milan 7763 processors at 2.45GHz ("Milan"-type nodes)
- If you pass the command-line switch `--enable-debugging-checks` to `./configure` while building, be prepared for additional logging and dump files.
- For large simulations, the amount of data dumped can easily overflow one's quota allowance; for example, the EAGLE_100 models can dump files on the order of 1 TiB, depending on the runtime configuration, which is specified in a `.yml` file. Please check the SWIFT docs. To keep dumping and logging to a minimum, you can put the following sections / lines, for example, in a file `eagle_100.yml`:
```yaml
# Parameters governing the snapshots
Snapshots:
  select_output_on: 1
  select_output: output_list.yml
  basename: eagle             # Common part of the name of output files
  scale_factor_first: 0.91    # Scale-factor of the first snapshot (cosmological run)
  time_first: 0.01            # Time of the first output (non-cosmological run) (in internal units)
  delta_time: 1.01            # Time difference between consecutive outputs (in internal units)
  recording_triggers_part: [1.0227e-4, 1.0227e-5]   # Recording starts 100M and 10M years before a snapshot
  recording_triggers_bpart: [1.0227e-4, 1.0227e-5]  # Recording starts 100M and 10M years before a snapshot
```
and the contents of `output_list.yml` can be:
```yaml
Default:
  Standard_Gas: off
  Standard_DM: off
  Standard_DMBackground: off
  Standard_Stars: off
  Standard_BH: off
```
In the `eagle_100.yml` we can also specify that we do not want the restart dumps:
```yaml
Restarts:
  enable: 0
```
- As seen in the table above, the current Scotch implementation is comparable in performance to the ParMETIS (METIS) implementation on problem sizes up to EAGLE_050. However, the current implementation runs into difficulties on the EAGLE_100 testcase. The Scotch partition in this case causes two separate errors: a memory overflow when running across 8 Cosma 8 nodes, and, on 16 Cosma 8 nodes, a partition in which certain ranks have more than 64 MPI proxies, which is a hard limit set within SWIFT. Ongoing work is focused on resolving this issue.
- The current implementation has only been tested against periodic domains, i.e. where each vertex in SWIFT has exactly 26 neighbours. Additions to the edge-mean calculation in the `pick_scotch` function will be needed to support non-periodic domains.
- Implementing the parallel version, PT-Scotch, should improve performance for large node-count runs.
- Further improvement could be achieved by accurately obtaining, at run time, the cost of communicating across the resources provided by the scheduler. The above approach using the pre-generated `tleaf` file is an approximation; tools like netloc, which derive representative latency values from the network fabric, would be the optimal solution. To begin with, this would require admin access to run the setup commands that generate an overall graph of the whole HPC system. This graph structure would then be referenced at run time, together with the allocated node IDs, to build up an accurate representation of the available compute.