Information on how to run SWIFT with Scotch mapping and the test environment used on Cosma 8. Code has been tested with Scotch version 7.0.2.
Last update 18th August 2023.
Obtaining Scotch (as per the GitLab repo)
Scotch is publicly available under the CeCILL-C free software license, as described here. The license itself is available here.
To use the latest version of Scotch, please clone the master branch:
```bash
git clone [email protected]:scotch/scotch.git
```
Tarballs of the Scotch releases are available here.
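Since this page was tested against Scotch 7.0.2, you may want to build that release rather than the tip of master. Assuming the upstream tags follow the usual v-prefixed naming, this can be done with:
```bash
cd scotch
git tag -l            # list the available release tags
git checkout v7.0.2   # tag name assumed; pick the release you want
```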
Environment
```bash
module load cosma/2018 python/3.6.5 intel_comp/2022.1.2 compiler openmpi/4.1.4 fftw/3.3.9 parallel_hdf5/1.12.0 parmetis/4.0.3-64bit metis/5.1.0-64bit gsl/2.5
module load cmake
module load bison
```
Navigate to the Scotch directory and carry out the following commands:
```bash
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/path-to-install-dir ..
make -j5
make install
```
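If the build succeeds, the install prefix should contain the headers and libraries that SWIFT's configure step will look for; a quick check (assuming the standard Scotch install layout) is:
```bash
ls /path-to-install-dir/include   # expect scotch.h (and related headers)
ls /path-to-install-dir/lib       # expect the libscotch and libscotcherr libraries
```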
Follow the usual SWIFT installation instructions, but if Scotch is installed locally the additional `--with-scotch=/full/scotch/install/dir/path/` flag will need to be passed to `./configure`.
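For example, a minimal configure invocation might look like the following sketch (the install path is a placeholder, and any other options you normally build SWIFT with should be added as usual):
```bash
./autogen.sh   # only needed once, for a fresh git checkout of SWIFT
./configure --with-scotch=/full/scotch/install/dir/path/   # plus your usual SWIFT configure options
make
```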
There are also two beta-testing Scotch modules available at the time of this writing (beta status being indicated by the dot before the version number, `.7.0.4`):
You can set up the modules environment as
```bash
module load cosma/2018 python/3.6.5 intel_comp/2022.1.2 compiler openmpi/4.1.4 fftw/3.3.9 parallel_hdf5/1.12.0 parmetis/4.0.3-64bit metis/5.1.0-64bit gsl/2.5
```
Then,
```bash
module load scotch/.7.0.4-32bit
```
or
```bash
module load scotch/.7.0.4-64bit
```
depending on whether you want the 32-bit or the 64-bit version of Scotch.
Warning on compiler include ordering
You can use the modules environment to load a suitable Scotch module. On Cosma 8 you can use `module load scotch/.7.0.4-32bit` or `module load scotch/.7.0.4-64bit` to get the 32-bit or the 64-bit build of Scotch. At the time of this writing, both the 32-bit and 64-bit versions can be loaded together, i.e. there is no exclusion check in the module files. There could be cases where both are needed, for testing or development purposes for example. Then, care must be taken to avoid unintentional compile-time behaviour: the order of the `-I/.../...` command-line switches that find their way into the Makefile(s) follows the order in which the modules were loaded, and the compiler searches the include paths in that order -- so if you load the 32-bit module first, its corresponding `-I` switch will be placed first, and if the 64-bit module is loaded afterwards, its switch will be placed after it, and so on.
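If you need to check which Scotch include directories the build will actually see first, one quick sanity check after running `./configure` is to inspect the `-I` flags that ended up in the generated Makefiles. This assumes the usual autotools layout with a top-level `Makefile` and one in `src/`, and that the Scotch install paths contain "scotch" in the directory name; adjust the pattern otherwise:
```bash
# List the Scotch-related include switches in order; the first match wins at compile time
grep -n -o -- '-I[^ ]*scotch[^ ]*' Makefile src/Makefile
```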
Also, when using a locally built Scotch package, it is advised that you do not load any Scotch module into the environment beforehand, and instead run `./configure` with the `--with-scotch=/full/scotch/install/dir/path/` flag, in order to make sure that the Makefile picks up the correct include and library files.
Scotch carries out a mapping of a source (or process) graph onto a target (or architecture) graph. The weighted source graph is generated by SWIFT and captures the computation and communication cost across the computational domain. The target graph defines the communication cost across the available computing architecture. Therefore, to make use of the Scotch mapping algorithms, a target architecture file (`target.tgt`) must be generated, and it should mirror the setup of the cluster being used. Scotch provides optimised architecture files which capture most HPC setups. For Cosma 8 runs we will be targeting NUMA regions, therefore we have modelled the architecture as a `tleaf` structure.
In the following examples it is assumed that one MPI rank is mapped to each Cosma 8 NUMA region. This requires that `cpus-per-task=16` is set in the SLURM submission script. The Cosma 8 nodes consist of 8 NUMA regions per node, with 4 NUMA regions per socket. Example `tleaf` files for various setups are given below, where the intra-socket communication cost between NUMA regions is set to 5, the intra-node but cross-socket cost is set to 10, and the inter-node cost is set to 1000. These weightings are estimated values but have been shown to give satisfactory results in the test cases explored. An estimate of the relative latency between NUMA regions on a node can be obtained with hwloc, specifically by using `hwloc-distances`.
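For example, the NUMA layout and distance matrix of a node can be inspected with the hwloc command-line tools (the exact output format depends on the hwloc version installed):
```bash
lstopo-no-graphics --no-io   # summary of sockets, NUMA nodes and cores
hwloc-distances              # relative latency matrix between NUMA nodes
```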
Example tleaf graphs to represent various Cosma8 configurations
Number of nodes | Number of MPI ranks | tleaf |
---|---|---|
1 | 2 | tleaf 1 2 5 |
1 | 8 | tleaf 2 2 10 4 5 |
2 | 16 | tleaf 3 2 1000 2 10 4 5 |
4 | 32 | tleaf 3 4 1000 2 10 4 5 |
8 | 64 | tleaf 3 8 1000 2 10 4 5 |
The first index denotes the number of layers of the `tleaf` structure and the subsequent index pairs are the number of vertices in a layer and the cost to communicate between them. For example, the numbers in the 64 MPI rank case (`tleaf 3 8 1000 2 10 4 5`) read, from left to right, as follows: `3` denotes the number of layers in the tleaf structure; `8` denotes the number of vertices in the first layer (which corresponds to 8 compute nodes); `1000` is the relative cost of inter-node communication; `2` denotes the number of vertices in the second layer (the number of sockets on each node); `10` is the relative cost of inter-socket communication; `4` denotes the number of vertices in the final layer (the number of NUMA regions per socket on Cosma 8); and finally `5` is the relative cost of intra-socket communication.
The user needs to define this `tleaf` structure and save it as `target.tgt` in the directory they will run SWIFT from. Ongoing work focuses on automatically generating this target architecture at run time.
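For example, for an 8-node (64 MPI rank) run, the target file can be written directly in the run directory:
```bash
# Target architecture for 8 Cosma 8 nodes (2 sockets per node, 4 NUMA regions per socket)
cat > target.tgt << 'EOF'
tleaf 3 8 1000 2 10 4 5
EOF
```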
With OpenMPI, the `mpirun` option `--map-by numa` has been found to be optimal. This is in contrast to the `--bind-to none` option previously suggested on the Cosma 8 site.
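A minimal SLURM submission sketch for the 8-node case is given below; the binary name `swift_mpi`, the partition name, the parameter file and the run-time flags are illustrative and should be adapted to your own build and simulation:
```bash
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8   # one MPI rank per NUMA region
#SBATCH --cpus-per-task=16    # 16 cores per NUMA region on Cosma 8
#SBATCH --partition=cosma8    # partition name assumed; use your own allocation

# Load the environment described above, then run from the directory
# containing target.tgt (here: tleaf 3 8 1000 2 10 4 5).
mpirun --map-by numa ./swift_mpi --threads=16 parameter_file.yml   # add your usual SWIFT run-time flags
```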
Scotch carries out the mapping using various strategies which are outlined in the documentation; the strategies trialled include `SCOTCH_STRATDEFAULT`, `SCOTCH_STRATBALANCE`, `SCOTCH_STRATQUALITY` and `SCOTCH_STRATSPEED`. The Scotch strategy string is passed to the mapping functions. For the runs carried out here it was found that the global flag `SCOTCH_STRATBALANCE` and an imbalance ratio of `0.05` worked best. These values are passed to `SCOTCH_stratGraphMapBuild`.
One issue with Scotch is that when the number of MPI ranks is comparable to the dimensionality of the modelled SWIFT system, the optimal mapping strategy does not necessarily map to all available NUMA regions. At present this is not handled explicitly in the code, and the partition reverts to a vectorised or previous partitioning.
The SWIFT edge and vertex weights are estimated in the code; however, the edge weights are not symmetric, and this causes an issue with SWIFT. Therefore, in the Scotch graph the edge weights are updated to be the average value (sum/2) of the two associated edge weights as calculated by SWIFT.
The following results were obtained on Cosma 8 running with the `SCOTCH_STRATBALANCE` strategy string and an imbalance ratio of `0.05`.
Testcase | Resources | flags | Node type | Scotch32 (s) | Scotch64 (s) | Scotch local (s) | ParMetis (s) | Metis (s) |
---|---|---|---|---|---|---|---|---|
EAGLE_006 | nodes = 1 (8 NUMA regions) | --map-by numa | Milan | 1191.8 | 1198.2 | 1173.6 | 1167.4 | 1176.4 |
-//- | -//- | -//- | Milan | 1176.7 | 1184.4 | 1193.8 | 1212.5 | 1182.1 |
-//- | -//- | -//- | Milan | 1174.5 | 1183.6 | 1175.2 | 1229.4 | 1180.7 |
-//- | -//- | -//- | Rome | 1368.8 | 1322.9 | 1351.5 | 1332.8 | 1334.9 |
-//- | -//- | -//- | Rome | 1378.3 | 1373.8 | 1353.4 | 1332.3 | 1346.8 |
-//- | -//- | -//- | Rome | 1367.3 | 1395.0 | 1361.0 | 1331.2 | 1330.8 |
-//- | nodes = 2 (16 NUMA regions) | -//- | Milan | 1191.2 | 1225.8 | 1167.6 | 1154.0 | 1159.7 |
-//- | -//- | -//- | Milan | 1147.9 | 1185.0 | 1158.2 | 1168.3 | 1163.4 |
-//- | -//- | -//- | Milan | 1162.0 | 1180.4 | 1149.3 | 1147.7 | 1157.3 |
-//- | -//- | -//- | Rome | 1358.3 | 1538.3 | 1325.5 | 1338.8 | 1344.8 |
-//- | -//- | -//- | Rome | 1355.8 | 1519.3 | 1338.2 | 1390.1 | 1336.8 |
-//- | -//- | -//- | Rome | 1347.1 | 1395.0 | 1336.0 | 1345.4 | 1338.7 |
EAGLE_025 | nodes = 2 (16 NUMA regions) | --map-by numa | Milan | 7546.8 | 7450.0 | 7564.7 | 7202.0 | 7302.3 |
-//- | -//- | -//- | Milan | 7508.8 | 7490.4 | 7506.6 | 7416.3 | 7291.3 |
-//- | -//- | -//- | Milan | 7447.6 | 7516.5 | 7548.1 | 7093.1 | 7293.0 |
-//- | -//- | -//- | Rome | 8616.0 | 8660.5 | 8524.5 | 8810.5 | 8309.2 |
-//- | -//- | -//- | Rome | 8652.4 | 8492.7 | 8594.7 | 7955.9 | 8312.3 |
-//- | -//- | -//- | Rome | 8664.1 | 8565.2 | 8621.9 | 7946.7 | 8235.6 |
EAGLE_050 | nodes = 4 (32 NUMA regions) | --map-by numa | Milan | 45437.0 | 45714.8 | 45287.9 | 45110.4 | |
-//- | -//- | -//- | Milan | 45817.5 | 45128.4 | 45047.3 | 42131.7 | |
-//- | -//- | -//- | Milan | 45483.4 | 45213.9 | 45219.3 | 43263.8 | |
-//- | -//- | -//- | Rome | 51754.3 | 54724.4 | 51907.1 | 51315.1 | |
-//- | -//- | -//- | Rome | 51669.3 | 54213.8 | 51320.5 | 48338.0 | |
-//- | -//- | -//- | Rome | 51689.4 | 53387.0 | 51563.9 | 49702.8 | |
EAGLE_050 | nodes = 8 (64 NUMA regions) | --map-by numa | Milan | 36202.3 | 37158.3 | 36111.7 | 37006.0 | |
-//- | -//- | -//- | Milan | 36097.3 | 36503.1 | 36228.0 | 37344.7 | |
-//- | -//- | -//- | Milan | 36113.3 | 36155.7 | 36222.0 | 35488.4 | |
-//- | -//- | -//- | Rome | 42012.6 | 41790.7 | 41864.3 | 41723.6 | |
-//- | -//- | -//- | Rome | 41866.8 | 41517.0 | 41772.9 | 40533.6 | |
-//- | -//- | -//- | Rome | 43630.5 | 41419.9 | 41752.4 | 40628.3 | |
Note:
The Cosma 8 cluster currently comprises:
* 360 compute nodes with 1 TB RAM and dual 64-core AMD EPYC 7H12 water-cooled processors at 2.6GHz ("Rome"-type nodes)
* Upgraded with an additional 168 nodes with dual Milan 7763 processors at 2.45GHz ("Milan"-type nodes)
- If you pass the command-line switch `--enable-debugging-checks` to `./configure` while building, be prepared for additional logging and dump files.
- For large simulations, the amount of data dumped can easily overflow one's quota allowance; for example, the EAGLE_100 models can dump files on the order of 1 TiB, depending on the runtime configuration, which is specified in a `.yml` file. Please check the SWIFT docs. To keep dumping and logging to a minimum, you can put the following sections / lines, for example, in a file `eagle_100.yml`:
```yaml
# Parameters governing the snapshots
Snapshots:
  select_output_on: 1
  select_output: output_list.yml
  basename: eagle             # Common part of the name of output files
  scale_factor_first: 0.91    # Scale-factor of the first snapshot (cosmological run)
  time_first: 0.01            # Time of the first output (non-cosmological run) (in internal units)
  delta_time: 1.01            # Time difference between consecutive outputs (in internal units)
  recording_triggers_part: [1.0227e-4, 1.0227e-5]   # Recording starts 100M and 10M years before a snapshot
  recording_triggers_bpart: [1.0227e-4, 1.0227e-5]  # Recording starts 100M and 10M years before a snapshot
```
and the contents of `output_list.yml` can be:
```yaml
Default:
  Standard_Gas: off
  Standard_DM: off
  Standard_DMBackground: off
  Standard_Stars: off
  Standard_BH: off
```
In the `eagle_100.yml` we can also specify that we do not want the restart dumps:
```yaml
Restarts:
  enable: 0
```
- As seen in the table above, the current Scotch implementation is comparable in performance to the ParMETIS (METIS) implementation on problem sizes up to EAGLE_050. However, the current implementation runs into difficulties on the EAGLE_100 testcase. The Scotch partition in this case causes two separate errors: a memory overflow when running across 8 Cosma 8 nodes, and, on 16 Cosma 8 nodes, a partition in which certain ranks have more than 64 MPI proxies, which is a hard limit set within SWIFT. Ongoing work is focused on resolving this issue.
- The current implementation has only been tested against periodic domains, i.e. where each vertex in SWIFT has exactly 26 neighbours. Additions to the edge-mean calculation in the `pick_scotch` function will be needed to support non-periodic domains.
- Implementing the parallel version, PT-Scotch, should improve performance for large node-count runs.
- Further improvement could be achieved by accurately obtaining, at run time, the cost of communicating across the resources provided by the scheduler. The above approach using the pre-generated `tleaf` file is an approximation; tools like netloc, which derive representative latency values from the network fabric, would be the optimal solution. To begin with, this would require admin access to run the setup commands that generate an overall graph of the whole HPC system. This graph structure would then be referenced at run time, together with the allocated node IDs, to build up an accurate representation of the available compute.