!----------------------------------------------------------------------------!
!                                                                            !
! Fortran Coarray Micro-Benchmark Suite - Version 1.0                        !
!                                                                            !
! David Henty, EPCC; [email protected]                                      !
!                                                                            !
! Copyright 2013 the University of Edinburgh                                 !
!                                                                            !
! Licensed under the Apache License, Version 2.0 (the "License");            !
! you may not use this file except in compliance with the License.           !
! You may obtain a copy of the License at                                    !
!                                                                            !
!     http://www.apache.org/licenses/LICENSE-2.0                             !
!                                                                            !
! Unless required by applicable law or agreed to in writing, software        !
! distributed under the License is distributed on an "AS IS" BASIS,          !
! WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.   !
! See the License for the specific language governing permissions and       !
! limitations under the License.                                             !
!                                                                            !
!----------------------------------------------------------------------------!


License
-------

This software is released under the license in "LICENSE.txt".


References
----------

D. Henty, "Performance of Fortran Coarrays on the Cray XE6", in Proceedings
of Cray User Group 2012.
https://cug.org/proceedings/attendee_program_cug2012/includes/files/pap181.pdf

D. Henty, "A Parallel Benchmark Suite for Fortran Coarrays", in Applications,
Tools and Techniques on the Road to Exascale Computing (IOS Press, 2012),
pp. 281-288.


Introduction
------------

This set of benchmarks aims to measure the performance of various parallel
operations involving Fortran coarrays. These include point-to-point
("ping-pong") data transfer patterns, synchronisation patterns and
halo-swapping for 3D arrays.


Installation
------------

o Unpack the tar file.

o Select the required benchmarks by editing "cafparams.f90".

o Compile using "make". The supplied Makefile is configured for the Cray
  compiler - you will have to set "FC", "FFLAGS", "LDFLAGS" and "LIBS"
  appropriately for a different compiler. Note that the benchmark uses MPI
  as well as Fortran coarrays.


Execution
---------

The executable "cafbench" runs stand-alone without any flags or input files.
You will have to launch it as appropriate on your parallel system, eg on a
Cray: "aprun -n <numimages> ./cafbench".


Benchmarks
----------

The benchmark has three separate sections:

o Point-to-point reports the latency and bandwidth (including any
  synchronisation overheads).

o Synchronisation reports the overhead by performing calculations with and
  without synchronisation and subtracting the two times.

o Halo reports the time and bandwidth for regular halo swapping in a 3D
  pattern.

In all cases the basic data types are double precision numbers.


Point-to-point notes
--------------------

The point-to-point benchmarks use both remote read and remote write. All data
patterns are characterised by three parameters: count, blksize and stride.
The data transferred is "count" separate blocks, each of size "blksize",
separated by "stride". We also print out "ndata" (the amount of data actually
sent, ie count*blksize) and "nextent", which is the distance between the
first and last data items (larger than "ndata" for strided patterns). All
data arrays contain double precision numbers.

The same pattern may often be realised in several different ways (eg inline
or via a subroutine) to test the robustness of the compiler. This might seem
unnecessary, but in early compiler releases seemingly similar expressions
have given very different performance, eg x(1:ndata) was much slower than
x(:).
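
For readers unfamiliar with the syntax, the following self-contained program
is a minimal sketch (not code from the suite) of the kind of inline
contiguous put that the benchmarks time, here between image 1 and the last
image with global synchronisation:

   program putsketch

     implicit none

     ! ndata is chosen arbitrarily for illustration
     integer, parameter :: ndata = 1024
     double precision   :: x(ndata)[*]
     integer            :: image2

     image2 = num_images()     ! partner image; run with at least 2 images
     x = dble(this_image())

     sync all

     if (this_image() == 1) then
        x(1:ndata)[image2] = x(1:ndata)     ! remote write (put)
     end if

     sync all

     if (this_image() == image2) then
        write(*,*) 'first element received: ', x(1)
     end if

   end program putsketch

The real benchmark wraps such transfers in timing loops and repeats them many
times, as described below.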

The word "many" indicates more than one remote operation or more than one
call to a subroutine (subroutines are indicated by "sub"), although a good
compiler may merge these into a single operation.

Synchronisation is included in the timings. Many repetitions are done to get
sensible results, and the sync is also done many times. Synchronisation can
be global (sync all) or point-to-point (sync images).

Ping-pongs are done in three ways:

o "Single ping-pong" between images 1 and numimages. All other images are
  idle (except that they must call "sync all" for global synchronisation).

o "Multiple ping-pong", where _all_ images are active. Image "i" is paired
  with image "i+numimages/2"; note this only takes place for even numbers of
  images greater than 2. This can give significantly different bandwidths
  depending on the choice of synchronisation (see "crossing" below).

o "Multiple crossing ping-pong", which is as above except that every other
  pair swaps in the opposite direction, ie if image 1 is sending to
  1+numimages/2, then image 2 is receiving from 2+numimages/2. This ensures
  that we exploit the bidirectional bandwidth. Note that in practice this is
  the same as "multiple" if you use point-to-point synchronisation: in that
  case the pairs of images naturally get out of sync as they contend for
  bandwidth. However, for global synchronisation this realises a different
  pattern from "multiple". This test is really there to explain any
  differences seen in "multiple" for different choices of synchronisation.

The patterns are as follows - all except "MPI Send" are replicated for get
(remote read):

"put"

  Contiguous put done inline:

    x(1:ndata)[image2] = x(1:ndata)

"subput"

  Contiguous put done via a subroutine with target = source = x and disp = 1,
  count = ndata, ie (a sketch of such a subroutine is given after this list):

    target(disp:disp+count-1)[image] = source(disp:disp+count-1)

"simple subput"

  As above but with simpler arguments to the subroutine:

    target(:)[image] = source(:)

"all put"

  Arrays are allocated to be of size ndata and a simple put is done inline.
  This is like "simple subput" above, except that there the arrays are
  implicitly resized via a subroutine call. Code is:

    x(:)[image2] = x(:)

"many put"

  A contiguous put done as "count" separate puts, each of size "blksize":

    do i = 1, count
       x(1+(i-1)*blksize:i*blksize)[image2] = &
            x(1+(i-1)*blksize:i*blksize)
    end do

"sub manyput"

  Exactly as "many put" but done in a separate subroutine:

    do i = 1, count
       target(disp+(i-1)*blksize:disp+i*blksize-1)[image] = &
            source(disp+(i-1)*blksize:disp+i*blksize-1)
    end do

"many subput"

  The same pattern but with many separate invocations of "subput":

    do i = 1, count
       call cafput(x, x, 1+(i-1)*blksize, blksize, image1)
    end do

"strided put"

  Strided data done inline in the code:

    x(1:nextent:stride)[image2] = x(1:nextent:stride)

"strided subput"

  As above but done via a subroutine:

    target(istart:istop:stride)[image] = source(istart:istop:stride)

"strided many put"

  The most complex pattern: strided, but with blocks larger than a single
  unit. The pattern is a block of data, followed by a gap of the same size,
  repeated:

    do i = 1, count
       x(1+2*(i-1)*blksize:(2*i-1)*blksize)[image2] = &
            x(1+2*(i-1)*blksize:(2*i-1)*blksize)
    end do

  This pattern is a useful measurement in cases where the compiler vectorises
  "many put" into a single put of size ndata.

"MPI Send"

  A regular MPI ping-pong with no coarray synchronisation, done as a sanity
  check for the coarray performance numbers.
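
The suite's actual put subroutine may differ in detail, but based on the call
shown under "many subput" above (target, source, displacement, count, image),
a minimal sketch of such a "subput" helper could look like this:

   module cafput_mod

     implicit none

   contains

     ! Hypothetical sketch of a put-via-subroutine helper: write "count"
     ! elements of "source", starting at offset "disp", into the coarray
     ! "target" on the given remote image.

     subroutine cafput(target, source, disp, count, image)

       integer,          intent(in)    :: disp, count, image
       double precision, intent(inout) :: target(*)[*]
       double precision, intent(in)    :: source(*)

       target(disp:disp+count-1)[image] = source(disp:disp+count-1)

     end subroutine cafput

   end module cafput_mod

Passing the data through a subroutine in this way is one of the alternative
realisations used to test the robustness of the compiler, as described above.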

Synchronisation notes
---------------------

The different synchronisation types are:

o sync all: a simple call to "sync all".

o sync mpi barrier: an MPI barrier call, for comparison with "sync all"
  above.

o sync images pairwise: each image calls "sync images" with a single
  neighbour; images are paired up in the same pattern as for "Multiple
  ping-pong" above.

o sync images random: each image calls "sync images" with N neighbours
  chosen randomly (to ensure that they all match up, we actually set up a
  simple ring pattern and then randomly permute it). N is chosen as 2, 4,
  6, ..., syncmaxneigh, capped if this starts to exceed the total number of
  images. The default syncmaxneigh is 12.

o sync images ring: each image calls "sync images" with N neighbours, paired
  as image +/- 1, image +/- 2, ..., image +/- syncmaxneigh/2, with periodic
  boundary conditions.

o sync images 3d grid: each image calls "sync images" with 6 neighbours,
  chosen as the up and down neighbours in each direction of a 3D Cartesian
  grid (with periodic boundaries). The dimensions of the 3D grid are selected
  via a call to MPI_Dims_create (suitably reversed for Fortran indexing).
  This is precisely the synchronisation pattern used in the subsequent halo
  benchmark.

o sync lock: all images lock a variable on image 1.

o sync critical: all images execute a critical region.

Note that in all of these cases the time for some computation (a simple delay
loop) is compared to the time for the computation plus synchronisation, and
the two are subtracted to give the synchronisation overhead.
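
The suite's own timing code is more careful, but the following self-contained
program is a minimal sketch of this subtraction approach for the "sync all"
case (the repetition count, delay length and timer are arbitrary choices for
illustration):

   program syncsketch

     implicit none

     integer, parameter :: nrep = 1000, ndelay = 10000
     integer            :: irep
     double precision   :: t0, t1, tdelay, tsync

     ! Time the delay loop on its own
     sync all
     t0 = walltime()
     do irep = 1, nrep
        call delay(ndelay)
     end do
     t1 = walltime()
     tdelay = (t1 - t0)/dble(nrep)

     ! Time the delay loop plus a global synchronisation
     sync all
     t0 = walltime()
     do irep = 1, nrep
        call delay(ndelay)
        sync all
     end do
     t1 = walltime()
     tsync = (t1 - t0)/dble(nrep) - tdelay

     if (this_image() == 1) then
        write(*,*) 'time per delay loop (s):         ', tdelay
        write(*,*) 'estimated sync all overhead (s): ', tsync
     end if

   contains

     function walltime() result(t)
       double precision :: t
       integer :: c, crate
       call system_clock(count=c, count_rate=crate)
       t = dble(c)/dble(crate)
     end function walltime

     subroutine delay(n)
       integer, intent(in) :: n
       integer :: i
       double precision, save :: s = 0.0d0
       do i = 1, n
          s = s + dble(i)
       end do
       ! Condition is never true; it just stops the loop being optimised away
       if (s < 0.0d0) write(*,*) s
     end subroutine delay

   end program syncsketch

The same approach is used for all of the synchronisation patterns listed
above, with "sync all" replaced by the appropriate construct.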