Cray XC30 (HLRN III)
The HLRN-III Cray XC30 systems in Berlin and Hanover are described in detail here: HLRN-III Hardware Summary
In the following, a summary of how to compile and achieve good performance is presented. The situation is complicated by the fact that one has to evaluate all combinations of parallelization and compiler: pure MPI or hybrid MPI+OpenMP, built with either the Intel compiler (icc) or the Cray compiler.
Cray compiler
The Cray compiler offers the most painless set-up, since it is part of the default environment and seems to "just work". Whether it produces the fastest code remains to be seen, however!
You first need to compile lime and lemon, which build without problems when you specify CC=cc in their configure steps.
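For concreteness, the build steps might look like the following sketch. The install prefix is a placeholder, not a system path from this guide; adjust it to your own directory layout (compare the --with-limedir/--with-lemondir paths in the tmLQCD configure line below):

```shell
# Hypothetical install prefix -- choose your own location.
PREFIX=$HOME/libs/cray

# Build lime and lemon with the Cray compiler driver.
cd lime  && ./configure CC=cc --prefix=$PREFIX/lime  && make && make install
cd ../lemon && ./configure CC=cc --prefix=$PREFIX/lemon && make && make install
```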
There seems to be a peculiarity with the Cray compiler: passing --disable-omp
causes the code to produce NaN and inf values, even though in most cases only MPI parallelization will be used. For now, therefore, we explicitly enable OpenMP at all times.
configure --enable-alignment=32 --enable-omp --disable-sse2 --disable-sse3 --enable-halfspinor --enable-gaugecopy --with-mpidimension=4 --with-limedir=/home/b/bbpkostr/libs/cray/lime --with-lemondir=/home/b/bbpkostr/libs/cray/lemon --with-lapack=-L/opt/cray/libsci/13.0.1/cray/83/ivybridge/lib CC=cc F77=ftn CFLAGS="-O3 -hsystem_alloc" LDFLAGS="-hsystem_alloc"
Note that (checked 2014-12-18) on the Hanover partition you need to set:
CFLAGS="-O3 -hsystem_alloc" LDFLAGS="-hsystem_alloc"
It may be that the compiler driver in Berlin sets this by default. (Update 2015-02-17: this is now also required in Berlin.)
Tests have been done mostly on 24^3x48 and 48^3x96 volumes, on 64 and 384 nodes respectively. An example job script for an HMC run is given below. Note the use of the grid_order
command to map the MPI tasks onto the machine geometry: this is most important and can affect performance by up to 25%! Currently it seems that the code from the ETMC master branch, compiled with OpenMP support and run with 2 threads per MPI task, gives the highest performance when using the Cray compiler (at least on these volumes).
The job script also demonstrates how to use OpenMP. Remember that you also need to specify ompnumthreads=2
in the input file.
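For reference, the corresponding input-file fragment would be a single line using the parameter name quoted above (check the tmLQCD input-file documentation for the exact syntax of the surrounding sections):

```
ompnumthreads = 2
```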
#!/bin/bash
#PBS -j oe
#PBS -N descriptive_jobname
# use any type of mpp node to reduce queue time, here you could also specify
# mpp2 for higher performance or mpp1 for smaller jobs
#PBS -l feature=mpp
#PBS -l nodes=384:ppn=24
#PBS -l walltime=12:00:00
CONTINUE=1
RUNDIR=$WORK/$PBS_JOBNAME
EXEC=$HOME/code/builds/cray/tmLQCD.master/hmc_tm
# note that for this, a subdirectory "outputs" needs to exist where you ran msub!
OFILE=$PBS_O_WORKDIR/outputs/$PBS_JOBNAME.$PBS_JOBID.out
if [ ! -d "$RUNDIR" ]; then
  mkdir -p "$RUNDIR"
fi
cd "$RUNDIR"
cp "$PBS_O_WORKDIR/hmc_n384_ppn24.input" "$RUNDIR/hmc.input"
module load perftools
# generate rank ordering for a (TXYZ) 24x16x2x12 parallelization
# using nodes that have two processors with 12 cores each
# for a total of 9216 MPI tasks, one task per core
grid_order -R -c 2,12 -g 24,16,2,12 > MPICH_RANK_ORDER
export MPICH_RANK_REORDER_METHOD=3
export OMP_NUM_THREADS=2
date > $OFILE
# -n 9216 MPI tasks
# -d 2 threads per task ("depth")
# -j 2 (use hyperthreads as well, so each core has two slots)
# you could now also launch 18432 processes with one thread each, by specifying "-d 1"
aprun -n 9216 -d 2 -j 2 $EXEC >> $OFILE
# preserve aprun return value
RVAL=$?
date >> $OFILE
if [[ $RVAL -eq 0 && $CONTINUE -eq 1 ]]; then
cd $PBS_O_WORKDIR
msub n384_ppn24.job
fi
exit $RVAL
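As a sanity check, the grid_order geometry, the node count, and the aprun parameters in this script should be mutually consistent: the four grid dimensions must multiply to the number of MPI tasks, and tasks per node times threads per task must fit into the hardware thread slots per node (24 cores x 2 hyperthreads with -j 2). A quick arithmetic check in the shell:

```shell
# Geometry 24x16x2x12 from the grid_order call above.
tasks=$((24 * 16 * 2 * 12))
echo "$tasks"            # 9216, must match 'aprun -n 9216'
echo $((tasks / 384))    # 24 MPI tasks per node (384 nodes, ppn=24)
echo $((24 * 2))         # 48 thread slots per node; 24 tasks x 2 threads fills them
```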
Note that a parallelization using 6 threads per MPI task is also possible, with 8 tasks per node in a 2x4 configuration, giving almost comparable performance. For the situation above, one would then have:
# pretend we have nodes with 2 processors and 4 cores each, for a parallelization
# of 24x16x2x4
grid_order -R -c 2,4 -g 24,16,2,4 > MPICH_RANK_ORDER
export OMP_NUM_THREADS=6
aprun -n 3072 -d 6 -j 2 $EXEC >> $OFILE
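Again the numbers are consistent: the 24x16x2x4 geometry gives 3072 tasks, i.e. 8 per node on 384 nodes, and 8 tasks x 6 threads exactly fills the 48 hardware thread slots per node:

```shell
tasks=$((24 * 16 * 2 * 4))   # geometry from the grid_order call above
echo "$tasks"                # 3072, must match 'aprun -n 3072'
echo $((tasks / 384))        # 8 MPI tasks per node
echo $((8 * 6))              # 48 threads per node = 24 cores x 2 hyperthreads (-j 2)
```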
This can be very useful to absorb an awkward factor of 3 in the lattice dimensions, and it seems to provide a boost of about 25% for lattice sizes which do not contain such a factor. In fact, for a 64x128 lattice, the following setup gives the best performance:
#PBS -l nodes=512:ppn=24
module load perftools
grid_order -R -c 2,8 -g 32,16,2,8 > MPICH_RANK_ORDER
export OMP_NUM_THREADS=3
export MPICH_RANK_REORDER_METHOD=3
aprun -n 8192 -d 3 -j 2 $EXEC >> $OFILE
Intel compiler
The Intel programming environment (PrgEnv-intel) is not the default, so it needs to be loaded first. Further, the Intel compiler is a bit temperamental in that it uses processor specializations throughout and requires the correct craype-[sandybridge/ivybridge/haswell] module for the CPU that the code will run on. This means that for running configure, which compiles and runs a number of small test programs, you need to load the CrayPE matching the nodes on which configure runs, i.e. Sandy Bridge. Note that at the time of writing it has not been tested whether Haswell (MPP2) requires the haswell PE (it probably does).
module swap PrgEnv-cray PrgEnv-intel
module swap craype-ivybridge craype-sandybridge
./configure [...]
module swap craype-sandybridge craype-[ivybridge/haswell]
make
After building lime and lemon (keeping in mind that you have to switch the PE between the configure and make steps), you can move on to configuring tmLQCD.