
Cray XC30 (HLRN III)


The HLRN-III Cray XC30 systems in Berlin and Hannover are described in detail here: HLRN-III Hardware Summary

In the following, a summary of how to compile and achieve good performance is presented. The situation is complicated by the fact that one has to analyze the combinations of parallelization and compiler (pure MPI or hybrid MPI+OpenMP; CRAYCC, ICC or GCC).

CRAYCC

The Cray compiler offers the most painless set-up because it is part of the default environment and seems to "just work". Whether it produces the fastest code remains to be seen, however!

Compiling

You first need to compile lime and lemon; both build without problems when CC=cc is specified in their configure steps.
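A minimal sketch of such a build (the source and installation paths below are placeholders and need to be adapted, cf. the --with-limedir/--with-lemondir paths used further down):

# placeholder paths: adapt to where the sources were unpacked and installed
cd $HOME/src/lime
./configure --prefix=$HOME/libs/cray/lime CC=cc
make && make install

cd $HOME/src/lemon
./configure --prefix=$HOME/libs/cray/lemon CC=cc
make && make install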

Pure MPI and Hybrid

There seems to be a peculiarity with the Cray compiler: passing --disable-omp causes the code to produce NaN and inf, even though in most cases only MPI parallelization will be used. For now, we therefore explicitly enable OpenMP at all times.

configure --enable-alignment=32 --enable-omp --disable-sse2 --disable-sse3 \
  --enable-halfspinor --enable-gaugecopy --with-mpidimension=4 \
  --with-limedir=/home/b/bbpkostr/libs/cray/lime \
  --with-lemondir=/home/b/bbpkostr/libs/cray/lemon \
  --with-lapack=-L/opt/cray/libsci/13.0.1/cray/83/ivybridge/lib \
  CC=cc F77=ftn CFLAGS="-O3 -hsystem_alloc" LDFLAGS="-hsystem_alloc"

Note that (checked 2014.12.18) on the Hannover partition it seems that you need to set:

CFLAGS="-O3 -hsystem_alloc" LDFLAGS="-hsystem_alloc"

It might be that the compiler driver in Berlin set this by default. (Update 17.02.2015: this now applies in Berlin as well.)

Running

Tests have been done mostly on 24^3x48 and 48^3x96 volumes on 64 and 384 nodes, respectively. An example job script for an HMC run is given below. Note the usage of the grid_order command to map the MPI tasks to the machine geometry. This is most important and can affect performance by up to 25%! Currently it seems that the code from the ETMC master branch, compiled with OpenMP support and run with 2 threads per MPI task, gives the highest performance with the Cray compiler (at least on these volumes).

The job script also demonstrates how to use OpenMP. Remember that you also need to specify ompnumthreads=2 in the input file.
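For example, the corresponding line in the hmc.input file would simply be the following (a sketch using the parameter name as spelled above; the value must match the thread count passed to aprun):

ompnumthreads = 2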

#!/bin/bash
#PBS -j oe
#PBS -N descriptive_jobname
# use any type of mpp node to reduce queue time, here you could also specify
# mpp2 for higher performance or mpp1 for smaller jobs
#PBS -l feature=mpp
#PBS -l nodes=384:ppn=24
#PBS -l walltime=12:00:00

CONTINUE=1
RUNDIR=$WORK/$PBS_JOBNAME
EXEC=$HOME/code/builds/cray/tmLQCD.master/hmc_tm
# note that for this, a subdirectory "outputs" needs to exist where you ran msub!
OFILE=$PBS_O_WORKDIR/outputs/$PBS_JOBNAME.$PBS_JOBID.out

if [ ! -d $RUNDIR ]; then
  mkdir -p $RUNDIR
fi

cd $RUNDIR

cp $PBS_O_WORKDIR/hmc_n384_ppn24.input $RUNDIR/hmc.input

module load perftools
# generate rank ordering for a (TXYZ) 24x16x2x12 parallelization
# using nodes that have two processors with 12 cores each
# for a total of 9216 MPI tasks, one task per core
grid_order -R -c 2,12 -g 24,16,2,12 > MPICH_RANK_ORDER
export MPICH_RANK_REORDER_METHOD=3
export OMP_NUM_THREADS=2

date > $OFILE
# -n 9216 MPI tasks 
# -d 2 threads per task ("depth")
# -j 2 (use hyperthreads as well, so each core has two slots)
# you could now also launch 18432 processes with one thread each, by specifying "-d 1"
aprun -n 9216 -d 2 -j 2 $EXEC >> $OFILE
# preserve aprun return value
RVAL=$?
date >> $OFILE

if [[ $RVAL -eq 0 && $CONTINUE -eq 1 ]]; then
  cd $PBS_O_WORKDIR
  msub n384_ppn24.job
fi

exit $RVAL
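Assuming the script above is saved as n384_ppn24.job (the name under which it resubmits itself), it would be submitted from the directory containing the outputs subdirectory, e.g.:

mkdir -p outputs
msub n384_ppn24.job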

Note that a parallelization using 6 threads per MPI task is also possible, with 8 tasks per node in a 2x4 configuration, giving nearly the same performance. For the situation above, one would then have:

# pretend we have nodes with 2 processors and 4 cores each, for a parallelization
# of 24x16x2x4
grid_order -R -c 2,4 -g 24,16,2,4 > MPICH_RANK_ORDER
export OMP_NUM_THREADS=6
aprun -n 3072 -d 6 -j 2 $EXEC >> $OFILE

This can be very useful to absorb that annoying factor of 3 and seems to provide a boost of about 25% for lattice sizes which do not include such a factor. In fact, for a 64^3x128 lattice, this kind of setup gives the best performance:

#PBS -l nodes=512:ppn=24
module load perftools
grid_order -R -c 2,8 -g 32,16,2,8 > MPICH_RANK_ORDER
export OMP_NUM_THREADS=3
export MPICH_RANK_REORDER_METHOD=3
aprun -n 8192 -d 3 -j 2 $EXEC >> $OFILE
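For reference, the slot accounting of this setup works out as follows:

# 512 nodes x 24 cores x 2 hyperthreads (-j 2) = 24576 execution slots
# 8192 MPI tasks x 3 OpenMP threads            = 24576 threads, i.e. the allocation is fully subscribed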

ICC

Compiling

This is not the default programming environment, so it needs to be loaded first. Furthermore, the Intel compiler is a bit temperamental in that it uses processor specializations throughout and requires the correct craype-[sandybridge/ivybridge/haswell] module for the CPU that the code will run on. This means that for running configure, which compiles and runs a number of small test programs, you need to load the CrayPE for Sandybridge (presumably because these test programs execute on the login nodes). Note that at the time of writing I have not tried out whether Haswell (MPP2) requires the haswell PE (it probably does).

module swap PrgEnv-cray PrgEnv-intel
module swap craype-ivybridge craype-sandybridge
./configure [...]
module swap craype-sandybridge craype-[ivybridge/haswell]
make

After building lime and lemon (keeping in mind that you have to switch the PE between the configure and make steps there as well), you can move on to configuring tmLQCD.
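As a rough sketch (untested: it simply mirrors the Cray configure line from above with placeholder library paths, drops the Cray-specific -hsystem_alloc flags, and omits the LAPACK linkage, which would have to be added via the Intel version of libsci or MKL), configuring and building under PrgEnv-intel might look like:

module swap PrgEnv-cray PrgEnv-intel
module swap craype-ivybridge craype-sandybridge
# placeholder lime/lemon paths: these libraries need to be rebuilt with the Intel PE
./configure --enable-alignment=32 --enable-omp --enable-halfspinor --enable-gaugecopy \
  --with-mpidimension=4 \
  --with-limedir=$HOME/libs/intel/lime --with-lemondir=$HOME/libs/intel/lemon \
  CC=cc F77=ftn CFLAGS="-O3"
module swap craype-sandybridge craype-[ivybridge/haswell]
make

Note that the cc and ftn compiler drivers wrap the Intel compilers once PrgEnv-intel is loaded, so CC=cc and F77=ftn remain correct.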

Pure MPI

GCC