Juropatest
The prototype for the replacement of Juropa can currently be accessed via juropatest.fz-juelich.de.
The system is equipped with 70 compute nodes, each of which has two 14-core (!) Xeon E5-2695 v3 processors at 2.3 GHz with support for FMA and simultaneous multithreading (Hyper-Threading). As a result, the theoretical peak performance per core is 18.4 GFlop/s (but this is a very synthetic number).
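For reference, this figure corresponds to counting one 256-bit FMA per cycle and per core (an assumption, shown only to make the origin of the number explicit): 4 double-precision operands * 2 flops per FMA * 2.3 GHz = 18.4 GFlop/s.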
The first step is to load the Intel environment, which provides the Intel compiler.
$ module load intel
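To check that the expected toolchain is actually picked up (a quick sanity check; the exact module names on the system may differ):
$ module list
$ which mpicc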
To configure tmLQCD (in pure MPI mode and with 4D parallelization), we proceed as follows:
$CODEPATH/configure --disable-omp --enable-mpi --with-mpidimension=4 --enable-alignment=32
--with-lapack="-L/usr/local/software/juropatest/Stage1/software/MPI/intel/2015.0.090/impi/5.0.1.035/imkl/11.2.0.090/mkl/lib/intel64
-lmkl_blas95_lp64 -lmkl_avx2 -lmkl_core -lmkl_sequential -lmkl_intel_lp64"
--with-limedir=$yourlimedir --with-lemondir=$yourlemondir --disable-sse2 --disable-sse3 --enable-gaugecopy
--enable-halfspinor CC=mpicc CFLAGS="-fma -axCORE-AVX2 -O3 -std=c99" F77=ifort
and in one line for easy copying:
$CODEPATH/configure --disable-omp --enable-mpi --with-mpidimension=4 --enable-alignment=32 --with-lapack="-L/usr/local/software/juropatest/Stage1/software/MPI/intel/2015.0.090/impi/5.0.1.035/imkl/11.2.0.090/mkl/lib/intel64 -lmkl_blas95_lp64 -lmkl_avx2 -lmkl_core -lmkl_sequential -lmkl_intel_lp64" --with-limedir=$yourlimedir --with-lemondir=$yourlemondir --disable-sse2 --disable-sse3 --enable-gaugecopy --enable-halfspinor CC=mpicc CFLAGS="-fma -axCORE-AVX2 -O3 -std=c99" F77=ifort
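As a rough sketch of the full build, assuming $CODEPATH points to your tmLQCD checkout and $yourlimedir / $yourlemondir to your lime / lemon installation prefixes (all three are placeholders for your own paths):
$ export CODEPATH=$HOME/tmLQCD
$ export yourlimedir=$HOME/libs/lime
$ export yourlemondir=$HOME/libs/lemon
$ mkdir -p $HOME/builds/tmlqcd_mpi && cd $HOME/builds/tmlqcd_mpi
$ $CODEPATH/configure ...   # the configure call from above
$ make -j 8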
The pesky factor of 7 in the number of cores means that it might make sense to use the hybrid (MPI+OpenMP) code, with or without overlapping of communication and computation (to be tested!). The overlapping version does three volume loops and is therefore not necessarily faster, because of this extra overhead.
The overlapping mode seems to be somewhat slower due to increased overheads, and there has not yet been time to look into the reasons why this is the case. In order to use it, you need to grab the InterleavedNDTwistedClover branch from github.com/urbach/tmLQCD and configure the code like so:
$CODEPATH/configure --enable-omp --enable-mpi --with-mpidimension=4 --enable-alignment=32 --enable-threadoverlap --with-lapack="-L/usr/local/software/juropatest/Stage1/software/MPI/intel/2015.0.090/impi/5.0.1.035/imkl/11.2.0.090/mkl/lib/intel64 -lmkl_blas95_lp64 -lmkl_avx2 -lmkl_core -lmkl_intel_thread -lmkl_intel_lp64" --with-limedir=$YOURLIMEPATH --with-lemondir=$YOURLEMONPATH --disable-sse2 --disable-sse3 --enable-gaugecopy --enable-halfspinor CC=mpicc CFLAGS="-fma -axCORE-AVX2 -O3 -fopenmp -std=c99" F77=ifort LDFLAGS=-fopenmp
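To obtain that branch (a sketch; running autoconf to generate the configure script is assumed to be necessary, as for the other branches):
$ git clone https://github.com/urbach/tmLQCD.git
$ cd tmLQCD
$ git checkout InterleavedNDTwistedClover
$ autoconf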
For the non-overlapping hybrid code, use the master branch of github.com/etmc/tmLQCD and configure it like this:
$CODEPATH/configure --enable-omp --enable-mpi --with-mpidimension=4 --enable-alignment=32 --with-lapack="-L/usr/local/software/juropatest/Stage1/software/MPI/intel/2015.0.090/impi/5.0.1.035/imkl/11.2.0.090/mkl/lib/intel64 -lmkl_blas95_lp64 -lmkl_avx2 -lmkl_core -lmkl_intel_thread -lmkl_intel_lp64" --with-limedir=$YOURLIMEPATH --with-lemondir=$YOURLEMONPATH --disable-sse2 --disable-sse3 --enable-gaugecopy --enable-halfspinor CC=mpicc CFLAGS="-fma -axCORE-AVX2 -O3 -fopenmp -std=c99" F77=ifort LDFLAGS=-fopenmp
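In both hybrid cases the number of OpenMP threads per MPI process and their pinning are controlled via the usual environment variables (a sketch; the affinity setting is an assumption and should be benchmarked):
$ export OMP_NUM_THREADS=14
$ export KMP_AFFINITY=compact,granularity=thread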
Each node of this machine can host up to 56 MPI processes (14 * 2 * 2) because of Hyper-Threading. Even if you don't have a factor of 7 in your lattice geometry, it might make sense to simply leave a couple of hardware threads unused and run 48 or 55 processes per node, but which code is fastest really needs to be tested on a case-by-case basis. In some cases, however, it might be easiest to absorb the factor of 7 using OpenMP threads and run 2, 4 or 8 processes per node with 28, 14 or 7 threads per process respectively. Whether the overlapping or the non-overlapping code gives the best performance must also be tested case by case.
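To illustrate one of the layouts mentioned above (4 MPI processes with 14 threads each, i.e. 56 hardware threads per node), a hypothetical batch script, assuming the batch system is Slurm and with placeholder node count, walltime, executable and input file, might look like this:
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=14
#SBATCH --time=01:00:00

module load intel
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun ./hmc_tm -f hmc.input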