SuperMUC docu
As far as I understand, the thin nodes run the Sandy Bridge-EP Intel Xeon E5-2680 8C processor, which supports AVX but not the fused multiply-add (FMA) instructions. The fat nodes do not support AVX, only SSE4. The login nodes also appear to be E5-2680, and indeed I get illegal-instruction errors when using FMA instructions.
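To check which vector extensions a particular node supports, one can inspect the CPU flags; a minimal sketch, assuming the usual Linux `/proc/cpuinfo` interface on the login and compute nodes:

```
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(avx|fma|sse4_[12])$'
```

On the thin and login nodes this should list `avx`, `sse4_1` and `sse4_2` but no `fma`, consistent with the illegal instructions mentioned above.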
I configured the latest `bgq_omp` branch of my tmLQCD fork with the following options (after `module load mkl`) to obtain a mixed MPI + OpenMP executable. I get the best results when using the Intel MPI library:
module unload mpi.ibm
module load mpi.intel/4.0
and then configure with
configure --enable-mpi --with-mpidimension=4 --with-limedir="$HOME/cu/head/lime-1.3.2/" --disable-sse2 --with-alignment=32 --disable-sse3 --with-lapack="$MKL_LIB" --disable-halfspinor --disable-shmem CC=mpicc CFLAGS="-O3 -axAVX -openmp" F77="ifort"
Here is an example LoadLeveler job file for IBM MPI:
#!/bin/bash
#
#@ job_type = parallel
#@ class = large
#@ node = 8
#@ total_tasks= 128
#@ island_count=1
### other example
##@ tasks_per_node = 16
##@ island_count = 1,18
#@ wall_clock_limit = 0:15:00
#@ job_name = mytest
#@ network.MPI = sn_all,not_shared,us
#@ initialdir = $(home)/cu/head/testrun/
#@ output = job$(jobid).out
#@ error = job$(jobid).err
#@ notification=always
#@ [email protected]
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
export MP_SINGLE_THREAD=no
export OMP_NUM_THREADS=2
# Pinning
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
mpiexec -n 128 ./benchmark
For the Intel MPI library use the following (note the different number of tasks per node and hence the different number of OpenMP threads!):
#!/bin/bash
#@ job_type=MPICH
#@ network.MPI=sn_all,not_shared,us
#@ class = large
#@ job_name=mytest
#@ output = $(job_name).$(jobid).out
#@ error = $(job_name).$(jobid).err
#@ node_topology = island
#@ wall_clock_limit = 0:15:00
#@ node=16
#@ tasks_per_node=2
#@ initialdir = $(HOME)/head/testrun/
#@ notification=always
#@ [email protected]
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
module load prace
module load mkl
module unload mpi.ibm
module load mpi.intel/4.0
### OpenMP variables
export OMP_NUM_THREADS=8
export OMP_DYNAMIC=FALSE
export SPINLOOPTIME=5000
export YIELDLOOPTIME=5000
export OMP_SCHEDULE=STATIC
## Intel MPI variables
export I_MPI_FABRICS=shm:ofa
export I_MPI_FALLBACK=0
export I_MPI_JOB_FAST_STARTUP=enable
mpiexec -n 32 /home/hpc/pr86ve/pr3da094/head/testrun/benchmark
It is expected that pure MPI will perform best on this machine, but it is worth trying a 2x8 hybrid approach with 2 processes per node and 8 threads per process (each node has two processors with 8 cores each). It might even be beneficial to run 16 threads per process, because the E5-2680 supports two hardware threads per core.
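As a rough, untested sketch of that last idea (the pinning variables below are my assumption of what the Intel MPI and Intel OpenMP runtimes accept, not something verified on SuperMUC), the Intel MPI job file above could be changed to keep 2 tasks per node but run 16 threads per task:

```
### untested sketch: 2 tasks per node, 16 OpenMP threads per task (using SMT)
export OMP_NUM_THREADS=16
export I_MPI_PIN_DOMAIN=socket                  # assumed: pin each MPI task to one socket
export KMP_AFFINITY=granularity=thread,compact  # assumed: pack threads onto the hardware threads of that socket
mpiexec -n 32 ./benchmark
```

Whether this actually helps (or whether hyperthreading is enabled at all, see the comment further down about a possible BIOS switch) remains to be tested.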
From the Zeuthen cluster - which has older CPUs at similar clock rates - it is expected that this machine should deliver around 3.5 GFlops per core with communication and around 5 GFlops per core without. It is also expected that the halfspinor version of the code will underperform node-locally but outperform with many MPI processes and non-node-local communication. In a pure MPI approach the halfspinor version should always be faster. (All values are for problems fitting into the cache.)
These CPUs have a large cache of 20MB so a node-local volume of up to 12^4 should not be a problem at all, giving OpenMP a lot of time to manage threads as the loops are quite large in such a configuration.
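As a back-of-the-envelope check (a sketch only, assuming the standard double-precision layout of 3x3 complex links and 4x3 complex spinor components), the memory footprint of a 12^4 volume is roughly:

```
V=$((12*12*12*12))                 # 20736 lattice sites
GAUGE=$((V * 4 * 18 * 8))          # 4 links/site, 3x3 complex = 18 doubles
SPINOR=$((V * 24 * 8))             # 4 spins x 3 colours, complex = 24 doubles
echo "gauge field:  $((GAUGE / 1024)) KiB"    # ~11.4 MiB
echo "spinor field: $((SPINOR / 1024)) KiB"   # ~3.8 MiB
```

So a node-local 12^4 volume (about 11 MiB of gauge links plus a few spinor fields) indeed fits comfortably into the combined 2x20MB of cache, while a 12^4 volume per MPI task, as used in some of the runs below, is already at the limit.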
The performance I got from a benchmark run with 128 tasks with 2 threads each on a 24^3x48 lattice (local lattice size 24x6x6x6) is:
- A comment on the local lattice size: the CPU has 20MB of (L3) cache and, if I understand correctly, you are running 8 processes per CPU. Therefore even your gauge field will not fit in the cache. Better to decrease the local lattice size by a factor of 2.
- Also, it is worth testing whether running 8 threads per task, two tasks per node wouldn't be faster.
- Finally, the OpenMP overhead might be so large on Intel that it makes more sense to simply run two processes per core!
- 1193 Mflops per core with communication
- 2443 Mflops per core without communication
With 256 tasks:
- 1294 Mflops per core with communication
- 2473 Mflops per core without communication
With 256 tasks and 4 threads each I get only 15 Mflops, so it is better to use 2 threads per core.
- I have a feeling that hyperthreading is disabled via a BIOS switch. On hyperthreading machines I always see a ~10% gain from thread oversubscription, so that's a bit of a shame. Also, interleaving should work much better with more threads. -Bartek (2013/05/08)
Using `--with-alignment=32` and `-axAVX` the performance is better (again with 256 tasks):
- 1455 Mflops per core with communication
- 3029 Mflops per core without communication
Using the halfspinor version again gives better performance:
- 1813 Mflops per core with communication
- 2926 Mflops per core without communication
Don't know what this is in % of peak performance right now.
Addendum (2013/05/08, Bartek) halfspinor, pure MPI, 24^3x48, 512 tasks on 32 nodes, IBM MPI:
- 2398 Mflop/s per core w/comm
- 3831 Mflop/s per core w/o comm
Newer performance data (7.12.2012) with halfspinor, `OMP_NUM_THREADS=2`, 24^3x48, 256 tasks on 16 nodes:
- 2274 Mflops per core with communication
- 3547 Mflops per core without communication
I tested `-axSSE4.2`, which makes the code slower. I also tested different affinity options, but `MP_TASK_AFFINITY=core:$OMP_NUM_THREADS` appears to be by far the fastest; otherwise performance drops to 30 Mflops. Also, using `gcc` instead of `icc` currently gives 12 Mflops per core...
Even newer results (10.12.2012) with halfspinor, `OMP_NUM_THREADS=8`, 24^3x48, 16 nodes, 2 tasks per node with the Intel MPI library (local volume 12^4, 32 MPI processes):
- 16245 Mflops per MPI task, 2030 Mflops per thread (w/ comm)
- 27581 Mflops per MPI task, 3447 Mflops per thread (w/o comm)
and with halfspinor, `OMP_NUM_THREADS=8`, 16^3x32, 16 nodes, 2 tasks per node with the Intel MPI library (local volume 8^4, 32 MPI processes):
- 13413 Mflops per MPI task, 1676 Mflops per thread (w/ comm)
- 31101 Mflops per MPI task, 3887 Mflops per thread (w/o comm)
Interleaving communication and computation should help a lot; however, it does not seem to work well at the moment. With the same parameters as before (24^3x48), but with interleaving, I get
- 13866 Mflops per MPI task, 1733 Mflops per thread (w/ comm)
- 24206 Mflops per MPI task, 3025 Mflops per thread (w/o comm)
So there is a large potential gain if we can get the MPI library to do a proper non-blocking send/recv.
Using 1 MPI process per node, i.e. `OMP_NUM_THREADS=16`, 16 nodes, 1 task per node for 24^3x48:
- 24084 Mflops per MPI task, 1505 Mflops per thread (w/ comm)
- 33256 Mflops per MPI task, 2078 Mflops per thread (w/o comm)
so slightly slower...
Addendum (2013/05/08, Bartek), 8 threads per task, 2 tasks per node, MPI_thread_overlap branch (based on InterleavedNDTwistedClover), 24^3x48, 64 tasks, 32 nodes, IBM MPI:
- 1940 Mflop/s per thread w/ comm
- 3082 Mflop/s per thread w/o comm
The E5-2680 has two hardware threads (SMT) per core, so this is not surprising.
Does --with-alignment=32 help? (we still need to make all the alignments independent of SSE2/SSE3 being defined)
On the machines in Zeuthen which are similar in clockspeed (2.67GHz but: 4 cores, 2 SMT, no AVX), I get over 5000 Mflops (nocomm) per core using the full-spinor code and over 4800 Mflops per core with half-spinor. I cannot look at MPI scaling but the local volume was the same during this test. The fact that your new half-spinor version is so fast is truly remarkable!
It was only a first shot, so I should also try alignment=32|64. Maybe also the other optimisation options, like no AVX and with AVXXX...
Also, you might have much better luck with 2 tasks per node and 8 or even 16 threads per task (since the nodes have - AFAIK - 2 sockets, and each CPU has 8 cores with dual SMT).
Some more comments relating to poor performance with OpenMP. On systems with more than one socket it is extremely important to "pin" threads to the CPUs, otherwise cross-socket memory access will occur, and this is slow (by up to a factor of 2). I don't know whether this is done automatically on SuperMUC or whether one has to proceed as on other machines and use a wrapper script, called by mpirun before the application runs, which sets some environment variables. In particular, `KMP_AFFINITY` needs to be set appropriately for each process separately. This should be discussed with the people responsible or looked up in the docu.
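A minimal sketch of such a wrapper, under the assumptions that Intel MPI exports the node-local rank in `MPI_LOCALRANKID`, that 2 tasks per node are used, and that cores 0-7 and 8-15 correspond to the two sockets (all of which should be checked against the SuperMUC docu):

```
#!/bin/bash
# pin.sh - hypothetical pinning wrapper, launched as:  mpiexec -n 32 ./pin.sh ./benchmark
LOCAL_RANK=${MPI_LOCALRANKID:-0}   # assumed: node-local rank exported by Intel MPI
NTHREADS=${OMP_NUM_THREADS:-8}

# hand the cores of socket 0 to the first task on the node, socket 1 to the second
FIRST=$((LOCAL_RANK * NTHREADS))
LAST=$((FIRST + NTHREADS - 1))
export KMP_AFFINITY="verbose,granularity=core,proclist=[$(seq -s, $FIRST $LAST)],explicit"

exec "$@"
```

With IBM MPI the `MP_TASK_AFFINITY=core:$OMP_NUM_THREADS` setting used in the job file above already takes care of the pinning, so a wrapper like this should only be relevant for the Intel MPI runs.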
../../tmLQCD/configure --prefix=/home/hpc/pr63po/lu64qov2/build/hmc_supermuc_mpi/ --enable-mpi --with-mpidimension=4 --enable-gaugecopy --enable-halfspinor --with-alignment=32 --disable-sse2 --enable-sse3 --with-limedir=/home/hlrb2/pr63po/lu64qov2/build/lime_supermuc/install CC=mpicc CFLAGS="-O3 -axAVX" --with-lapack="$MKL_LIB" F77=ifort
24^3x48 lattice, 8x8x2x2 parallelization, OMP_NUM_THREADS=2:
- `--enable-omp --disable-sse3 CC="-axAVX"`: comm: 1502 Mflops, nocomm: 2733 Mflops
- `--enable-omp --enable-sse3 CC="-axAVX"`: comm: 1562 Mflops, nocomm: 2784 Mflops
- `--enable-omp --enable-sse3`: comm: 1435 Mflops, nocomm: 2277 Mflops
- `--disable-omp --enable-sse3 CC="-axAVX"`: comm: 1575 Mflops, nocomm: 2824 Mflops