Blue Gene Performance (old)

Dependence on the local volume

I just did a series of runs with bg_size = 64. Performance is measured per core; V_L is the core-local volume (how it follows from the process grid is sketched after the list), with 4 threads per MPI process and 16 MPI processes per node:

  • V_L: 12 x 4 x 4 x 4; nocomm: 2129 Mflops; comm: 693 Mflops
  • V_L: 12 x 8 x 4 x 4; nocomm: 1996 Mflops; comm: 812 Mflops
  • V_L: 12 x 8 x 8 x 8; nocomm: 1144 Mflops; comm: 823 Mflops
  • V_L: 3 x 8 x 16 x 16; nocomm: 1046 Mflops; comm: 620 Mflops
  • V_L: 4 x 8 x 16 x 16; nocomm: 1125 Mflops; comm: 770 Mflops
  • V_L: 8 x 8 x 16 x 16; nocomm: 977 Mflops; comm: 811 Mflops
  • V_L: 16 x 16 x 16 x 8; nocomm: 984 Mflops; comm: 776 Mflops
  • V_L: 6 x 16 x 8 x 8; nocomm: 1135 Mflops; comm: 838 Mflops
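
For reference, the core-local volume V_L is just the global lattice divided by the 4-dimensional process grid. A minimal sketch of that bookkeeping follows, using the 48^3 x 96 lattice from the runs further down; NrXProcs, NrYProcs and NrZProcs mirror the input parameters quoted later, while NrTProcs is simply the remaining factor of the total rank count and the grid itself is a hypothetical example:

#include <stdio.h>

/* Minimal sketch: core-local volume for a given global lattice and
 * 4-dimensional process grid.  With 16 MPI processes per node (one per
 * core) and 4 threads per process, the process-local volume coincides
 * with the core-local volume V_L quoted above. */
int main(void) {
  /* example: global 48^3 x 96 lattice on a hypothetical 16x8x8x8 grid,
   * i.e. 8192 ranks = 512 nodes at 16 ranks per node */
  const int T = 96, LX = 48, LY = 48, LZ = 48;
  const int NrTProcs = 16, NrXProcs = 8, NrYProcs = 8, NrZProcs = 8;

  const int lT = T / NrTProcs, lX = LX / NrXProcs,
            lY = LY / NrYProcs, lZ = LZ / NrZProcs;

  printf("V_L = %d x %d x %d x %d = %d sites per core\n",
         lT, lX, lY, lZ, lT * lX * lY * lZ);
  return 0;
}

With this grid the sketch prints V_L = 6 x 6 x 6 x 6, which is the decomposition quoted for the 48^3 x 96 runs below.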

Now I'm doing the same with the halfspinor version and get:

  • V_L: 12 x 4 x 4 x 4; nocomm: 2321 Mflops; comm: 1395 Mflops
  • V_L: 12 x 8 x 4 x 4; nocomm: 2127 Mflops; comm: 1410 Mflops

For global volume V = 48^3x96 on bg_size=256,512,1024 I obtain:

  • V_L: 12x6x6x6; nocomm: 1040 Mflops; comm: 786 Mflops
  • V_L: 6x6x6x6; nocomm: 2169 Mflops; comm: 1413 Mflops
  • V_L: 6x6x6x6; nocomm: 2174 Mflops; comm: 1410 Mflops [MESH]
  • V_L: 6x6x6x6; nocomm: 2438 Mflops; comm: 1703 Mflops [sloppy precision]
  • V_L: 3x6x6x6; nocomm: 2152 Mflops; comm: 1059 Mflops

For global volume V = 64^3x128 on bg_size=512,1024,2048 I obtain:

  • V_L = 32x8x4x4; nocomm: 891 Mflops; comm: 688 Mflops
  • V_L = 4x8x8x8; nocomm: 1661 Mflops; comm: 1102 Mflops
  • V_L = 16x8x4x4; nocomm: 1569 Mflops; comm: 1047 Mflops
  • V_L = 8x8x4x4; nocomm: 2044 Mflops; comm: 1236 Mflops
  • V_L = 4x4x8x8; nocomm: 2068 Mflops; comm: 965 Mflops
  • V_L = 8x4x8x4; nocomm: 2049 Mflops; comm: 1032 Mflops

OpenMP on node, MPI between nodes, halfspinor, V = 24^3 x 48

  • V_L = 12x12x12x12; bg_size=32, 32 MPI; nocomm: 2132 Mflops; comm: 1508 Mflops

  • V_L = 12x12x12x12; bg_size=32, 32 MPI; nocomm: 2236 Mflops; comm: 1620 Mflops (commit 92c4e1fd829316130b5706fa170c172e50396ade)

  • Okay, wow, +120 Mflops just from fusing! Maybe Francesco's version is worth a try!? Unfortunately I cannot reproduce it on a 48^3x96 lattice with the same local volume; there I only get 1274 Mflops per core (2257 without communication).

  • Well, seeing that the nocomm result is the same, we're hitting the communication limit there. -Bartek

  • But the amount to be communicated is the same...!?

  • Hmm, now I get the same as you on the bigger volume too, so maybe it's a mapping thing?

OpenMP on node, MPI between nodes, halfspinor, compare branch NewHopping

  • V_L = 12x12x12x12; bg_size=512, 512 MPI; nocomm: 2142 Mflops; comm: 1576 Mflops [NewHopping]
  • V_L = 12x12x12x12; bg_size=512, 512 MPI; nocomm: 2250 Mflops; comm: 1645 Mflops
  • V_L = 12x12x12x12; bg_size=32, 32 MPI; nocomm: 2213 Mflops; comm: 1741 Mflops [SPI, no interleaving]
  • V_L = 12x12x12x12; bg_size=32, 32 MPI; nocomm: 2148 Mflops; comm: 2093 Mflops [NewHopping+SPI interleaving]

Unfortunately, the code is not yet quite correct, i.e. the solver doesn't converge at the moment in my current NewHopping branch. There are further improvements possible, for instance doing only the spin projection (1 \pm \gamma_\mu) in the first round and applying all the gauge fields in the second round. That would give more time to overlap communication and computation.
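
The intended overlap can be sketched with generic non-blocking MPI. This is not the actual implementation (the NewHopping branch uses SPI on the BG/Q); it only illustrates the idea of starting the boundary exchange, computing the interior sites that need no remote data, and doing the boundary sites only after the halos have arrived. All names below are hypothetical stand-ins:

#include <mpi.h>

#define NDIM 4

/* Stand-ins for the two parts of the hopping matrix. */
static void compute_interior(void) { /* bulk sites, no remote data needed */ }
static void compute_boundary(void) { /* sites that use the received halo  */ }

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  /* hypothetical 4-dimensional, periodic process grid */
  int dims[NDIM] = {0, 0, 0, 0}, periods[NDIM] = {1, 1, 1, 1}, nprocs;
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Dims_create(nprocs, NDIM, dims);
  MPI_Comm cart;
  MPI_Cart_create(MPI_COMM_WORLD, NDIM, dims, periods, 1, &cart);

  /* toy halo buffers, one pair per direction */
  enum { BUFLEN = 1024 };
  static double sendbuf[2 * NDIM][BUFLEN], recvbuf[2 * NDIM][BUFLEN];
  MPI_Request req[4 * NDIM];
  int nreq = 0;

  /* 1) start the halo exchange in all 2*NDIM directions */
  for (int mu = 0; mu < NDIM; mu++) {
    int lo, hi;
    MPI_Cart_shift(cart, mu, 1, &lo, &hi);
    MPI_Irecv(recvbuf[2 * mu], BUFLEN, MPI_DOUBLE, lo, 2 * mu, cart, &req[nreq++]);
    MPI_Isend(sendbuf[2 * mu], BUFLEN, MPI_DOUBLE, hi, 2 * mu, cart, &req[nreq++]);
    MPI_Irecv(recvbuf[2 * mu + 1], BUFLEN, MPI_DOUBLE, hi, 2 * mu + 1, cart, &req[nreq++]);
    MPI_Isend(sendbuf[2 * mu + 1], BUFLEN, MPI_DOUBLE, lo, 2 * mu + 1, cart, &req[nreq++]);
  }

  /* 2) interior sites need no remote data: compute while messages are in flight */
  compute_interior();

  /* 3) wait for the halos, then do the boundary sites */
  MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
  compute_boundary();

  MPI_Finalize();
  return 0;
}

Doing only the projection in the first round would shrink the sent buffers and lengthen the boundary computation, giving the exchange more computation to hide behind.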

Scaling Plot Hopping Matrix

[scaling plot]

CG Performance

CG performance is quite a bit lower than the Hopping Matrix performance. Fusing (tm - Heo 1/tm Hoe) into only two loops helped a bit: for my test case the performance increased from 863.4 (916) Mflops to 913.9 (980) Mflops with a local 6 x 6 x 6 x 6 lattice and a global 24^3x48 one (in parentheses: with sloppy precision). This is available in the latest bgq_omp branch.
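
The gain comes from the usual loop-fusion pattern: instead of sweeping over the local sites once for the diagonal (twisted-mass) term and a second time to subtract the hopping-term contribution, both are done in a single pass so each output spinor is loaded and stored only once. A generic, self-contained sketch; the spinor layout and helper names are hypothetical, not the tmLQCD ones:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins: a spinor is 4 spins x 3 colours, complex,
 * i.e. 24 doubles per site; the real diagonal term would be
 * (1 + i mu gamma_5), here it is just a scaling to keep things short. */
typedef struct { double c[24]; } spinor;

static void apply_diag_tm(spinor *out, const spinor *in) {
  for (int k = 0; k < 24; k++) out->c[k] = 1.1 * in->c[k];
}
static void sub_hopping_site(spinor *out, const spinor *hop) {
  for (int k = 0; k < 24; k++) out->c[k] -= hop->c[k];
}

/* Unfused: two separate sweeps, the intermediate result is written to
 * and re-read from memory. */
static void unfused(spinor *out, const spinor *in, const spinor *hop, int vol) {
  for (int i = 0; i < vol; i++) apply_diag_tm(&out[i], &in[i]);
  for (int i = 0; i < vol; i++) sub_hopping_site(&out[i], &hop[i]);
}

/* Fused: one sweep, each output site is touched only once. */
static void fused(spinor *out, const spinor *in, const spinor *hop, int vol) {
  for (int i = 0; i < vol; i++) {
    apply_diag_tm(&out[i], &in[i]);
    sub_hopping_site(&out[i], &hop[i]);
  }
}

int main(void) {
  const int vol = 6 * 6 * 6 * 6;            /* local 6^4 lattice as above */
  spinor *in  = calloc(vol, sizeof(spinor));
  spinor *hop = calloc(vol, sizeof(spinor));
  spinor *o1  = calloc(vol, sizeof(spinor));
  spinor *o2  = calloc(vol, sizeof(spinor));

  unfused(o1, in, hop, vol);
  fused(o2, in, hop, vol);
  printf("results agree: %d\n", o1[0].c[0] == o2[0].c[0]);

  free(in); free(hop); free(o1); free(o2);
  return 0;
}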

General performance numbers

Just to collect the current performance measures we are seeing on the BG/Q. We use

  • a volume of 64^4
  • bg_size = 32

Hybrid version:

  • using 4 threads per MPI process
  • NrXProcs=8, NrYProcs=8, NrZProcs=2

Numbers are quoted per node:

  1. MPI: 8704 Mflops
  2. MPI + intrinsics: 12160 Mflops
  3. MPI + OMP: 10048 Mflops
  4. MPI + OMP + XLC prefetch: 12160 Mflops
  5. MPI + OMP + intrinsics: 14528 Mflops

Note that the mapping of the machine was not yet optimised.

Without communication the hybrid version reaches around 15800 Mflops, while a pure OpenMP run maxes out at 16260-16400 Mflops, which (I guess) explains why Peter Boyle went for 1 process per node. In that case, however, the exchanged boundaries are so large that it becomes necessary to interleave communication and computation, as the difference is over 100 Mflops per thread (254 vs 152).
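
To get a feeling for the boundary sizes involved, here is a back-of-the-envelope count of the sites to be exchanged per rank for the 64^4 run above. The 16-ranks-per-node decomposition is the one quoted above (the factor 4 in T follows from 512 ranks / (8x8x2)); the single-rank-per-node grid is an assumed example:

#include <stdio.h>

/* Sites to be exchanged with the neighbours of a local lattice
 * lt x lx x ly x lz: one 3-dimensional face per direction,
 * 2*4 directions in total (counted in units of sites). */
static long surface(long lt, long lx, long ly, long lz) {
  return 2 * (lx * ly * lz + lt * ly * lz + lt * lx * lz + lt * lx * ly);
}

int main(void) {
  /* 64^4 on 32 nodes, 16 ranks/node (grid 4x8x8x2): local 16x8x8x32 */
  printf("16 ranks/node: %ld boundary sites per rank\n", surface(16, 8, 8, 32));
  /* 64^4 on 32 nodes, 1 rank/node (assumed grid 2x2x2x4): local 32x32x32x16 */
  printf(" 1 rank/node:  %ld boundary sites per rank\n", surface(32, 32, 32, 16));
  return 0;
}

With one rank per node each rank has roughly seven times more data to exchange, which is why hiding the exchange behind computation pays off there.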

Francesco

On JUQUEEN

I compiled with:

--with-alignment=32 --without-bgldram --with-limedir=/work/pra067/pra06700/juqueen/programs/lime_c --enable-mpi --enable-qpx --with-mpidimension=4 --enable-omp --enable-gaugecopy --disable-halfspinor --enable-largefile --with-lapack="-L/bgsys/local/lib/ -L/usr/local/bg_soft/lapack/3.3.0/lib -lesslbg -llapack -lesslbg -lxlf90_r -L/opt/ibmcmp/xlf/bg/14.1/lib64 -lxl -lxlopt -lxlf90_r -lxlfmath -L/opt/ibmcmp/xlsmp/bg/3.1/bglib64 -lxlsmp -lpthread" CC=/bgsys/drivers/ppcfloor/comm/xl/bin/mpixlc_r CFLAGS="-I/bgsys/drivers/ppcfloor/arch/include/ -I/bgsys/drivers/ppcfloor/comm/xl/include -O5 -qprefetch=aggressive -qarch=qp -qtune=qp -qmaxmem=-1 -qsimd=noauto -qsmp=noauto -qstrict=all -DBGQ" F77=bgf77 LDFLAGS="-L/opt/ibmcmp/xlf/bg/14.1/lib64 -L/usr/local/bg_soft/lapack/3.3.0 -lxl -lxlopt -lxlf90_r -L/bgsys/drivers/ppcfloor/bgpm/lib/ -lxlfmath -L/opt/ibmcmp/xlsmp/bg/3.1/bglib64 -lxlsmp -lpthread -L/bgsys/ibm_essl/prod/opt/ibmmath/lib64" FC=bgxlf_r

Actually I see that the compiler uses -O3.

I obtain 770 Mflops per rank, i.e. 12320 Mflops per node.

On FERMI

I compiled using:

module load bgq-xl essl lapack blas

../configure --with-alignment=32 --without-bgldram --with-limedir=/gpfs/scratch/userexternal/fsanfili/programs/lime --enable-mpi --enable-qpx --with-mpidimension=4 --enable-omp --enable-gaugecopy --disable-halfspinor --enable-largefile --with-lapack="-lesslbg -llapack -lxl -lblas -L$BLAS_LIB -lxlf90_r -L/opt/ibmcmp/xlf/bg/14.1/lib64/ -lrt -L$ESSL_LIB -L$LAPACK_LIB -lxlopt -lxlfmath -lxlsmp -lpthread -lxlf90_r" CC=mpixlc_r CFLAGS="-O5 -qarch=qp -qtune=qp -qmaxmem=-1 -qsimd=noauto -qsmp=noauto -qstrict=all -DBGQ" F77=bgf77 LDFLAGS="-lxl -lxlopt -lxlsmp -lpthread" FC=bgxlf_r

Performance is the same as on JUQUEEN: 771 Mflops per rank.