
This document describes how to set up a special test version of tmLQCD linked against QUDA and another one linked against QPHIX to perform benchmarks and scaling tests.

Table of contents

  1. tmLQCD dependencies
  2. Test configurations
  3. QUDA
    1. Compiling QUDA
    2. Compiling QUDA-enabled tmLQCD
    3. Running QUDA-enabled tmLQCD
  4. QPhiX (out of date)
    1. Compiling QMP
    2. Compiling QPhiX
    3. Compiling QPhiX-enabled tmLQCD
    4. Running QPhiX-enabled tmLQCD
  5. Input files

tmLQCD dependencies

tmLQCD uses the lemon library (https://github.com/etmc/lemon) for parallel I/O. In addition, the lime library (https://github.com/usqcd-software/c-lime) is required.

Note that lemon requires an MPI compiler, and this should always match the MPI compiler used for tmLQCD and any libraries it depends on.

Both can be compiled using the usual configure/make steps and should be installed into local directories using --prefix and make install. Note that two versions should be compiled: one using GCC for the QUDA-enabled build and one using ICC for the QPhiX-enabled build.
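A minimal sketch of these steps (the paths, prefixes and compiler names are placeholders, and release sources are assumed to already ship a configure script):

# lime is serial, a plain C compiler is sufficient
$ cd ${path_to_lime_source}
$ ./configure CC=gcc --prefix=${HOME}/local/lime
$ make && make install

# lemon must be built with the same MPI compiler as tmLQCD and its other dependencies
$ cd ${path_to_lemon_source}
$ ./configure CC=mpicc --prefix=${HOME}/local/lemon
$ make && make install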

Test configurations

The test configurations for these benchmarks are ETMC 2+1+1 configurations at two lattice spacings (~0.09 fm and ~0.065 fm) with lattice volumes of 32^3 x 64 and 48^3 x 96 sites, respectively.

Ensemble | configurations               | kappa_c  | a mu
A30.32   | 1000, 1200, 1400, 1600, 1800 | 0.163272 | 0.0030
D15.48   | 100, 200, 300, 400, 500, 600 | 0.156361 | 0.0015

QUDA

This test was set up on commit 923a8a6592a9f50b135dd8e9419c969432249a0e of the feature/multigrid branch of https://github.com/lattice/quda.

To maximise QUDA performance at scale, the recommendations of https://github.com/lattice/quda/wiki/Multi-GPU-Support should be followed. The multigrid solver is documented at https://github.com/lattice/quda/wiki/Multigrid-Solver.

Compiling QUDA

QUDA cannot currently be compiled with the Intel compiler, so we proceed with GCC. For concreteness, these instructions are for Jureca.
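The environment shown below can be arrived at roughly as follows (a sketch; the load order mirrors the job script further down, and the versions actually resolved by module load may differ between software stages):

$ module load CUDA/9.1.85 GCC/5.5.0 MVAPICH2/2.3a-GDR CMake/3.11.1 imkl/2018.2.199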

 $ module list

Currently Loaded Modules:
  1) GCCcore/.5.5.0 (H)   7) MVAPICH2/2.3a-GDR (g)
  2) binutils/.2.30 (H)   8) ncurses/.6.0      (H)
  3) StdEnv         (H)   9) CMake/3.11.1
  4) nvidia/.driver (H)  10) Bison/.3.0.4      (H)
  5) CUDA/9.1.85    (g)  11) flex/2.6.4
  6) GCC/5.5.0           12) imkl/2018.2.199

  Where:
   g:  Built for GPU
   H:             Hidden Module


$ git clone https://github.com/lattice/quda -b feature/multigrid
$ cd quda
$ git checkout 923a8a6592a9f50b135dd8e9419c969432249a0e
$ mkdir build
$ cd build
$ cmake \
  -DCMAKE_CXX_COMPILER=/usr/local/software/jureca/Stages/2018a/software/GCCcore/5.5.0/bin/g++ \
  -DCMAKE_C_COMPILER=/usr/local/software/jureca/Stages/2018a/software/GCCcore/5.5.0/bin/gcc \
  -DCMAKE_CUDA_HOST_COMPILER=/usr/local/software/jureca/Stages/2018a/software/GCCcore/5.5.0/bin/g++ \
  -DMPI_CXX_COMPILER=/usr/local/software/jureca/Stages/2018a/software/MVAPICH2/2.3a-GCC-5.5.0-GDR/bin/mpicxx \
  -DMPI_C_COMPILER=/usr/local/software/jureca/Stages/2018a/software/MVAPICH2/2.3a-GCC-5.5.0-GDR/bin/mpicc \
  -DQUDA_DIRAC_CLOVER=ON \
  -DQUDA_DIRAC_DOMAIN_WALL=OFF \
  -DQUDA_DIRAC_NDEG_TWISTED_MASS=ON \
  -DQUDA_DIRAC_STAGGERED=OFF \
  -DQUDA_DIRAC_TWISTED_CLOVER=ON \
  -DQUDA_DIRAC_TWISTED_MASS=ON \
  -DQUDA_DIRAC_WILSON=ON \
  -DQUDA_DYNAMIC_CLOVER=OFF \
  -DQUDA_MPI=ON \
  -DQUDA_INTERFACE_MILC=OFF \
  -DQUDA_INTERFACE_QDP=ON \
  -DQUDA_MULTIGRID=ON \
  -DQUDA_GPU_ARCH=sm_37 \
  ..     # the sources are in the parent directory
$ make -j

For QUDA_GPU_ARCH, the compute architecture of the target GPU must be specified: for example, K80 -> sm_37, P100 (Pascal) -> sm_60, V100 (Volta) -> sm_70.
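If in doubt about which GPU model a given partition provides, one can list the devices directly from a compute node and map the model to the examples above (the partition name is a placeholder, adjust it to the machine at hand):

$ srun --partition=gpus --gres=gpu:1 --time=00:05:00 nvidia-smi -L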

Compiling QUDA-enabled tmLQCD

This test is based on commit b010542a9bb4e17142fa1288b91f73d630ac0400 of the jsc_benchmark branch of https://github.com/etmc/tmLQCD. BLAS/LAPACK is required for tmLQCD to compile, but for the purpose of this test it does not need to be a highly optimized implementation. On Jureca, we use imkl.

 $ module list

Currently Loaded Modules:
  1) GCCcore/.5.5.0 (H)   7) MVAPICH2/2.3a-GDR (g)
  2) binutils/.2.30 (H)   8) ncurses/.6.0      (H)
  3) StdEnv         (H)   9) CMake/3.11.1
  4) nvidia/.driver (H)  10) Bison/.3.0.4      (H)
  5) CUDA/9.1.85    (g)  11) flex/2.6.4
  6) GCC/5.5.0           12) imkl/2018.2.199

  Where:
   g:  Built for GPU
   H:             Hidden Module


$ git clone https://github.com/etmc/tmLQCD tmLQCD.quda -b jsc_benchmark
$ cd tmLQCD.quda
$ autoconf              # only autoconf and perhaps aclocal should be executed
                        # tmLQCD does not use the full autotools chain
                        # this generates 'configure'
$ mkdir build
$ cd build
$ ../configure --enable-halfspinor --enable-gaugecopy \
  --with-lemondir=${PATH_TO_LEMON} \
  --with-limedir=${PATH_TO_LIME} \
  --enable-mpi --with-mpidimension=4 --enable-omp \
  --disable-sse2 --disable-sse3 --enable-alignment=32 \
  --with-lapack="-L${MKLROOT}/lib/intel64 -lmkl_blas95_lp64 -lmkl_avx2 -lmkl_core -lmkl_intel_thread -lmkl_intel_lp64" \
  F77=gfortran CC=mpicc CFLAGS="-fopenmp -std=c99 -O3 -march=haswell -mtune=haswell" \
  CXXFLAGS="-fopenmp --std=c++11 -O3 -march=haswell -mtune=haswell" \
  CXX=mpicxx --with-qudadir=${PATH_TO_QUDA} --with-cudadir=${CUDA_HOME} \
  LDFLAGS=-lcuda
$ make -j
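As an optional sanity check, one can verify that the resulting invert binary was indeed linked against the CUDA runtime (QUDA itself is typically linked in statically, so only the CUDA libraries are expected to show up):

$ ldd ./invert | grep -i cuda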

Running QUDA-enabled tmLQCD

If the job scheduler exports the environment automatically, running can be as simple as shown below for two Jureca nodes.

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#SBATCH --output=out.2nodes.%j.out
#SBATCH --error=err.2nodes.%j.err                                                                                                         
#SBATCH --partition=gpus
#SBATCH --gres=gpu:4

module load CUDA
module load GCC/5.5.0
module load MVAPICH2/2.3a-GDR
module load imkl

# QUDA needs a path to store the results of the auto-tuning step
# and we identify it via its git commit hash to keep track of the
# exact quda version that it is intended for
export QUDA_RESOURCE_PATH=~/misc/jureca/quda_resources/923a8a6592a9f50b135dd8e9419c969432249a0e

if [ ! -d ${QUDA_RESOURCE_PATH} ]; then
  mkdir -p ${QUDA_RESOURCE_PATH}
fi

# we disable the device memory pool to reduce memory consumption
OMP_NUM_THREADS=6 \
MV2_USE_CUDA=1 MV2_GPUDIRECT_LIMIT=1000000000 \
CUDA_DEVICE_MAX_CONNECTIONS=1 \
QUDA_ENABLE_DEVICE_MEMORY_POOL=0 \
srun ${path_to_tmlqcd}/invert -v -f invert.A30.32.quda.input
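Assuming the script above is saved under a hypothetical name such as job.quda.2nodes.sh, it is submitted in the usual way:

$ sbatch job.quda.2nodes.sh
$ squeue -u $USER      # check that the job has been queued and eventually starts running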

QPhiX (out of date)

This test is set up on top of commit dd937915e507c5eb87e7f4c42c76a6103adda3c1 of the juelich_qphix-tmf branch of https://github.com/kostrzewa/qphix.

Compiling QMP

QMP (https://github.com/usqcd-software/qmp) is a required dependency of QPhiX. Make sure to configure it with the MPI comms type (the default is SINGLE):

$ cd $qmp_build_dir
$ CC=mpicc ${path_to_qmp_source}/configure --prefix=${qmp_installation_dir} CFLAGS=-std=c99 --with-qmp-comms-type=MPI
$ make
$ make install
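To check where QMP ended up and which flags downstream builds should use, the qmp-config helper script that make install normally places under ${qmp_installation_dir}/bin can be queried (a sketch; availability and option names may vary slightly between QMP versions):

$ ${qmp_installation_dir}/bin/qmp-config --cflags
$ ${qmp_installation_dir}/bin/qmp-config --ldflags
$ ${qmp_installation_dir}/bin/qmp-config --libs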

Compiling QPhiX

Compilation proceeds using the autotools chain.

$ module list
 
  Currently Loaded Modules:
    1) GCCcore/.5.4.0                7) ParaStationMPI/5.1.5-1
    2) binutils/.2.27                8) Bison/.3.0.4
    3) icc/.2017.0.098-GCC-5.4.0     9) flex/2.6.0
    4) ifort/.2017.0.098-GCC-5.4.0  10) ncurses/.6.0
    5) Intel/2017.0.098-GCC-5.4.0   11) CMake/3.6.2
    6) pscom/.Default               12) imkl/2017.0.098

$ git clone -b juelich_qphix-tmf https://github.com/kostrzewa/qphix
$ cd qphix
$ autoreconf -fi # generate all necessary autotools files
$ mkdir build
$ cd build
# for KNL, --enable-proc=AVX512
$ CC=mpicc CXX=mpicxx \
  ../configure --enable-openmp --enable-parallel-arch=parscalar --enable-soalen=4 \
  --enable-proc=AVX2 --enable-twisted-mass \
  --with-qmp=${path_to_qmp} \
  --prefix=${installation_directory}
$ make -j
$ make install
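After make install, the QPhiX headers and static libraries should sit under the chosen prefix; a quick look confirms this (exact file names may differ between QPhiX versions):

$ ls ${installation_directory}/include/qphix
$ ls ${installation_directory}/lib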

Compiling QPhiX-enabled tmLQCD

The QPhiX-enabled benchmark is based on commit 996046b8a794b7587ca9af4d984120e0899580c9 of the juelich_qphix_devel branch of https://github.com/kostrzewa/tmLQCD.

In the commands below, the use of MKL is entirely optional; no LAPACK routines are called during the benchmark. OpenBLAS would work just as well.
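If OpenBLAS is preferred, the --with-lapack argument in the configure call below could be replaced by something along these lines (a sketch; ${OPENBLAS_ROOT} is a placeholder for the OpenBLAS installation prefix):

--with-lapack="-L${OPENBLAS_ROOT}/lib -lopenblas"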

$ module list
 
  Currently Loaded Modules:
    1) GCCcore/.5.4.0                7) ParaStationMPI/5.1.5-1
    2) binutils/.2.27                8) Bison/.3.0.4
    3) icc/.2017.0.098-GCC-5.4.0     9) flex/2.6.0
    4) ifort/.2017.0.098-GCC-5.4.0  10) ncurses/.6.0
    5) Intel/2017.0.098-GCC-5.4.0   11) CMake/3.6.2
    6) pscom/.Default               12) imkl/2017.0.098

$ git clone -b juelich_qphix_devel https://github.com/kostrzewa/tmLQCD tmLQCD.qphix
$ cd tmLQCD.qphix
# run at most aclocal and autoconf; tmLQCD does not use the full autotools chain
$ autoconf
$ mkdir build
$ cd build
$ ../configure --with-limedir=${path_to_lime} \
  --with-lemondir=${path_to_lemon} \
  --with-mpidimension=4 --enable-omp --enable-mpi \
  --disable-sse2 --disable-sse3 \
  --with-lapack="-Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -lpthread -lm -ldl" \
  --enable-halfspinor --enable-gaugecopy \
  --with-qphixdir=${path_to_qphix} \
  --with-qmpdir=${path_to_qmp} \
  CC=mpicc CXX=mpicxx F77=ifort \
  CFLAGS="-O3 -std=c99 -qopenmp -xCORE-AVX2" \
  CXXFLAGS="-O3 -std=c++11 -qopenmp -xCORE-AVX2" \
  LDFLAGS="-qopenmp"
$ make -j

For KNL, one would probably use -xMIC-AVX512 in CFLAGS and CXXFLAGS. Note that cross-compilation is also required, because the test executables built by configure will not run on the system used for compilation:

../configure [...] --host=x86_64-linux-gnu [...] \
  CFLAGS="-O3 -std=c99 -qopenmp -xMIC-AVX512" \
  CXXFLAGS="-O3 -std=c++11 -qopenmp -xMIC-AVX512" \
  LDFLAGS="-qopenmp"

Running QPhiX-enabled tmLQCD

If the job scheduler automatically exports the environment, QPhiX-enabled tmLQCD can be run as shown below for two Jureca nodes.

## load modules
OMP_NUM_THREADS=12 KMP_AFFINITY=balanced srun --ntasks=4 --ntasks-per-node=2 \
--cpus-per-task=12 \
${path_to_tmLQCD}/invert -v -f invert.A30.32.qphix.input

On KNL, thread placement probably requires fine-tuning. On dual-socket machines, it is generally beneficial to have two MPI tasks per node (one per socket).
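As a starting point for such fine-tuning, a hypothetical launch with a single MPI task per 64-core KNL node might look as follows (the thread count, affinity setting and task geometry are guesses and will certainly need adjusting):

OMP_NUM_THREADS=64 KMP_AFFINITY=balanced srun --ntasks-per-node=1 \
--cpus-per-task=64 \
${path_to_tmLQCD}/invert -v -f invert.A30.32.qphix.input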

Input files

The input files are located in this gist and contain platform-related documentation in the form of comments.

A30.32

  mixed precision CG
    QUDA:  invert.A30.32.quda.CG.input
    QPhiX: invert.A30.32.qphix.input

  QUDA multigrid
    invert.A30.32.quda.MG.input

D15.48

  mixed precision CG
    QUDA:  invert.D15.48.quda.CG.input
    QPhiX: invert.D15.48.qphix.input

  QUDA multigrid
    invert.D15.48.quda.MG.input