QUDA

We use the develop branch of https://github.com/lattice/quda. These instructions were compiled for commit hash 712c4396c7d17b920277f66e680c6f9fb450f674, keeping in mind that we will need to patch it slightly to add additional instantiations of block sizes for the restrictor and transfer operators, which are instantiated via the macros defined in include/launch_kernel.cuh.

For QBIG, two versions of the library need to be compiled: one for the Kepler nodes (lnode02-lnode12) and one for the Pascal nodes (lnode13-lnode15).

We first clone the QUDA source code and then create a separate build directory (outside of the source tree).

On QBIG, nvcc is located in /opt/cuda-8.0/bin and it is helpful to add this to your $PATH for auto-detection to work. Alternatively, add -DCMAKE_CUDA_COMPILER=/opt/cuda-8.0/bin/nvcc to the CMake call below.

Kepler

In the build directory, run the following to configure the build. The installation directory should be chosen so that the two GPU architectures can be distinguished (either sm_35 or kepler would be an appropriate choice).

#!/bin/bash
MPI_HOME=/opt/openmpi-2.0.2a1-with-pmi 
LD_LIBRARY_PATH=${MPI_HOME}/lib:${LD_LIBRARY_PATH} \
CXX=${MPI_HOME}/bin/mpicxx \
CC=${MPI_HOME}/bin/mpicc \
cmake \
-DCMAKE_INSTALL_PREFIX="${QUDA_INSTALLATION_DIRECTORY}" \
-DMPI_CXX_COMPILER=${MPI_HOME}/bin/mpicxx \
-DMPI_C_COMPILER=${MPI_HOME}/bin/mpicc \
-DQUDA_BUILD_ALL_TESTS=OFF \
-DQUDA_GPU_ARCH=sm_35 \
-DQUDA_INTERFACE_QDP=ON \
-DQUDA_INTERFACE_MILC=OFF \
-DQUDA_MPI=ON \
-DQUDA_DIRAC_WILSON=ON \
-DQUDA_DIRAC_TWISTED_MASS=ON \
-DQUDA_DIRAC_TWISTED_CLOVER=ON \
-DQUDA_DIRAC_NDEG_TWISTED_MASS=ON \
-DQUDA_DIRAC_CLOVER=ON \
-DQUDA_DYNAMIC_CLOVER=ON \
-DQUDA_DIRAC_DOMAIN_WALL=OFF \
-DQUDA_MULTIGRID=ON \
-DQUDA_DIRAC_STAGGERED=OFF ${PATH_TO_QUDA_SOURCE}
$ make -j12

This will compile the code. Grab a cup of coffee; it takes about 20 minutes. Subsequently, QUDA must be installed:

$ make install

Pascal

The instructions are identical, except that -DQUDA_GPU_ARCH=sm_60 should be set and a different installation directory should be chosen.

tmLQCD

Since all required changes have been merged, commit 7b163ea795a3593f67c2268591107ac0c61bb752 of the master branch of https://github.com/etmc/tmLQCD is the commit of choice.

Configuration is done via:

$ cd ${TMLQCD_SOURCE_PATH}
$ autoconf

Then a build directory should be created (outside of the source directory). There, the following will configure the code.

Kepler

LD_LIBRARY_PATH=/opt/openmpi-2.0.2a1-with-pmi/lib:${LD_LIBRARY_PATH} \
CC=/opt/openmpi-2.0.2a1-with-pmi/bin/mpicc \
CXX=/opt/openmpi-2.0.2a1-with-pmi/bin/mpicxx \
LD=/opt/openmpi-2.0.2a1-with-pmi/bin/mpicxx \
CFLAGS="-O3 -fopenmp -std=c99 -march=sandybridge -mtune=sandybridge" \
CXXFLAGS="-O3 -fopenmp -std=c++11 -march=sandybridge -mtune=sandybridge" \
LDFLAGS="-fopenmp -L/opt/openmpi-2.0.2a1-with-pmi/lib" \
${TMLQCD_SOURCE_PATH}/configure \
  --enable-halfspinor --enable-gaugecopy \
  --with-limedir=${PATH_TO_LIME_COMPILED_FOR_SANDYBRIDGE} \
  --with-lemondir=${PATH_TO_LEMON_COMPILED_FOR_SANDYBRIDGE} \
  --enable-mpi --with-mpidimension=4 --enable-omp \
  --disable-sse2 --disable-sse3 \
  --with-qudadir=${KEPLER_QUDA_INSTALLATION_DIRECTORY} \
  --enable-alignment=32 --with-lapack="-llapack -lblas" \
  --with-cudadir=/opt/cuda-8.0/lib64

Pascal

LD_LIBRARY_PATH=/opt/openmpi-2.0.2a1-with-pmi/lib:${LD_LIBRARY_PATH} \
CC=/opt/openmpi-2.0.2a1-with-pmi/bin/mpicc \
CXX=/opt/openmpi-2.0.2a1-with-pmi/bin/mpicxx \
LD=/opt/openmpi-2.0.2a1-with-pmi/bin/mpicxx \
CFLAGS="-O3 -fopenmp -std=c99 -march=broadwell -mtune=broadwell" \
CXXFLAGS="-O3 -fopenmp -std=c++11 -march=broadwell -mtune=broadwell" \
LDFLAGS="-fopenmp -L/opt/openmpi-2.0.2a1-with-pmi/lib" \
${TMLQCD_SOURCE_PATH}/configure \
  --enable-halfspinor --enable-gaugecopy \
  --with-limedir=${PATH_TO_LIME_COMPILED_FOR_BROADWELL} \
  --with-lemondir=${PATH_TO_LEMON_COMPILED_FOR_BROADWELL} \
  --enable-mpi --with-mpidimension=4 --enable-omp \
  --disable-sse2 --disable-sse3 \
  --with-qudadir=${PASCAL_QUDA_INSTALLATION_DIRECTORY} \
  --enable-alignment=32 --with-lapack="-llapack -lblas" \
  --with-cudadir=/opt/cuda-8.0/lib64

Running jobs

QBIG currently consists of two sets of nodes.

  • lnode02-lnode12 are equipped with two quad-core Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz (hyperthreading is disabled) and four K20m "Kepler" generation GPUs with 4.7 GB of device memory each. Each node has 64 GB of RAM, of which about 62 GB can be used by compute jobs.
  • lnode13-lnode15 are equipped with two fourteen-core Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz (hyperthreading is disabled).
    • lnode13 and lnode15 are equipped with eight P100 GPUs with 16 GB of device memory each and 768 GB of RAM
    • lnode14 is equipped with four P100 GPUs with 16 GB of device memory each and 512 GB of RAM

On the Slurm side, these are grouped into the kepler and pascal partitions, and Slurm is set up to give GPU jobs high priority (although this does not always work as one would like when many small CPU jobs occupy the nodes). The maximum runtime for GPU jobs is currently 96 hours, while the maximum runtime for CPU jobs is 72 hours.

Job script

A typical job script would look something like this. See https://github.com/lattice/quda/wiki/QUDA-Environment-Variables for descriptions of the various environment variables which can be set to adjust QUDA behaviour.

#!/bin/bash -x
#SBATCH --job-name=JOBNAME
## Warning: mail capacity is quite limited at HISKP, the number of mails sent
## by jobs should be limited to a few hundred per day or so, otherwise the
## infrastructure is overwhelmed.
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<your email address>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --mem=100G
#SBATCH --time=96:00:00
#SBATCH --gres=gpu:pascal:4
#SBATCH --partition=pascal

# Whether to allow the GDR communication pathway. This currently only works
# on selected machines and is only relevant for inter-node communication. We
# keep it disabled, but should run tests in the near future to see how well
# we can run multi-node jobs.

export QUDA_ENABLE_GDR=0

# Enable the peer-to-peer copy and write pathways.

export QUDA_ENABLE_P2P=3

# QUDA_RESOURCE_PATH is where QUDA will store its auto-tuning data.
# A separate QUDA_RESOURCE_PATH must be created for each QUDA version and
# GPU architecture used. Even a single slight change to the QUDA source code
# will invalidate the tuning data.
# Since the tuning depends on the communication policy, the path also needs
# to depend on the values of QUDA_ENABLE_GDR and QUDA_ENABLE_P2P.
# The ${QUDA_COMMIT_HASH} should be filled in by hand and the [pascal/kepler]
# part chosen to match the target architecture.

export QUDA_RESOURCE_PATH=${SOME_PATH}/[pascal/kepler]_${QUDA_COMMIT_HASH}_gdr${QUDA_ENABLE_GDR}_p2p${QUDA_ENABLE_P2P}

# The device memory pool makes allocations faster, but it uses significantly more
# device memory. We disable it. There doesn't seem to be a performance impact for
# measurement tasks.

export QUDA_ENABLE_DEVICE_MEMORY_POOL=0

# generally we run with 2 threads per MPI task to make the best use
# of the machine

export OMP_NUM_THREADS=2

if [ ! -d ${QUDA_RESOURCE_PATH} ]; then
  mkdir -p ${QUDA_RESOURCE_PATH}
fi

cd ${rundir}
date > ${outfile}
srun ${exe} | tee -a ${outfile}
date >> ${outfile}

For Kepler, we use the kepler partition and GPU type, and we are quite limited in the amount of memory available per node:

#SBATCH --gres=gpu:kepler:4
#SBATCH --partition=kepler
#SBATCH --mem=62G

tmLQCD inverter interface

Setting up

In order to use the tmLQCD wrapper functions, you need to include the include/tmLQCD.h header. Beyond this, at the linking stage all tmLQCD libraries need to be linked (this is messy and should be changed, but it is what it is).

An example of a build script (CMake in this case) is shown here: https://github.com/kostrzewa/nyom/blob/1d1e7a4550453442d72f99464645e403f2df53f3/CMakeLists.txt#L36.

After initialising MPI, the function tmLQCD_invert_init should be called as done here: https://github.com/kostrzewa/nyom/blob/devel/include/Geometry.hpp

There, the last parameter is used only if EnableSubProcess is set in the input file, in which case multiple serial copies of tmLQCD are run within a single MPI application and the host application must provide tmLQCD with an ID. For almost all use cases, this parameter should be passed as 0.
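
As a minimal sketch (not a replacement for the linked example), the initialisation might look as follows. The exact signature of tmLQCD_invert_init should be checked against include/tmLQCD.h of the commit in use; the four-argument form (argc, argv, verbosity, subprocess ID) and the verbosity value used here are assumptions based on the nyom example above.

#include <mpi.h>
#include "tmLQCD.h"

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);

  /* Initialise tmLQCD after MPI. The last argument is the subprocess ID,
   * only relevant when EnableSubProcess is set in the input file; for
   * almost all use cases it should be 0. The third argument (verbosity)
   * is an assumed value here. */
  tmLQCD_invert_init(argc, argv, 1, 0);

  /* ... read the gauge field and perform inversions (see below) ... */

  /* ... tear down tmLQCD and MPI when done ... */
  MPI_Finalize();
  return 0;
}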

Inversions

There are two inversion pathways. For either of them, the operators to be inverted must be defined, with their various target parameters, in an invert.input file in the working directory of the job.

Safe inversion pathway with residual check

The slower inversion pathway is via the function

tmLQCD_invert(double * const propagator,
              double * const source,
              const int op_id, 
              const int write_prop);

where propagator and source are full tmLQCD-ordered spinor fields. op_id refers to the position, in the input file, of the operator to be inverted, while write_prop indicates whether the resulting propagator should be written to file in the usual format after the inversion.

This inversion pathway incurs quite a few copies of the source and propagator fields, but it is safer in the sense that the residual is explicitly checked after the inversion.
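
As a rough sketch of how this might be used (the field size of 24 real doubles per lattice site — 4 spin times 3 colour times 2 complex components — and the local_volume variable are assumptions, not part of the interface):

#include <stdlib.h>
#include "tmLQCD.h"

/* Hypothetical helper: invert operator op_id on one source via the safe
 * pathway. local_volume is the local lattice volume of this MPI task; a
 * full spinor field is assumed to hold 24 real doubles per site. */
void invert_on_source(const int op_id, const size_t local_volume) {
  double *source     = calloc(24 * local_volume, sizeof(double));
  double *propagator = calloc(24 * local_volume, sizeof(double));

  /* ... fill the source field in tmLQCD ordering ... */

  /* Safe pathway: the residual is checked internally; the last argument
   * controls whether the propagator is written to disk. */
  tmLQCD_invert(propagator, source, op_id, 0);

  /* ... use the propagator ... */

  free(source);
  free(propagator);
}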

Fast inversion pathway

The faster inversion pathway passes directly to the QUDA interface. There is no support for propagator disk I/O, and the residual is not checked after the inversion via the tmLQCD operators, but far fewer copies are made, which is of relevance on dense machines like QBIG.

invert_quda_direct(double * const propagator, 
                   double * const source,
                   const int op_id);
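
Usage mirrors tmLQCD_invert, just without the write_prop argument; with source and propagator allocated as in the previous sketch, the call would simply be:

/* Fast pathway: fields are handed directly to the QUDA interface, with no
 * propagator I/O and no residual check via the tmLQCD operators. */
invert_quda_direct(propagator, source, op_id);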

Usage case

Say we want to produce up and down propagators for a 32c64 twisted clover lattice. The input file for this case would look something like:

ompnumthreads = 2

L=32
T=64

DisableIOChecks = no
DebugLevel = 2
InitialStoreCounter = 0
Measurements = 1
2kappamu = 0.000840387
kappa = 0.1400645
csw = 1.74

BCAngleT = 1

GaugeConfigInputFile = /hiskp4/gauges/nf211/cA211a.30.32/conf
UseEvenOdd = yes
UseRelativePrecision = yes

EnableSubProcess = no

## setting this to 'yes' results in tmLQCD running in a low memory
## mode where none of the Dirac operators will function correctly
## (residual checking is thus impossible and only the direct QUDA
## pathway is supported). Also, the plaquette of the gauge field
## will come out incorrect, although the gauge field is of course
## allocated correctly. This might be necessary to be able to run
## a 32c64 lattice on the Kepler nodes while having enough host memory
## left over to do useful things.
EnableLowmem = no

## these are ignored
SourceType = Volume
ReadSource = no
NoSamples = 1
####################

ranluxdlevel=2
reproducerandomnumbers=no

BeginExternalInverter QUDA
  MGNumberOfLevels = 3
  MgEnableSizeThreeBlocks = no
  MGCoarseMuFactor = 1.0, 1.0, 3.0
  MGSetupMaxSolverIterations = 1000
  MGSetupSolverTolerance = 5e-7
  MGSetupSolver = cg
  MGSmootherPreIterations = 0
  MGSmootherPostIterations = 3
  MGSmootherTolerance = 0.35
  MGOverUnderRelaxationFactor = 0.85
  MGCoarseSolverTolerance = 0.45
  MGCoarseMaxSolverIterations = 50
  MGRunVerify = no
EndExternalInverter

BeginOperator CLOVER
  2kappamu = 0.000840387
  csw = 1.74
  kappa = 0.1400645
  Solver = mg
  useexternalinverter = quda
  SolverPrecision = 1e-19
  MaxSolverIterations = 500
  usesloppyprecision = single
  usecompression = 18
  useevenodd = yes
EndOperator

BeginOperator CLOVER
  2kappamu = -0.000840387
  csw = 1.74
  kappa = 0.1400645
  Solver = mg
  useexternalinverter = quda
  SolverPrecision = 1e-19
  MaxSolverIterations = 500
  usesloppyprecision = single
  usecompression = 18
  useevenodd = yes
EndOperator

where the first, "up", operator would correspond to op_id=0 and the second, "down", operator to op_id=1.

Note that because the MG setup is refreshed when the sign of mu is changed, all "up" inversions should be done in one go and only then should the sign be switched. The same holds when increasing the quark mass: the first run of the MG should always use the lightest mass to be inverted, so that the setup is of good quality.
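
In terms of the interface described above, the ordering constraint amounts to something like the following sketch, where sources and propagators are hypothetical arrays of tmLQCD-ordered spinor fields prepared by the host application:

/* Invert all sources with the "up" operator (op_id = 0) first, so that the
 * MG setup generated for it is reused, and only then switch the sign of mu
 * by moving on to the "down" operator (op_id = 1). */
for (int op_id = 0; op_id < 2; ++op_id) {
  for (int s = 0; s < n_sources; ++s) {
    tmLQCD_invert(propagators[op_id][s], sources[s], op_id, 0);
  }
}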

The various parameters of the MG must be tuned for optimum performance. Code is available for doing this and will be documented below.

Tuning MG parameters