QBIG compilation
We use the develop branch of https://github.com/lattice/quda. The instructions have been compiled for commit hash 712c4396c7d17b920277f66e680c6f9fb450f674, keeping in mind that we will need to patch this slightly to add additional instantiations of block sizes for the restrictors and transfer operators, which are instantiated via the macros defined in include/launch_kernel.cuh.
For QBIG, two versions of the library need to be compiled for our kepler (lnode02-lnode12) and pascal (lnode13-lnode15) nodes.
We first clone the QUDA source code and then create a separate build directory (outside of the source tree).
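For example, this might look as follows (a sketch; the build-directory names are arbitrary choices):
# clone QUDA and check out the documented commit
git clone https://github.com/lattice/quda.git
cd quda
git checkout 712c4396c7d17b920277f66e680c6f9fb450f674
cd ..
# one build directory per GPU architecture, outside of the source tree
mkdir -p build_quda_sm_35 build_quda_sm_60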
On QBIG, nvcc is located in /opt/cuda-8.0/bin and it is helpful to add this to your $PATH for auto-detection to work. Alternatively, add -DCMAKE_CUDA_COMPILER=/opt/cuda-8.0/bin/nvcc to the CMake call below.
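For example (a sketch, to be adapted to your shell environment):
# make the CUDA 8.0 toolkit visible so that nvcc can be auto-detected
export PATH=/opt/cuda-8.0/bin:${PATH}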
In the build directory, run the following to configure the build. The installation directory should be chosen such that the different GPU architectures can be told apart (either sm_35 or kepler would be an appropriate choice for the Kepler build).
#!/bin/bash
MPI_HOME=/opt/openmpi-2.0.2a1-with-pmi
LD_LIBRARY_PATH=${MPI_HOME}/lib:${LD_LIBRARY_PATH} \
CXX=${MPI_HOME}/bin/mpicxx \
CC=${MPI_HOME}/bin/mpicc \
cmake \
-DCMAKE_INSTALL_PREFIX="${QUDA_INSTALLATION_DIRECTORY}" \
-DMPI_CXX_COMPILER=${MPI_HOME}/bin/mpicxx \
-DMPI_C_COMPILER=${MPI_HOME}/bin/mpicc \
-DQUDA_BUILD_ALL_TESTS=OFF \
-DQUDA_GPU_ARCH=sm_35 \
-DQUDA_INTERFACE_QDP=ON \
-DQUDA_INTERFACE_MILC=OFF \
-DQUDA_MPI=ON \
-DQUDA_DIRAC_WILSON=ON \
-DQUDA_DIRAC_TWISTED_MASS=ON \
-DQUDA_DIRAC_TWISTED_CLOVER=ON \
-DQUDA_DIRAC_NDEG_TWISTED_MASS=ON \
-DQUDA_DIRAC_CLOVER=ON \
-DQUDA_DYNAMIC_CLOVER=ON \
-DQUDA_DIRAC_DOMAIN_WALL=OFF \
-DQUDA_MULTIGRID=ON \
-DQUDA_DIRAC_STAGGERED=OFF ${PATH_TO_QUDA_SOURCE}
$ make -j12
will compile the code. Grab a cup of coffee; this takes about 20 minutes. Subsequently, QUDA must be installed:
$ make install
For the Pascal nodes (lnode13-lnode15), the instructions are identical except that -DQUDA_GPU_ARCH=sm_60 should be set (and a correspondingly named installation directory, e.g. sm_60 or pascal, should be chosen).
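As a sketch, relative to the Kepler configuration above only these two options change (the installation-prefix variable name used here is an arbitrary placeholder):
-DCMAKE_INSTALL_PREFIX="${QUDA_INSTALLATION_DIRECTORY_PASCAL}" \
-DQUDA_GPU_ARCH=sm_60 \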
For tmLQCD, since everything has been merged, commit 7b163ea795a3593f67c2268591107ac0c61bb752 of the master branch of https://github.com/etmc/tmLQCD is the commit of choice. Configuration is done via:
$ cd ${TMLQCD_SOURCE_PATH}
$ autoconf
Then a build directory should be created for each target architecture (outside of the source directory). In the Kepler build directory, the following will configure the code:
LD_LIBRARY_PATH=/opt/openmpi-2.0.2a1-with-pmi/lib:${LD_LIBRARY_PATH} \
CC=/opt/openmpi-2.0.2a1-with-pmi/bin/mpicc \
CXX=/opt/openmpi-2.0.2a1-with-pmi/bin/mpicxx \
LD=/opt/openmpi-2.0.2a1-with-pmi/bin/mpicxx \
CFLAGS="-O3 -fopenmp -std=c99 -march=sandybridge -mtune=sandybridge" \
CXXFLAGS="-O3 -fopenmp -std=c++11 -march=sandybridge -mtune=sandybridge" \
LDFLAGS="-fopenmp -L/opt/openmpi-2.0.2a1-with-pmi/lib" \
${TMLQCD_SOURCE_PATH}/configure \
--enable-halfspinor --enable-gaugecopy \
--with-limedir=${PATH_TO_LIME_COMPILED_FOR_SANDYBRIDGE} \
--with-lemondir=${PATH_TO_LEMON_COMPILED_FOR_SANDYBRIDGE} \
--enable-mpi --with-mpidimension=4 --enable-omp \
--disable-sse2 --disable-sse3 \
--with-qudadir=${KEPLER_QUDA_INSTALLATION_DIRECTORY} \
--enable-alignment=32 --with-lapack="-llapack -lblas" \
--with-cudadir=/opt/cuda-8.0/lib64
For the Pascal nodes, the corresponding configuration (with Broadwell host tuning and the Pascal QUDA installation) is:
LD_LIBRARY_PATH=/opt/openmpi-2.0.2a1-with-pmi/lib:${LD_LIBRARY_PATH} \
CC=/opt/openmpi-2.0.2a1-with-pmi/bin/mpicc \
CXX=/opt/openmpi-2.0.2a1-with-pmi/bin/mpicxx \
LD=/opt/openmpi-2.0.2a1-with-pmi/bin/mpicxx \
CFLAGS="-O3 -fopenmp -std=c99 -march=broadwell -mtune=broadwell" \
CXXFLAGS="-O3 -fopenmp -std=c++11 -march=broadwell -mtune=broadwell" \
LDFLAGS="-fopenmp -L/opt/openmpi-2.0.2a1-with-pmi/lib" \
${TMLQCD_SOURCE_PATH}/configure \
--enable-halfspinor --enable-gaugecopy \
--with-limedir=${PATH_TO_LIME_COMPILED_FOR_BROADWELL} \
--with-lemondir=${PATH_TO_LEMON_COMPILED_FOR_BROADWELL} \
--enable-mpi --with-mpidimension=4 --enable-omp \
--disable-sse2 --disable-sse3 \
--with-qudadir=${PASCAL_QUDA_INSTALLATION_DIRECTORY} \
--enable-alignment=32 --with-lapack="-llapack -lblas" \
--with-cudadir=/opt/cuda-8.0/lib64
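After configuring, tmLQCD is built with make in each build directory in the usual autotools fashion, e.g. (the number of parallel jobs is a choice):
make -j12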
QBIG currently consists of two sets of nodes.
- lnode02-lnode12 are equipped with two quad-core Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz (hyperthreading is disabled) and four K20m "Kepler" generation GPUs with 4.7 GB of device memory each. Each node has 64 GB of RAM, of which about 62 GB can be used by compute jobs.
- lnode13-lnode15 are equipped with two fourteen-core Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz (hyperthreading is disabled).
  - lnode13 and lnode15 are equipped with eight P100 GPUs with 16 GB of device memory each and 768 GB of RAM.
  - lnode14 is equipped with four P100 GPUs with 16 GB of device memory each and 512 GB of RAM.
On the slurm side, these are partitioned into the kepler and pascal partitions, and slurm is set up to highly prioritise GPU jobs (although this doesn't always work as one would like if there are many small CPU jobs occupying nodes). The maximum runtime for GPU jobs is currently 96 hours, while the maximum runtime for CPU jobs is 72 hours.
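The current state of the two partitions can be inspected with the standard slurm tools, e.g.:
# list nodes and their state in the two GPU partitions
sinfo --partition=kepler,pascal
# show pending and running jobs in the pascal partition
squeue --partition=pascal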
A typical job script would look something like this. See https://github.com/lattice/quda/wiki/QUDA-Environment-Variables for descriptions of the various environment variables which can be set to adjust QUDA behaviour.
#!/bin/bash -x
#SBATCH --job-name=JOBNAME
## Warning: mail capacity is quite limited at HISKP; the number of mails sent
## by jobs should be limited to a few hundred per day or so, otherwise the
## infrastructure is overwhelmed.
#SBATCH --mail-type=FAIL,END
#SBATCH [email protected]
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --mem=100G
#SBATCH --time=96:00:00
#SBATCH --gres=gpu:pascal:4
#SBATCH --partition=pascal
# Whether to allow the GDR communication pathway. This currently only works
# on selected machines and is only relevant for inter-node communication. We keep
# it disabled, but should run tests in the near future to see how well we can run
# multi-node jobs.
export QUDA_ENABLE_GDR=0
# Enable the peer-to-peer copy and write pathways.
export QUDA_ENABLE_P2P=3
# QUDA_RESOURCE_PATH is where QUDA will store its auto-tuning data.
# A separate QUDA_RESOURCE_PATH must be created for each QUDA version and
# GPU architecture used; even a single slight change to the QUDA source code
# will invalidate the tuning data.
# Since tuning depends on the communication policy, the path should also depend
# on the values of QUDA_ENABLE_GDR and QUDA_ENABLE_P2P.
# ${QUDA_COMMIT_HASH} should be added by hand.
export QUDA_RESOURCE_PATH=${SOME_PATH}/[pascal/kepler]_${QUDA_COMMIT_HASH}_gdr${QUDA_ENABLE_GDR}_p2p${QUDA_ENABLE_P2P}
# The device memory pool makes allocations faster, but it uses significantly more
# device memory. We disable it. There doesn't seem to be a performance impact for
# measurement tasks.
export QUDA_ENABLE_DEVICE_MEMORY_POOL=0
# generally we run with 2 threads per MPI task to make the best use
# of the machine
export OMP_NUM_THREADS=2
if [ ! -d ${QUDA_RESOURCE_PATH} ]; then
mkdir -p ${QUDA_RESOURCE_PATH}
fi
cd ${rundir}
date > ${outfile}
srun ${exe} | tee -a ${outfile}
date >> ${outfile}
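Assuming the script above has been saved as, say, job.sh (a hypothetical name), it is submitted and monitored with the usual slurm commands:
sbatch job.sh
squeue -u ${USER}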
For Kepler, we use the kepler partition and GPU type, and we're quite limited in the amount of memory that we have available per node.
#SBATCH --gres=gpu:kepler:4
#SBATCH --partition=kepler
#SBATCH --mem=62G
In order to use the tmLQCD wrapper functions, you need to include the include/tmLQCD.h header. Beyond this, at the linking stage all tmLQCD libraries need to be linked (this is messy and should be changed, but it is what it is).
An example of a build script (CMake in this case) is shown here: https://github.com/kostrzewa/nyom/blob/1d1e7a4550453442d72f99464645e403f2df53f3/CMakeLists.txt#L36.
After initialising MPI, the function tmLQCD_invert_init should be called as done here: https://github.com/kostrzewa/nyom/blob/devel/include/Geometry.hpp
There, the last parameter is used only if EnableSubProcess is set in the input file, in which case multiple copies of serial versions of tmLQCD are used within an MPI application and the host application must provide tmLQCD with an ID. For almost all use cases, this parameter should be passed as 0.
There are two inversion pathways. For either of them, operators must be defined in an invert.input file in the present working directory of the job, with the various target parameters.
The slower inversion pathway is via the function
tmLQCD_invert(double * const propagator,
double * const source,
const int op_id,
const int write_prop);
where propagator and source are full tmLQCD-ordered spinor fields. op_id refers to the position of the operator to be inverted in the input file, while write_prop indicates whether, after inversion, the produced propagator should be written to file in the usual format.
This inversion pathway incurs quite a few copies of the source and propagator, but it is safer in the sense that the residual is explicitly checked after inversion.
The faster inversion pathway passes directly to the QUDA interface and there is no support for propagator disk I/O. Neither is the residual checked after inversion via tmLQCD operators, but there are far fewer copies made, which is of relevance on dense machines like QBIG.
invert_quda_direct(double * const propagator,
double * const source,
const int op_id);
Say we want to produce up and down propagators for a 32c64 twisted clover lattice. The input file for this case would look something like:
ompnumthreads = 2
L=32
T=64
DisableIOChecks = no
DebugLevel = 2
InitialStoreCounter = 0
Measurements = 1
2kappamu = 0.000840387
kappa = 0.1400645
csw = 1.74
BCAngleT = 1
GaugeConfigInputFile = /hiskp4/gauges/nf211/cA211a.30.32/conf
UseEvenOdd = yes
UseRelativePrecision = yes
EnableSubProcess = no
## setting this to 'yes' results in tmLQCD running in a low memory
## mode where none of the Dirac operators will function correctly
## (residual checking is thus impossible and only the direct QUDA
## pathway is supported). Also, the plaquette of the gauge field
## will come out incorrect, although the gauge field is of course
## allocated correctly. This might be necessary to be able to run
## a 32c64 lattice on the Kepler nodes while having enough host memory
## left over to do useful things.
EnableLowmem = no
## these are ignored
SourceType = Volume
ReadSource = no
NoSamples = 1
####################
ranluxdlevel=2
reproducerandomnumbers=no
BeginExternalInverter QUDA
MGNumberOfLevels = 3
MgEnableSizeThreeBlocks = no
MGCoarseMuFactor = 1.0, 1.0, 3.0
MGSetupMaxSolverIterations = 1000
MGSetupSolverTolerance = 5e-7
MGSetupSolver = cg
MGSmootherPreIterations = 0
MGSmootherPostIterations = 3
MGSmootherTolerance = 0.35
MGOverUnderRelaxationFactor = 0.85
MGCoarseSolverTolerance = 0.45
MGCoarseMaxSolverIterations = 50
MGRunVerify = no
EndExternalInverter
BeginOperator CLOVER
2kappamu = 0.000840387
csw = 1.74
kappa = 0.1400645
Solver = mg
useexternalinverter = quda
SolverPrecision = 1e-19
MaxSolverIterations = 500
usesloppyprecision = single
usecompression = 18
useevenodd = yes
EndOperator
BeginOperator CLOVER
2kappamu = -0.000840387
csw = 1.74
kappa = 0.1400645
Solver = mg
useexternalinverter = quda
SolverPrecision = 1e-19
MaxSolverIterations = 500
usesloppyprecision = single
usecompression = 18
useevenodd = yes
EndOperator
where the first, "up", operator would correspond to op_id=0 and the second, "down", operator to op_id=1.
Note that because the MG setup is refreshed when the sign of mu is changed, all "up" inversions should be done in one go and only then should the sign be switched. The same is true for increasing the quark mass. The first run of the MG should always be with the lightest mass to be inverted such that the setup is good.
The various parameters of the MG must be tuned for optimum performance. Code is available for doing this and will be documented below.