`benchmark.py` implements a minimalist framework to evaluate the performance of a Devito Operator while varying:
- the problem size (e.g., shape of the computational grid);
- the discretization (e.g., space- and time-order of the input/output fields);
- the simulation time (in milliseconds);
- the performance optimization level;
- the autotuning level.
Running `python benchmark.py --help` will display a list of useful options.
If you are tempted to use your laptop to run a benchmark, you may want to reconsider: heat and power management may affect the results in unpredictable ways.
To get reliable and deterministic results, it is important that both the Python process running Devito (or processes, if running with MPI) and the OpenMP threads spawned while running an Operator are pinned to specific CPU cores. There are several ways to achieve this:
- Through environment variables. All MPI/OpenMP distributions provide a set of environment variables to control process/thread pinning. Devito also supplies the `set_omp_pinning.sh` program (under `/scripts`), which helps with thread pinning (though currently limited to Intel architectures).
- Through a program such as `numactl` or `taskset` (see the sketch below).
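For example, the two approaches can be combined as follows (a sketch only: the node and core numbers are placeholders, and the exact environment variables depend on the OpenMP/MPI runtime in use):

```
# Pin OpenMP threads to cores via the standard OpenMP environment variables
export OMP_PROC_BIND=close
export OMP_PLACES=cores

# Pin the Python process, and its memory, to NUMA node 0 ...
numactl --cpubind=0 --membind=0 python benchmark.py ...

# ... or pin it to an explicit range of cores
taskset -c 0-15 python benchmark.py ...
```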
If running on a NUMA system, where multiple nodes of CPU cores ("sockets") and memory are available, pinning becomes even more important (and therefore deserves extra attention). On a NUMA system, a core can access both local and remote memory nodes, but accesses to local memory have lower latency. Thus, if a process/thread is pinned to a given core, it is important that as much of the accessed data as possible is allocated in local memory (ideally, the entire working set). There are multiple scenarios worth considering:
- Purely sequential run (no OpenMP, no MPI). Use `numactl` or `taskset` to pin the Python process. On a NUMA system, this also ensures that all data gets allocated in local memory.
- OpenMP-only run. When encountering a parallel region, an Operator will spawn a certain number of OpenMP threads. By default, as many threads as available logical cores are created. This can be changed by setting the OpenMP-standard `OMP_NUM_THREADS` environment variable to a different value. When might we want to do this?
  - Unless on a hyperthread-centric system, such as an Intel Knights Landing, spawning only as many threads as physical cores usually results in slightly better performance, due to less contention for hardware resources.
  - Since here we are merely interested in benchmarking, when running on a NUMA system we should restrain an Operator to run on a single node of CPU cores (i.e., a single socket), as in practice one should always use MPI across multiple sockets (hence resorting to the MPI+OpenMP mode). There is also one caveat: we need to make sure that the Python process runs on the same socket as the OpenMP threads, otherwise some data might get allocated on remote memory. For this, `numactl` or `taskset` are our friends.
- MPI-only run. Process pinning should be enforced via the relevant MPI environment variables.
- MPI+OpenMP. The typical execution mode is: one MPI process per socket, and each MPI process spawns a set of OpenMP threads upon entering an OpenMP parallel region. Pinning is typically enforced via environment variables; a sketch is shown below.
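For instance, a hybrid MPI+OpenMP launch on a dual-socket node might be sketched as follows (the thread count is a placeholder, and the pinning flags are Open MPI's; other MPI distributions use different flags or environment variables):

```
export OMP_NUM_THREADS=16     # one thread per physical core of a socket (placeholder value)
export OMP_PROC_BIND=close
export OMP_PLACES=cores

# One MPI rank per socket, each rank (and its threads) bound to its own socket
mpirun -n 2 --map-by socket --bind-to socket python benchmark.py ...
```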
Some more information about pinning is available here.
There are many ways one can check that pinning is working as expected. A recommended tool for rapid visual inspection is `htop`.
To switch on multi-threaded execution of Operators via OpenMP, the following environment variable must be set:

```
DEVITO_LANGUAGE=openmp
```

One has two options: either set it explicitly or prepend it to the Python command. In the former case, assuming a bash shell:

```
export DEVITO_LANGUAGE=openmp
```

In the latter case:

```
DEVITO_LANGUAGE=openmp python benchmark.py ...
```
To switch on MPI, one should set `DEVITO_MPI=1` and run with:

```
mpirun -n number_of_processes python benchmark.py ...
```
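For example, everything on one command line (4 ranks is just a placeholder):

```
DEVITO_MPI=1 mpirun -n 4 python benchmark.py run -P acoustic -d 512 512 512 -so 12 --tn 100
```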
Devito supports multiple MPI schemes for halo exchange. See the Tips section below.
`benchmark.py` allows one to set the optimization mode, as well as several optimization options, via the `--opt` argument. Please refer to this notebook for a comprehensive list of all optimization modes and options available in Devito. You may also want to take a look at the example command lines a few sections below.
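As a sketch, the flag can be passed as follows (here with `advanced`, Devito's default optimization level, so this is equivalent to omitting `--opt` altogether):

```
python benchmark.py run -P acoustic -d 512 512 512 -so 12 --tn 100 --opt advanced
```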
Auto-tuning can significantly improve the run-time performance of an Operator. It can be enabled on a per-Operator basis:

```
op = Operator(...)
op.apply(autotune=True)
```

The auto-tuner will discover a suitable block shape for each blocked loop nest in the generated code.
With `autotune=True`, the auto-tuner operates in `basic` mode, which only attempts a small batch of block shapes. With `autotune='aggressive'`, the auto-tuning phase will likely take more time, but it will also evaluate more block shapes.

By default, `benchmark.py` runs Operators with auto-tuning in aggressive mode, that is as `op.apply(autotune='aggressive')`. This can be changed with the `-a/--autotune` flag. In particular, `benchmark.py` uses the so-called "pre-emptive" auto-tuning, which implies two things:
- The Operator's output fields are copied, and the Operator will write to these copies while auto-tuning. So the memory footprint is temporarily larger during this phase.
- The auto-tuning phase produces values that are eventually ditched; afterwards, the actual computation takes place. The execution time of the latter does not include auto-tuning.
Note that in production runs one should/would rather use the so-called "runtime auto-tuning":

```
op.apply(autotune=('aggressive', 'runtime'))
```

in which auto-tuning, as the name suggests, is performed during the first N timesteps of the actual computation, after which the best block shapes are selected and used for all remaining timesteps.
The "backend compiler" takes as input the code generated by Devito and
translates it into a shared object. Supported backend compilers are gcc
,
icc
, pgcc
, clang
. For each of these compilers, Devito uses some preset compilation
flags (e.g., -O3, -march=native, etc).
The default backend compiler is `gcc`. To change it, one should set the `DEVITO_ARCH` environment variable to a different value; run

```
from devito import print_defaults
print_defaults()
```

to get all possible `DEVITO_ARCH` values.
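For example, to switch to the Intel compiler (assuming it is installed on the system):

```
export DEVITO_ARCH=icc
python benchmark.py ...
```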
Run with `DEVITO_LOGGING=DEBUG` to find out the specific performance optimizations applied by an Operator, how auto-tuning is getting along, and to emit more performance metrics.
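For example, prepending it to one of the command lines shown further below:

```
DEVITO_LOGGING=DEBUG python benchmark.py run -P acoustic -d 512 512 512 -so 12 --tn 100
```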
- The most powerful MPI mode is called "full", and is activated by setting `DEVITO_MPI=full` instead of `DEVITO_MPI=1`. The "full" mode supports computation/communication overlap (see the sketch after this list).
- When auto-tuning is enabled, one should always run in performance mode:

  ```
  from devito import mode_performance
  mode_performance()
  ```

  This is automatically turned on by `benchmark.py`.
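For example, a sketch of a "full"-mode run (4 ranks is just a placeholder):

```
DEVITO_MPI=full mpirun -n 4 python benchmark.py ...
```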
The isotropic acoustic wave forward Operator on a `512**3` grid, with space order 12 and a simulation time of 100ms:

```
python benchmark.py run -P acoustic -d 512 512 512 -so 12 --tn 100
```
Like before, but with auto-tuning in `basic` mode:

```
python benchmark.py run -P acoustic -d 512 512 512 -so 12 -a basic --tn 100
```
It is also possible to run a TTI forward operator -- here on a 512x402x890 grid:

```
python benchmark.py run -P tti -d 512 402 890 -so 12 -a basic --tn 100
```
Same as before, but telling Devito not to use temporaries to store the intermediate values stemming from mixed derivatives:

```
python benchmark.py run -P tti -d 512 402 890 -so 12 -a basic --tn 100 --opt "('advanced', {'cire-mingain': 1000000})"
```
Do not forget to pin processes, especially on NUMA systems; below, we use `numactl` to pin processes and threads to one specific NUMA domain:

```
numactl --cpubind=0 --membind=0 python benchmark.py ...
```

While a benchmark is running, you can have some useful programs running in the background in other shells. For example, to monitor pinning:

```
htop
```

or to keep the memory footprint under control:

```
watch numastat -m
```
As of Devito v3.5, it is possible to customize the code generated by Devito. This is often referred to as the "JIT backdoor" mode.

With `benchmark.py` we can exploit this feature to manually hack and test the code generated for a given benchmark. So, we first run a problem, for example:

```
python benchmark.py run-jit-backdoor -P acoustic -d 512 512 512 -so 12 --tn 100
```
As you may expect, the `run-jit-backdoor` mode accepts exactly the same arguments as the `run` mode. Eventually, you will see a message along the lines of:

```
You may now edit the generated code in
`/tmp/devito-jitcache-uid1000/31e8d25408f369754e2b7a26f4439944dc7683e2.c`. Then
save the file, and re-run this benchmark.
```
At this point, just follow the instructions on screen. The next time you run the benchmark, the modified C code will be re-compiled and executed. Thus, you will see the performance impact of your changes.
`benchmark.py` can be used to evaluate MPI on multi-node systems:

```
mpiexec python benchmark.py ...
```

In `bench` mode, each MPI rank will produce a different `.json` file summarizing the achieved performance in a structured format.
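A sketch of a multi-node run in `bench` mode might look as follows (the rank count and problem parameters are placeholders; this assumes `bench` accepts the same problem flags as `run`):

```
DEVITO_MPI=1 mpiexec -n 8 python benchmark.py bench -P acoustic -d 512 512 512 -so 12 --tn 100
```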
Further, we provide `make-pbs.py`, a simple program to generate PBS files to submit jobs on HPC clusters. Take a look at `python make-pbs.py --help` for more information, and in particular `python make-pbs.py generate --help`. `make-pbs.py` is especially useful for running strong scaling experiments.
The GFlops/s and GPoints/s performance, Operational Intensity (OI) and execution time are emitted to standard output at the end of each run. You may find this FAQ useful.