This Truffle language exposes GPUs to the polyglot GraalVM. The goal is to
- Make data exchange between the host language and the GPUs efficient without burdening the programmer.
- Allow programmers to invoke existing GPU kernels from their host language.
Supported and tested GraalVM languages:
- Python
- JavaScript/NodeJS
- Ruby
- R
- Java
- C and Rust through the Graal Sulong Component
A description of GrCUDA and its features can be found in the GrCUDA documentation.
The bindings documentation contains a tutorial that shows how to bind precompiled kernels to callables, and how to compile and launch kernels.
Additional Information:
- GrCUDA: A Polyglot Language Binding for CUDA in GraalVM. NVIDIA Developer Blog, November 2019.
- GrCUDA: A Polyglot Language Binding. Presentation at Oracle CodeOne 2019, September 2019.
- Simplifying GPU Access. Presentation at NVIDIA GTC 2020, March 2020.
- DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime. Paper at IPDPS 2021 on the GrCUDA scheduler, May 2021. Video of the presentation.
GrCUDA can be used in the binaries of the GraalVM languages (`lli`, `graalpython`, `js`, `R`, and `ruby`).
The JAR file containing GrCUDA must be appended to the classpath or copied into `jre/languages/grcuda` (Java 8) or `languages/grcuda` (Java 11) of the GraalVM installation. Note that `--jvm` and `--polyglot` must be specified in both cases as well.
The following example shows how to create a GPU kernel and two device arrays in JavaScript (NodeJS) and invoke the kernel:
// build kernel from CUDA C/C++ source code
const kernelSource = `
__global__ void increment(int *arr, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    arr[idx] += 1;
  }
}`
const cu = Polyglot.eval('grcuda', 'CU') // get GrCUDA namespace object
const incKernel = cu.buildkernel(
  kernelSource,       // CUDA kernel source code string
  'increment',        // kernel name
  'pointer, sint32')  // kernel signature
// allocate device array
const numElements = 100
const deviceArray = cu.DeviceArray('int', numElements)
for (let i = 0; i < numElements; i++) {
  deviceArray[i] = i // ... and initialize on the host
}
// launch kernel in grid of 1 block with 128 threads
incKernel(1, 128)(deviceArray, numElements)
// print elements from updated array
for (const element of deviceArray) {
  console.log(element)
}
Run the script with:
$GRAALVM_DIR/bin/node --polyglot --jvm example.js
The expected output is:
1
2
...
100
The next example shows how to launch an existing compiled GPU kernel from Python. The CUDA kernel
__global__ void increment(int *arr, int n) {
  auto idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    arr[idx] += 1;
  }
}
is compiled using `nvcc --cubin` into a cubin file. The kernel function can then be loaded from the cubin and bound to a callable object in the host language, here Python.
import polyglot
num_elements = 100
cu = polyglot.eval(language='grcuda', string='CU')
device_array = cu.DeviceArray('int', num_elements)
for i in range(num_elements):
    device_array[i] = i
# bind to kernel from binary
inc_kernel = cu.bindkernel('kernel.cubin',
    'cxx increment(arr: inout pointer sint32, n: sint32)')
# launch kernel as 1 block with 128 threads
inc_kernel(1, 128)(device_array, num_elements)
for i in range(num_elements):
    print(device_array[i])
Compile the kernel and run the script with:
nvcc --cubin --generate-code arch=compute_75,code=sm_75 kernel.cu
$GRAALVM_DIR/bin/graalpython --polyglot --jvm example.py
The expected output is:
1
2
...
100
For more details on how to invoke existing GPU kernels, see the Documentation on polyglot kernel launches.
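For comparison with the JavaScript example above, the same kernel can also be compiled from source at runtime in Python via `cu.buildkernel`. The snippet below is a minimal sketch (not taken from the original documentation) that only reuses the API already shown in the two examples:

```python
import polyglot

kernel_source = """
__global__ void increment(int *arr, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        arr[idx] += 1;
    }
}"""

cu = polyglot.eval(language='grcuda', string='CU')
inc_kernel = cu.buildkernel(kernel_source, 'increment', 'pointer, sint32')

num_elements = 100
device_array = cu.DeviceArray('int', num_elements)
for i in range(num_elements):
    device_array[i] = i

# launch kernel in a grid of 1 block with 128 threads
inc_kernel(1, 128)(device_array, num_elements)
print(device_array[0], device_array[99])  # expected: 1 100
```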
GrCUDA can either be installed from an existing release, or built from the source files using the `mx` build tool.
In both cases, it is recommended to follow these extra steps to ensure that your installation is working properly.
GrCUDA can be downloaded as a binary JAR from grcuda/releases and manually copied into a GraalVM installation. The original version of GrCUDA is available here.
- Download GraalVM CE 22.1.0 for Linux (`graalvm-ce-java11-linux-amd64-22.1.0.tar.gz`) from GitHub and untar it in your installation directory.
cd <your installation directory>
wget https://github.com/graalvm/graalvm-ce-builds/releases/download/vm-22.1.0/graalvm-ce-java11-linux-amd64-22.1.0.tar.gz
tar xfz graalvm-ce-java11-linux-amd64-22.1.0.tar.gz
rm graalvm-ce-java11-linux-amd64-22.1.0.tar.gz
export GRAALVM_DIR=`pwd`/graalvm-ce-java11-22.1.0
- Download the GrCUDA JAR from grcuda/releases. If using the original release from NVIDIA, the latest features (e.g. the asynchronous scheduler, multi-GPU support) are not available.
cd $GRAALVM_DIR/languages
mkdir grcuda
cp <download folder>/grcuda.jar grcuda
- Test GrCUDA in Node.js from GraalVM (a Python version of this smoke test is shown after this list).
cd $GRAALVM_DIR/bin
./node --jvm --polyglot
> arr = Polyglot.eval('grcuda', 'int[5]')
[Array: null prototype] [ 0, 0, 0, 0, 0 ]
- Download other GraalVM languages.
cd $GRAALVM_DIR/bin
./gu available
./gu install python
./gu install R
./gu install ruby
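After installing `python` with `gu`, a similar smoke test can be run from graalpython. This is a minimal sketch mirroring the Node.js check above; run it with `$GRAALVM_DIR/bin/graalpython --jvm --polyglot`:

```python
# Minimal GrCUDA smoke test for graalpython, mirroring the Node.js check above.
import polyglot

# Allocate a device array of 5 integers through GrCUDA.
arr = polyglot.eval(language='grcuda', string='int[5]')
print([arr[i] for i in range(5)])  # expected: [0, 0, 0, 0, 0]
```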
If you want to build GrCUDA yourself, instead of using an existing release, you will need a couple of extra steps.
This section contains all the steps required to set up GrCUDA if your goal is to contribute to its development, or simply hack with it.
For simplicity, let's assume that your installation is done in your home directory, `~`.
If you are installing GrCUDA on a new machine, you can simply follow or execute `oci_setup/setup_machine_from_scratch.sh` first, and then `oci_setup/setup_graalvm.sh`. Here we repeat the same steps, with additional comments.
The installation process has been validated with CUDA 11.4 - 11.7 and Ubuntu 20.04.
The same `oci_setup` directory has a number of useful scripts to configure machines on OCI and easily use GrCUDA.
- First, download GraalVM 22.1 as above.
wget https://github.com/graalvm/graalvm-ce-builds/releases/download/vm-22.1.0/graalvm-ce-java11-linux-amd64-22.1.0.tar.gz
tar xfz graalvm-ce-java11-linux-amd64-22.1.0.tar.gz
rm graalvm-ce-java11-linux-amd64-22.1.0.tar.gz
export GRAALVM_DIR=~/graalvm-ce-java11-22.1.0
- To build GrCUDA, you also need a custom JDK that is used to build GraalVM.
wget https://github.com/graalvm/labs-openjdk-11/releases/download/jvmci-22.1-b01/labsjdk-ce-11.0.15+2-jvmci-22.1-b01-linux-amd64.tar.gz
tar xfz labsjdk-ce-11.0.15+2-jvmci-22.1-b01-linux-amd64.tar.gz
rm labsjdk-ce-11.0.15+2-jvmci-22.1-b01-linux-amd64.tar.gz
export JAVA_HOME=~/labsjdk-ce-11.0.15-jvmci-22.1-b01
- GrCUDA requires the `mx` build tool. Clone the `mx` repository and add its directory to `$PATH`, so that the `mx` command can be invoked from the command line. We check out the commit corresponding to the current GraalVM release.
git clone https://github.com/graalvm/mx.git
cd mx
git checkout 722b86b8ef87fbb297f7e33ee6014bbbd3f4a3a8
cd ..
- You might also want the source files for GraalVM CE, at the commit corresponding to the current release of GraalVM. This is not required for building, but if you want to modify GrCUDA's source code, it is useful to also have access to GraalVM's code.
git clone https://github.com/oracle/graal.git
cd graal
git checkout 84541b16ae8a8726a0e7d76c7179d94a57ed84ee
cd ..
- Last but not least, build GrCUDA
cd <directory containing this README>
./install.sh
- Set up your CUDA environment
- Install CUDA and the NVIDIA drivers, for example following the steps here
- Add the following to your environment (assuming you have installed CUDA in the default `/usr/local` location, and that you use the `nvcc` compiler). Add these lines to `~/.bashrc` to make them permanent.
export CUDA_DIR=/usr/local/cuda
export PATH=$PATH:$CUDA_DIR/bin
- Set up your GraalVM and GrCUDA environment
- Add the following to your environment (assuming you have installed the releases mentioned in steps 2 and 3). Add these lines to `~/.bashrc` to make them permanent.
export PATH=~/mx:$PATH
export JAVA_HOME=~/labsjdk-ce-11.0.15-jvmci-22.1-b01
export GRAAL_HOME=~/graalvm-ce-java11-22.1.0
export GRAALVM_HOME=$GRAAL_HOME
export PATH=$GRAAL_HOME/bin:$PATH
export PATH=$JAVA_HOME/bin:$PATH
export GRCUDA_HOME=~/grcuda
Run `source ~/.bashrc` to make the changes available.
- Install languages for GraalVM (optional, but recommended)
gu available
gu install native-image
gu install llvm-toolchain
gu install python
gu install nodejs
gu rebuild-images polyglot
- If Graalpython is installed, create a `virtualenv` for it
graalpython -m venv ~/graalpython_venv
source ~/graalpython_venv/bin/activate
- Recommended: install `numpy` in Graalpython (required for running GrCUDA benchmarks); a short interop check using `numpy` is sketched after this list
graalpython -m ginstall install setuptools;
graalpython -m ginstall install Cython;
graalpython -m ginstall install numpy;
- Run the GrCUDA unit tests using
mx unittest com.nvidia
# To run a specific test, you can use
mx unittest com.nvidia.grcuda.test.BuildKernelTest#testBuildKernelwithNFILegacytSignature
- Set up the `grcuda-data` submodule. The `grcuda-data` repository is used as a `git` submodule to store data, results, and plots for demos, benchmarks, and publications. You will need this submodule to run the full benchmark suite and some of the demos. To set up the submodule, follow this README.
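As referenced in the `numpy` step above, the following is a minimal sketch (not from the original docs) to check the graalpython + `numpy` + GrCUDA setup by copying a small `numpy` array into a GrCUDA device array element by element:

```python
import polyglot
import numpy as np

cu = polyglot.eval(language='grcuda', string='CU')

host_data = np.arange(10, dtype=np.int32)
device_array = cu.DeviceArray('int', len(host_data))

# Element-wise copy from the numpy array into the managed device array.
for i, value in enumerate(host_data):
    device_array[i] = int(value)

print([device_array[i] for i in range(len(host_data))])  # expected: [0, 1, ..., 9]
```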
To develop GrCUDA, you will greatly benefit from having an IDE that allows jumping between symbols and debugging individual tests. Here, we explain how to set up IntelliJ IDEA.
- Run `mx ideinit` from `$GRCUDA_HOME` to set up the IDE
- Open IDEA and select "open project", then open GrCUDA
- See this guide to configure the syntax checker: `File -> Settings -> Plugins -> Marketplace -> Search "Eclipse Code Formatter"` and install it
- In IntelliJ IDEA, install the Python plugin with `Settings -> Plugin -> Search "Python"`, then do `Project Structure -> SDKs -> Create a new Python 3.8 Virtual Environment`; it is used by `mx`
- Select the right JVM. It should automatically select your `$JAVA_HOME`. Otherwise, use `Project Structures -> Modules -> Set the Module SDK (under Dependencies)` to set the SDK of `mx` and its submodules to your Java SDK (e.g. `11`). You can pick either the `labsjdk` or `graalvm`.
  - This is also suggested by the `configure` option if you try to build the project in IntelliJ IDEA before setting these options. Set your project Java SDK (e.g. `11`) for those missing modules
- When building for the first time in IntelliJ IDEA, you might get errors like `cannot use --export for target 1.8`, which means that some package is being built with Java 8.
- For these packages, there are two possible solutions. Try either of them, and stick to the one that works for you:
  a. For those packages (look at the log to find them), manually specify a more recent SDK (e.g. `11`) as you did in the step above. If you get errors about missing symbols, follow IntelliJ's hints and export the requested packages
  b. Remove the exports: `File -> Settings -> Build ... -> Compiler -> Java Compiler`, then remove all the `--export` flags.
- To run tests:
  a. Go to `Run (top bar) -> Edit Configurations -> Edit configuration templates -> JUnit`
  b. (Not always necessary) By default, IDEA should use your `env`. If not, make sure it does. Update the `PATH` variable so that it can find `nvcc`, and export `$GRAAL_HOME`. See `setup_machine_from_scratch.sh` to find all the environment variables.
  c. Modify the template JUnit test configuration by adding `-Djava.library.path="$GRAAL_HOME/lib"` (for Java 11) to the VM options, so that `trufflenfi` can be found
  d. In IntelliJ IDEA, `Run -> Edit Configurations`. Create a new JUnit configuration set to `All in package`, with `com.nvidia.grcuda` as module and `com.nvidia.grcuda.test` selected below. Add `-Djava.library.path="$GRAAL_HOME/lib"` (or your version of GraalVM) if it's not already in the VM options. Specify the SDK by setting the GraalVM JRE in e.g. `$GRAAL_HOME`, if not specified already.
  e. If you change something in GrCUDA, rebuild it with `./install.sh` before running tests. That's because tests that use the GPU load the `.jar` in `$GRAAL_HOME`, which is updated by `./install.sh`
To measure the performance of GrCUDA on complex GPU applications, we have developed a custom benchmark suite, found in `projects/resources/python/benchmark`.
The benchmark suite includes those used in the DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime paper.
All commands below are executed from `$GRCUDA_HOME/projects/resources/python/benchmark`.
Run a single benchmark with custom settings
graalpython --jvm --polyglot --experimental-options --grcuda.ExecutionPolicy=async --grcuda.DependencyPolicy=with-const --grcuda.RetrieveNewStreamPolicy=always-new --grcuda.RetrieveParentStreamPolicy=disjoint benchmark_main.py -d -i 10 -n 4800 --no_cpu_validation --reinit false --realloc false -b b10
Run all benchmarks
graalpython --jvm --polyglot benchmark_wrapper.py -d -i 30
To run the CUDA version of all benchmarks, build it as follows. You might want to update the GPU architecture (the `-arch` flag) inside `$GRCUDA_HOME/projects/resources/cuda/Makefile` to reflect the hardware at your disposal.
cd $GRCUDA_HOME/projects/resources/cuda;
make
cd -;
Run the CUDA version of all benchmarks
graalpython --jvm --polyglot benchmark_wrapper.py -d -i 30 -c
To print the Java stack trace in case of exceptions, add the following flags to the Graalpython invocation:
graalpython --python.ExposeInternalSources --python.WithJavaStacktrace=1 --experimental-options <your-benchmark-command>
Profile a specific benchmark using `nvprof`. Running `nvprof` as `sudo` might not be required, see here.
The graalpython benchmark suite offers the `--nvprof` flag: if enabled, only the real computation is profiled (and not the benchmark initialization).
Additionally, provide `nvprof` with the flag `--csv` to get CSV output, and `--log-file bench-name_%p.csv` to store the result.
Omitting the flag `--print-gpu-trace` will print aggregated results.
Additional metrics can be collected by `nvprof` with e.g. `--metrics "achieved_occupancy,sm_efficiency"` (full list).
GPUs with architectures starting from Turing (e.g. GTX 1660 Super) no longer allow collecting metrics with `nvprof`; use `ncu` (link) and Nsight Compute (link) instead.
sudo /usr/local/cuda/bin/nvprof --profile-from-start off --print-gpu-trace --profile-child-processes /path/to/graalpython --jvm --polyglot --experimental-options --grcuda.InputPrefetch --grcuda.ForceStreamAttach --grcuda.RetrieveNewStreamPolicy=always-new --grcuda.ExecutionPolicy=async --grcuda.DependencyPolicy=with-const --grcuda.RetrieveParentStreamPolicy=disjoint benchmark_main.py -d -i 10 -n 4800 --no_cpu_validation --reinit false --realloc false -b b10d --block_size_1d 256 --block_size_2d 16 --nvprof
- Benchmarks are defined in the `projects/resources/python/benchmark/bench` folder, and you can create more benchmarks by inheriting from the `Benchmark` class (a sketch is shown after this list). Individual benchmarks are executed from `benchmark_main.py`, while running all benchmarks is done through `benchmark_wrapper.py`
- The output of benchmarks is stored in a JSON file (by default, located in `data/results`)
- The benchmarking suite, through `benchmark_main.py`, supports the following options
  - `-d`, `--debug`: print to the console the results and details of each benchmark. False by default
  - `-i`, `--num_iter`: number of times that each benchmark is executed, for each combination of options. 30 by default
  - `-o`, `--output_path`: full path to the file where results are stored. By default results are stored in `data/results`, and the file name is generated automatically
  - `--realloc`: if true, allocate new memory and rebuild the GPU code at each iteration. False by default
  - `--reinit`: if true, re-initialize the values used in each benchmark at each iteration. True by default
  - `-c`, `--cpu_validation`: if present, validate the result of each benchmark using the CPU (use `--no_cpu_validation` to skip it instead)
  - `-b`, `--benchmark`: run the benchmark only for the specified kernel. Otherwise run all benchmarks specified in `benchmark_main.py`
  - `-n`, `--size`: specify the input size of data used as input for each benchmark. Otherwise use the sizes specified in `benchmark_main.py`
  - `-r`, `--random`: initialize benchmarks randomly whenever possible. True by default
  - `--number_of_gpus`: number of GPUs employed for the computation
  - `--execution_policy`: if present, run the benchmark only with the selected execution policy
  - `--dependency_policy`: if present, run the benchmark only with the selected dependency policy
  - `--new_stream`: if present, run the benchmark only with the selected new stream policy
  - `--parent_stream`: if present, run the benchmark only with the selected parent stream policy
  - `--device_selection`: if present and the parent policy is data aware, run the benchmark only with the selected device selection heuristic
  - `--force_stream_attach`: if present, force association between arrays and CUDA streams
  - `--memory_advise_policy`: select a managed memory `memAdvise` flag, if multiple GPUs are available
  - `--prefetch`: if true, enable automatic prefetching in the benchmarks
  - `--block_size_1d`: number of threads per block when using 1D kernels
  - `--block_size_2d`: number of threads per block when using 2D kernels
  - `-g`, `--number_of_blocks`: number of blocks in the computation
  - `-p`, `--time_phases`: measure the execution time of each phase of the benchmark; note that this introduces overheads, and might influence the total execution time. Results for each phase are meaningful only for synchronous execution
  - `--timing`: if present, measure the execution time of each kernel
  - `--nvprof`: if present, enable profiling when using `nvprof`. For this option to have effect, run graalpython using `nvprof`, with flag `--profile-from-start off`
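As mentioned in the list above, new benchmarks are added by inheriting from the `Benchmark` class of the suite. The skeleton below is only an illustrative sketch: the hook names (`alloc`, `init`, `execute`, `cpu_validation`) and the constructor conventions are assumptions, so check the existing classes in the `bench` folder for the exact interface.

```python
# Illustrative sketch of a new benchmark; the base-class hooks used here are
# assumptions, see the existing benchmarks in bench/ for the real interface.
import polyglot
from benchmark import Benchmark  # hypothetical import path

SQUARE_KERNEL = """
__global__ void square(float *x, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        x[idx] = x[idx] * x[idx];
    }
}"""


class BenchmarkSquare(Benchmark):
    def alloc(self, size, block_size=None):
        # Build the kernel and allocate device memory once per configuration.
        cu = polyglot.eval(language="grcuda", string="CU")
        self.size = size
        self.block_size = block_size or 128
        self.x = cu.DeviceArray("float", size)
        self.square = cu.buildkernel(SQUARE_KERNEL, "square", "pointer, sint32")

    def init(self):
        # Re-initialize inputs at every iteration (see --reinit).
        for i in range(self.size):
            self.x[i] = 2.0

    def execute(self):
        # A single kernel launch; GrCUDA tracks dependencies and synchronizes
        # when the host reads back self.x.
        num_blocks = (self.size + self.block_size - 1) // self.block_size
        self.square(num_blocks, self.block_size)(self.x, self.size)
        return self.x[self.size - 1]

    def cpu_validation(self, gpu_result, reinit):
        # Optional CPU reference check, skipped with --no_cpu_validation.
        assert abs(gpu_result - 4.0) < 1e-6
```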
The automatic DAG scheduling of GrCUDA supports different settings that can be used for debugging or to simplify the dependency computation in some circumstances.
Starting from release 0.4.0, the automatic scheduler also supports the use of multiple GPUs available in the system.
Options can be provided at startup, using `--experimental-options --grcuda.OptionName=value` (a usage sketch follows the list below):
- `EnableComputationTimers`: measure the execution time of GPU computations; `false` by default
- `ExecutionPolicy`: this regulates the global scheduling policy; `async` uses the DAG for asynchronous parallel execution, while `sync` executes each computation synchronously and can be used for debugging or to measure the execution time of each kernel
- `DependencyPolicy`: choose how data dependencies between GrCUDA computations are computed; `with-const` considers read-only parameters, while `no-const` assumes that all arguments can be modified in a computation
- `RetrieveNewStreamPolicy`: choose how streams for new GrCUDA computations are created; `reuse` (the default) reuses free streams whenever possible, while `always-new` creates new streams every time a computation should use a stream different from its parent
- `RetrieveParentStreamPolicy`: choose how streams for new GrCUDA computations are obtained from parent computations; `same-as-parent` simply reuses the stream of one of the parent computations, while `disjoint` allows parallel scheduling of multiple child computations as long as their arguments are disjoint; `multigpu-disjoint` extends the previous policy with multi-GPU support and selects the best parent device for a given computation, while `multigpu-early-disjoint` first selects the ideal GPU for the input computation, then finds if any of the reusable streams is allocated on that device
- `InputPrefetch`: if `true`, prefetch the data on GPUs with architectures starting from Pascal. In most cases, it improves performance; `false` by default
- `ForceStreamAttach`: if `true`, force association between arrays and CUDA streams. `true` by default on architectures older than Pascal, to allow concurrent CPU/GPU computation. On architectures starting from Pascal, it can improve performance, but it's `false` by default
- `NumberOfGPUs`: set how many GPUs can be used during the computation (if fewer are available, the maximum number of GPUs in the system is used). 1 by default
- `DeviceSelectionPolicy`: choose the heuristic that manages how GPU computations are mapped to devices, if multiple GPUs are available. `single-gpu` by default; it supports 5 multi-GPU policies: `round-robin` simply rotates the scheduling between GPUs, `stream-aware` selects the device with fewer ongoing computations, `min-transfer-size` maximizes data locality, while `minmin-transfer-time` and `minmax-transfer-time` minimize respectively the minimum and the maximum total transfer time
- `DataThreshold`: when selecting a device with data-aware DeviceSelectionPolicies, such as `min-transfer-size`, do not give priority to devices that have less than this percentage of data already available. A lower percentage favors exploitation, a higher percentage favors exploration. 0.1 by default (10%)
- `MemAdvisePolicy`: select a managed memory `memAdvise` flag, if multiple GPUs are available. Options: `read-mostly`, `preferred-location`, `none` (default)
- `ExportDAG`: if present, dump the scheduling DAG in .dot format. Specify the destination path and the file name as the value of the option (e.g. `--grcuda.ExportDAG=../ExecutionDAG`)
- `BandwidthMatrix`: if present, set the location of the CSV file that contains the estimated bandwidth between each CPU and GPU in the system, employed by topology-aware DeviceSelectionPolicies. By default, this is taken from `$GRCUDA_HOME/projects/resources/connection_graph/datasets/connection_graph.csv`, which is automatically computed during GrCUDA installation
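As a concrete illustration of the options above, the following is a minimal sketch (using only the API shown in the earlier examples; the kernel, sizes, and launch configuration are illustrative) of a script whose two independent kernel launches the `async` execution policy can overlap, e.g. when run with `graalpython --jvm --polyglot --experimental-options --grcuda.ExecutionPolicy=async example.py`:

```python
import polyglot

cu = polyglot.eval(language='grcuda', string='CU')

SQUARE_KERNEL = """
__global__ void square(float *x, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        x[idx] = x[idx] * x[idx];
    }
}"""
square = cu.buildkernel(SQUARE_KERNEL, 'square', 'pointer, sint32')

n = 1000
x = cu.DeviceArray('float', n)
y = cu.DeviceArray('float', n)
for i in range(n):
    x[i] = 2.0
    y[i] = 3.0

# The two launches touch disjoint arrays, so with ExecutionPolicy=async (and
# RetrieveParentStreamPolicy=disjoint) they can be scheduled concurrently.
square(8, 128)(x, n)
square(8, 128)(y, n)

# Reading the results from the host synchronizes with the GPU computations.
print(x[0], y[0])  # expected: 4.0 9.0
```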
If you use GrCUDA in your research, please cite the following publication(s). Thank you!
Parravicini, A., Delamare, A., Arnaboldi, M., & Santambrogio, M. D. (2021, May). DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (pp. 111-120). IEEE.