The device_matrix library

device_matrix is a lightweight, transparent, object-oriented and templated C++ library that encapsulates CUDA memory objects (i.e., tensors) and defines common operations on them.

Requirements & installation

To build the library and manage dependencies, we use CMake (version 3.5 or higher). In addition, we rely on the following libraries:

  • CUDA (version 8 or higher preferred), and
  • glog (version 0.3.4 or higher).

The cnmem library is used for memory management. The tests are implemented using the googletest and googlemock frameworks. CMake will fetch and compile these libraries automatically as part of the build pipeline. Finally, you need a CUDA-compatible GPU in order to perform any computations.

To install device_matrix, the following instructions should get you started.

git clone https://github.com/cvangysel/device_matrix
cd device_matrix
mkdir build
cd build
cmake ..
make
make test
make install

Please refer to the CMake documentation for advanced options.

Examples

The following examples can also be found in the examples sub-directory of this repository and are compiled as part of the build process.

Matrix multiplication

#include <device_matrix/device_matrix.h>

#include <glog/logging.h>
#include <memory>

using namespace cuda;

int main(int argc, char* argv[]) {
    google::InitGoogleLogging(argv[0]);

    const cudaStream_t stream = 0; // default CUDA stream.

    std::unique_ptr<device_matrix<float32>> a(
        device_matrix<float32>::create(
            stream,
            {1.0, 2.0, 3.0, 4.0, 5.0, 6.0},
            2 /* num_rows */, 3 /* num_columns */));

    std::unique_ptr<device_matrix<float32>> b(
        device_matrix<float32>::create(
            stream,
            {7.0, 8.0, 9.0, 10.0, 11.0, 12.0},
            3 /* num_rows */, 2 /* num_columns */));

    device_matrix<float32> c(
        2 /* num_rows */, 2 /* num_columns */, stream);

    matrix_mult(stream,
                *a, CUBLAS_OP_N,
                *b, CUBLAS_OP_N,
                &c);

    cudaDeviceSynchronize();

    print_matrix(c);
}

Custom CUDA kernels

#include <device_matrix/device_matrix.h>

#include <glog/logging.h>
#include <memory>

using namespace cuda;

template <typename FloatT>
__global__
void inverse_kernel(FloatT* const input) {
    size_t offset = threadIdx.y * blockDim.x + threadIdx.x;
    input[offset] = -input[offset];
}

int main(int argc, char* argv[]) {
    google::InitGoogleLogging(argv[0]);

    const cudaStream_t stream = 0; // default CUDA stream.

    std::unique_ptr<device_matrix<float32>> a(
        device_matrix<float32>::create(
            stream,
            {1.0, 2.0, 3.0, 4.0, 5.0, 6.0},
            2 /* num_rows */, 3 /* num_columns */));

    LAUNCH_KERNEL(
        inverse_kernel
            <<<1, /* a single block */
               dim3(a->getRows(), a->getCols()), /* one thread per component */
               0,
               stream>>>(
            a->getData()));

    cudaDeviceSynchronize();

    print_matrix(*a);
}

Design principles

device_matrix was deliberately designed to be inflexible with regard to variable passing and assignment, as the lifetime of a device_matrix instance corresponds directly to the lifetime of the CUDA memory region it allocates. CUDA memory remains allocated exactly as long as its underlying device_matrix instance exists, and instances can only be passed as pointers or references. This gives the programmer total control over CUDA memory allocation, as it avoids garbage collection (e.g., Torch) and reference counting (e.g., shared_ptr), and allows CUDA memory usage to be optimized. The library uses cnmem for its memory management in order to avoid the performance issues caused by the recurrent re-allocation of memory blocks of a particular size.

To avoid the implicit allocation of on-device memory, any operation that results in a new allocation must be explicit about it. Most operations that produce a new result therefore reuse one of their inputs as the destination memory space (overwriting the original input values in the process!). Consequently, C++ operators that imply value modification were deliberately omitted.

The underlying CUDA memory space can easily be accessed by the library user. This allows the user to write arbitrary CUDA kernels that perform non-standard operations on CUDA objects in-place.

License

device_matrix is licensed under the MIT license. CUDA is a trademark of NVIDIA; please note that CUDA is licensed separately.

If you modify device_matrix in any way, please link back to this repository.
