Add some documentation for memory_resource (#1217)

* Add some documentation for `memory_resource` Co-authored-by: Mark Harris <[email protected]> Co-authored-by: Jake Hemstad <[email protected]>
NVIDIA · Jan 19, 2024 · 9e3516d · 9e3516d
1 parent 5b0762a
commit 9e3516d
Show file tree

Hide file tree

Showing 9 changed files with 336 additions and 17 deletions.
diff --git a/libcudacxx/docs/_config.yml b/libcudacxx/docs/_config.yml
@@ -100,6 +100,14 @@ defaults:
       has_children: true
       has_toc: false
       nav_order: 6
+  -
+    scope:
+      path: extended_api/memory_resource.md
+    values:
+      parent: Extended API
+      has_children: true
+      has_toc: false
+      nav_order: 7
   -
     scope:
       path: releases/changelog.md

diff --git a/libcudacxx/docs/extended_api.md b/libcudacxx/docs/extended_api.md
@@ -21,6 +21,8 @@ nav_order: 3
 
 {% include_relative extended_api/functional.md %}
 
+{% include_relative extended_api/memory_resource.md %}
+
 {% include_relative extended_api/ptx.md %}
 
 [Thread Scopes]: ./extended_api/memory_model.md#thread-scopes

diff --git a/libcudacxx/docs/extended_api/memory_resource.md b/libcudacxx/docs/extended_api/memory_resource.md
@@ -0,0 +1,18 @@
+## memory_resource
+
+The `<cuda/memory_resource>` header provides a standard C++ interface for *heterogeneous*, *stream-ordered* memory allocation tailored to the needs of CUDA C++ developers. This design builds off of the success of the [RAPIDS Memory Manager (RMM)](https://github.com/rapidsai/rmm) project and evolves the design based on lessons learned. `<cuda/memory_resource>` is not intended to replace RMM, but instead moves the definition of the memory allocation interface to a more centralized home in CCCL. RMM will remain as a collection of implementations of the `cuda::mr` interfaces. 
+
+At a high level, the header provides:
+
+1. the [`cuda::get_property`] infrastructure to tag a user defined type with a given property;
+2. the [`cuda::mr::{async}_resource` and `cuda::mr::{async}_resource_with`] concepts that provide proper constraints for arbitrary memory resources;
+3. the [`cuda::mr::{async}_resource_ref`] is a type-erased memory resource wrapper that enables consumers to specify properties of resources that they expect.
+
+These features are an evolution of `std::pmr::memory_resource` that was introduced in C++17. While `std::pmr::memory_resource` provides a polymorphic memory resource that can be adopted through inheritance, it is not properly suited for heterogeneous systems. With the current design it ranges from cumbersome to impossible to verify whether a memory resource provides allocations that are e.g. accessible on device, or whether it can utilize other allocation mechanisms.
+
+To better support asynchronous CUDA [stream-ordered allocations](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-ordered-memory-allocator) libcu++ provides [`cuda::stream_ref`] as a wrapper around `cudaStream_t`. The definition of `cuda::stream_ref` can be found in the `<cuda/stream_ref>` header.
+
+[`cuda::get_property`]: {{ "extended_api/memory_resource/properties.html" | relative_url }}
+[`cuda::mr::{async}_resource` and `cuda::mr::{async}_resource_with`]: {{ "extended_api/memory_resource/resource.html" | relative_url }}
+[`cuda::mr::{async}_resource_ref`]: {{ "extended_api/memory_resource/resource_ref.html" | relative_url }}
+[`cuda::stream_ref`]: {{ "extended_api/memory_resource/stream_ref.html" | relative_url }}
diff --git a/libcudacxx/docs/extended_api/memory_resource/properties.md b/libcudacxx/docs/extended_api/memory_resource/properties.md
@@ -0,0 +1,76 @@
+---
+grand_parent: Extended API
+parent: memory_resource
+nav_order: 1
+---
+
+### Properties
+
+Modern C++ programs use the type system to verify statically known properties of the code. It is undesirable to use run-time assertions to verify whether a function that accesses device memory is passed a memory resource that provides device-accessible allocations. A property system is provided for this purpose.
+
+To tell the compiler that memory provided by `my_memory_resource` is device accessible, use the `cuda::mr::device_accessible` tag type with a free function `get_property` to declare that the resource provides device-accessible memory.
+
+```c++
+struct my_memory_resource {
+    friend constexpr void get_property(const my_memory_resource&, cuda::mr::device_accessible) noexcept {}
+};
+```
+
+A library can constrain interfaces with `cuda::has_property` to require that a passed memory resource provides the right kind of memory (See [here](https://godbolt.org/z/5hjoEnerb) for a minimal Compiler Explorer example).
+
+```c++
+template<class MemoryResource>
+    requires cuda::has_property<MemoryResource, cuda::mr::device_accessible>
+void function_that_dispatches_to_device(MemoryResource& resource);
+```
+
+If C++20 is not available, the function can instead be constrained via SFINAE (See [here](https://godbolt.org/z/11sGbr333) for a minimal Compiler Explorer example).
+
+```c++
+template<class MemoryResource, class = cuda::std::enable_if_t<cuda::has_property<MemoryResource, cuda::mr::device_accessible>>>
+void function_that_dispatches_to_device(MemoryResource& resource);
+```
+
+For now, libcu++ provides two commonly used properties: `cuda::mr::device_accessible` and `cuda::mr::host_accessible`. More properties may be added as the library and the hardware capabilities evolve. However, a user library is free to define as many properties as needed to fully cover its API surface.
+
+Note that the libcu++ provided properties are stateless. However, properties can also provide stateful information that is retrieved via the `get_property` free function. In order to communicate the desired type of the carried state, a stateful property must define the `value_type` alias. A library can constrain interfaces that require a stateful property with `cuda::has_property_with` as shown in the example below (See [here](https://godbolt.org/z/11sGbr333) for a minimal Compiler Explorer example).
+
+```c++
+struct required_alignment{
+    using value_type = std::size_t;
+};
+struct my_memory_resource {
+    friend constexpr std::size_t get_property(const my_memory_resource& resource, required_alignment) noexcept { return resource.required_alignment; }
+
+    std::size_t required_alignment;
+};
+static_assert(cuda::has_property_with<my_memory_resource, required_alignment, std::size_t>);
+
+template<class MemoryResource>
+void* allocate_check_alignment(MemoryResource& resource, std::size_t size) {
+    if constexpr(cuda::has_property_with<MemoryResource, required_alignment, std::size_t>) {
+        return resource.allocate(size, get_property(resource, required_alignment{}));
+    } else {
+        // Use default alignment
+        return resource.allocate(size, 42);
+    }
+}
+```
+
+In generic code it is often desireable to propagate properties from a base type to a derived type, without knowing the base type at all. This common use case is covered by `cuda::forward_property`, a simple CRTP template.
+
+```c++
+template<class MemoryResource>
+class logging_resource : cuda::forward_property<logging_resource<MemoryResource>, MemoryResource> {
+    MemoryResource base;
+public:
+    void* allocate(std::size_t size, std::size_t alignment) {
+        std::cout << "allocating\n";
+        return base.allocate(size, alignment);
+    }
+    void deallocate(void* ptr, std::size_t size, std::size_t alignment) {
+        std::cout << "deallocating\n";
+        return base.deallocate(ptr, size, alignment);
+    }
+}
+```
diff --git a/libcudacxx/docs/extended_api/memory_resource/resource.md b/libcudacxx/docs/extended_api/memory_resource/resource.md
@@ -0,0 +1,138 @@
+---
+grand_parent: Extended API
+parent: memory_resource
+nav_order: 2
+---
+
+### The `resource` concept
+
+The `std::pmr::memory_resource` feature provides only a single `allocate` interface, which is sufficient for homogeneous memory systems. However, CUDA provides both synchronous and [stream-ordered allocation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-ordered-memory-allocator).
+
+With `std::pmr::memory_resource` there is no way to tell whether a memory resource can utilize stream-ordered allocations. Even if the application knows it can, there is no way to properly tell the memory resource to use stream-ordered allocation. Ideally, this should not be something discovered through an assert at run time, but should be checked by the compiler.
+
+The `cuda::mr::resource` concept provides basic type checks to ensure that a given memory resource provides the expected `allocate`/`deallocate` interface and is also equality comparable, which covers the whole API surface of `std::pmr::memory_resource`. See below for different memory resources and potential pitfals.
+
+To demonstrate, the following example defines several resources, only some of which are valid implementations of the `cuda::mr::resource` concept. The `static_assertion`s will result in compile-time errors for the invalid resources.
+
+```c++
+struct valid_resource {
+  void* allocate(std::size_t, std::size_t) { return nullptr; }
+  void deallocate(void*, std::size_t, std::size_t) {}
+  bool operator==(const valid_resource&) const { return true; }
+  // NOTE: C++20 thankfully added operator rewrite rules so defining operator!= is not required.
+  // However, if compiled with C++14 / C++17, operator != must also be defined.
+  bool operator!=(const valid_resource&) const { return false; }
+};
+static_assert(cuda::mr::resource<valid_resource>, "");
+
+struct invalid_argument {};
+struct invalid_allocate_argument {
+  void* allocate(invalid_argument, std::size_t) { return nullptr; }
+  void deallocate(void*, std::size_t, std::size_t) {}
+  bool operator==(const invalid_allocate_argument&) { return true; }
+};
+static_assert(!cuda::mr::resource<invalid_allocate_argument>, "");
+
+struct invalid_allocate_return {
+  int allocate(std::size_t, std::size_t) { return 42; }
+  void deallocate(void*, std::size_t, std::size_t) {}
+  bool operator==(const invalid_allocate_return&) { return true; }
+};
+static_assert(!cuda::mr::resource<invalid_allocate_return>, "");
+
+struct invalid_deallocate_argument {
+  void* allocate(std::size_t, std::size_t) { return nullptr; }
+  void deallocate(void*, invalid_argument, std::size_t) {}
+  bool operator==(const invalid_deallocate_argument&) { return true; }
+};
+static_assert(!cuda::mr::resource<invalid_deallocate_argument>, "");
+
+struct non_comparable {
+  void* allocate(std::size_t, std::size_t) { return nullptr; }
+  void deallocate(void*, std::size_t, std::size_t) {}
+};
+static_assert(!cuda::mr::resource<non_comparable>, "");
+
+struct non_eq_comparable {
+  void* allocate(std::size_t, std::size_t) { return nullptr; }
+  void deallocate(void*, std::size_t, std::size_t) {}
+  bool operator!=(const non_eq_comparable&) { return false; }
+};
+static_assert(!cuda::mr::resource<non_eq_comparable>, "");
+```
+
+In addition to the `std::pmr::memory_resource` interface the `cuda::mr::async_resource` concept verifies that a memory resource also satisfies the `allocate_async` / `deallocate_async` interface. Requiring both the PMR interface and the async interface is a deliberate design decision.
+
+```c++
+struct valid_resource {
+  void* allocate(std::size_t, std::size_t) { return nullptr; }
+  void deallocate(void*, std::size_t, std::size_t) {}
+  void* allocate_async(std::size_t, std::size_t, cuda::stream_ref) { return nullptr; }
+  void deallocate_async(void*, std::size_t, std::size_t, cuda::stream_ref) {}
+  bool operator==(const valid_resource&) const { return true; }
+  bool operator!=(const valid_resource&) const { return false; }
+};
+static_assert(cuda::mr::async_resource<valid_resource>, "");
+```
+
+A library can easily decide whether to use the async interface:
+
+```c++
+template<class MemoryResource>
+    requires cuda::mr::resource<MemoryResource>
+void* maybe_allocate_async(MemoryResource& resource, std::size_t size, std::size_t align, cuda::stream_ref stream) {
+    if constexpr(cuda::mr::async_resource<MemoryResource>) {
+        return resource.allocate_async(size, align, stream);
+    } else {
+        return resource.allocate(size, align);
+    }
+}
+```
+
+### Putting them together
+
+Applications and libraries may want to combine type checks for arbitrary properties with the `{async_}resource` concept. The `{async_}resource_with` concept allows checking resources for arbitrary properties.
+
+```c++
+struct required_alignment{
+    using value_type = std::size_t;
+};
+struct my_memory_resource {
+    void* allocate(std::size_t, std::size_t) { return nullptr; }
+    void deallocate(void*, std::size_t, std::size_t) {}
+    bool operator==(const my_memory_resource&) const { return true; }
+    bool operator!=(const my_memory_resource&) const { return false; }
+
+    friend constexpr std::size_t get_property(const my_memory_resource& resource, required_alignment) noexcept { return resource.required_alignment; }
+
+    std::size_t required_alignment;
+};
+
+template<class MemoryResource>
+    requires cuda::mr::resource<MemoryResource>
+void* maybe_allocate_async_check_alignment(MemoryResource& resource, std::size_t size, cuda::stream_ref stream) {
+    if constexpr(cuda::mr::async_resource_with<MemoryResource, required_alignment>) {
+        return resource.allocate_async(size, get_property(resource, required_alignment), stream);
+    } else if constexpr (cuda::mr::async_resource<MemoryResource>) {
+        return resource.allocate_async(size, my_default_alignment, stream);
+    } else if constexpr (cuda::mr::resource_with<MemoryResource, required_alignment>) {
+        return resource.allocate(size, get_property(resource, required_alignment));
+    } else {
+        return resource.allocate(size, my_default_alignment);
+    }
+}
+
+// Potentially more concise
+template<class MemoryResource>
+    requires cuda::mr::resource<MemoryResource>
+void* maybe_allocate_async_check_alignment2(MemoryResource& resource, std::size_t size, cuda::stream_ref stream) {
+    constexpr std::size_t align = cuda::mr::resource_with<MemoryResource, required_alignment>
+                                ? get_property(resource, required_alignment)
+                                : my_default_alignment;
+    if constexpr(cuda::mr::async_resource<MemoryResource>) {
+        return resource.allocate_async(size, align, stream);
+    } else {
+        return resource.allocate(size, align);
+    }
+}
+```
diff --git a/libcudacxx/docs/extended_api/memory_resource/resource_ref.md b/libcudacxx/docs/extended_api/memory_resource/resource_ref.md
@@ -0,0 +1,49 @@
+---
+grand_parent: Extended API
+parent: memory_resource
+nav_order: 3
+---
+
+### `resource_ref`: a type-constrained resource wrapper
+
+With the property design depicted in [`cuda::get_property`](cuda::get_property), a library has flexibility in checking constraints and querying custom properties. However, there is also a cost in providing function templates for a potentially wide range of inputs. Depending on the number of different memory resources, both compile time and binary size might increase considerably.
+
+The type-erased `resource_ref` and `async_resource_ref` resource wrappers aid in efficiently coalescing such APIs into a single function.
+
+```c++
+void* do_allocate_async(cuda::mr::async_resource_ref<> resource, std::size_t size, std::size_t align, cuda::stream_ref stream) {
+    return resource.allocate_async(size, align, stream);
+}
+
+my_async_memory_resource resource;
+my_async_memory_resource* pointer_to_resource = &resource;
+
+void* from_reference = do_allocate_async(resource, 1337, 256, cuda::stream_ref{});
+void* from_ptr = do_allocate_async(pointer_to_resource, 1337, 256, cuda::stream_ref{});
+```
+
+Note that `do_allocate_async` is not a template anymore but a plain old function. The wrapper `cuda::mr::{async_}resource_ref<>` is constructible from any non-const reference or pointer to a memory resource that satisfies `cuda::mr::{async_}resource`.
+
+Properties may also be passed to `cuda::mr::{async_}resource_ref` just as with `cuda::mr::resource_with`.
+
+```c++
+struct required_alignment{};
+void* do_allocate_async_with_alignment(cuda::mr::async_resource_ref<required_alignment> resource, std::size_t size, cuda::stream_ref stream) {
+    return resource.allocate_async(size, cuda::mr::get_property(resource, required_alignment), stream);
+}
+```
+
+However, the type erasure comes with the cost that arbitrary properties cannot be queried:
+
+```c++
+struct required_alignment{};
+void* buggy_allocate_async_with_alignment(cuda::mr::async_resource_ref<> resource, std::size_t size, cuda::stream_ref stream) {
+    if constexpr (cuda::has_property<required_alignment>) { // BUG: This will always be false
+        return resource.allocate_async(size, cuda::mr::get_property(resource, required_alignment), stream);
+    } else {
+        return resource.allocate_async(size, my_default_alignment, stream);
+    }
+}
+```
+
+So, choose wisely. If your library has a well-defined set of fixed properties that you expect to always be available, then `cuda::mr::{async_}resource_ref` is an amazing tool to improve compile times and binary size. If you need a flexible interface then constraining through `cuda::mr::{async_}resource_with` is the proper solution.
diff --git a/libcudacxx/docs/extended_api/memory_resource/stream_ref.md b/libcudacxx/docs/extended_api/memory_resource/stream_ref.md
@@ -0,0 +1,23 @@
+---
+grand_parent: Extended API
+parent: memory_resource
+nav_order: 4
+---
+
+### `stream_ref`: a wrapper around a `cudaStream_t`
+
+CUDA [stream-ordered allocations](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-ordered-memory-allocator) rely on `cudaStream_t` as a handle to the cuda stream.
+
+However, as this is just an alias for a plain pointer type it carries with it a lot of common pitfals around implicit conversions from e.g `nullptr` or a literal `0`.
+
+These hard to spot bugs can be avoided through `cuda::stream_ref`, which is a simple wrapper around a `cudaStream_t` that prevents implicit conversions. It also provides the `wait()` and `ready()` member functions to facilitate waiting for a stream to finish and checking whether it is finished.
+
+```cpp
+    cudaStream_t stream;
+    cudaStreamCreate(&stream);
+    cuda::stream_ref ref{stream};
+
+    ref.wait();          // synchronizes the stream via cudaStreamSynchronize
+    assert(ref.ready()); // verifies that the stream has finished all operations via cudaStreamQuery
+    cudaStreamDestroy(stream);
+```
diff --git a/libcudacxx/include/cuda/memory_resource b/libcudacxx/include/cuda/memory_resource
@@ -82,14 +82,21 @@ class resource_ref {
 
 #ifdef LIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE
 
+// Needs to come first due to cuda_runtime_api
 #include <cuda/stream_ref>
 
-#include <cuda/std/concepts>
-#include <cuda/std/type_traits>
+#include <cuda/__cccl_config>
 
-#include <cuda/std/detail/__config>
+#if defined(_CCCL_IMPLICIT_SYSTEM_HEADER_GCC)
+#  pragma GCC system_header
+#elif defined(_CCCL_IMPLICIT_SYSTEM_HEADER_CLANG)
+#  pragma clang system_header
+#elif defined(_CCCL_IMPLICIT_SYSTEM_HEADER_MSVC)
+#  pragma system_header
+#endif // no system header
 
-#include <cuda/std/detail/__pragma_push>
+#include <cuda/std/concepts>
+#include <cuda/std/type_traits>
 
 #if _CCCL_STD_VER > 2011
 _LIBCUDACXX_BEGIN_NAMESPACE_CUDA
@@ -172,8 +179,7 @@ template <class _Derived, class _Upstream>
 using forward_property = __forward_property::__fn<_Derived, _Upstream>;
 
 /// class get_property
-/// \brief The \c get_property crtp temaplate simplifies the user facing side of forwarding properties
-///        We can always tell people to just derive from it to properly forward all properties
+/// \brief The \c get_property customization point ensures that `cuda::get_property` is always available
 _LIBCUDACXX_BEGIN_NAMESPACE_CPO(__get_property)
 struct __fn {
   _CCCL_EXEC_CHECK_DISABLE
@@ -622,10 +628,10 @@ struct device_accessible{};
 struct host_accessible{};
 
 } // namespace mr
+
 _LIBCUDACXX_END_NAMESPACE_CUDA
-#endif // _CCCL_STD_VER > 2011
 
-#include <cuda/std/detail/__pragma_pop>
+#endif // _CCCL_STD_VER > 2011
 
 #endif // LIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE