opal/cuda: Handle stream-ordered allocations and assign primary device context #12835

Akshay-Venkatesh · 2024-09-30T22:34:45Z

Similar to #12757, this PR detects whether a given pointer belongs to a CUDA memory pool.

Additionally, we assign the primary device context to the calling thread when it has no device context, because VMM and memory pool pointers do not have a device context associated with them by design. Ideally we would release our references on the primary context by making as many cuDevicePrimaryCtxRelease calls as cuDevicePrimaryCtxRetain calls. Unfortunately, there is no good place to make the release call, because we don't know up front when the last reference to a given pointer will occur in the process lifetime (especially as we don't intercept the user's CUDA calls that free said memory). For this reason, there will be at most one unreleased reference against the primary device context per user process thread, which should not have any undesired effect on GPU resources.
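The retain-once behavior described above can be sketched without a GPU. This is only an illustration under stated assumptions, not the code in this PR: `fake_ctx_t`, `fake_primary_ctx_retain`, and `ensure_ctx` are hypothetical stand-ins, and real code would use cuCtxGetCurrent, cuDevicePrimaryCtxRetain, and cuCtxSetCurrent from the CUDA driver API instead.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the CUDA driver. Real code would call
 * cuCtxGetCurrent, cuDevicePrimaryCtxRetain and cuCtxSetCurrent. */
typedef struct fake_ctx { int device; } fake_ctx_t;

static fake_ctx_t primary_ctx[2] = { {0}, {1} }; /* one primary context per device */
static int retain_count[2];                      /* outstanding retains (not atomic: sketch only) */
static _Thread_local fake_ctx_t *current_ctx;    /* per-thread current context, NULL if none */

/* Models cuDevicePrimaryCtxRetain: takes a reference and returns the context. */
static fake_ctx_t *fake_primary_ctx_retain(int device)
{
    retain_count[device]++;
    return &primary_ctx[device];
}

/* Make sure the calling thread has a current context.
 * Returns 1 if a primary-context reference was taken, 0 otherwise. */
static int ensure_ctx(int device)
{
    if (current_ctx != NULL) {
        return 0; /* thread already has a context: no new retain */
    }
    /* The reference is intentionally never released (see the discussion above). */
    current_ctx = fake_primary_ctx_retain(device);
    return 1;
}
```

Because the retain only happens when the thread has no current context, repeated calls from the same thread take at most one reference, which is the bound the description relies on.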

bosilca

Have you checked that memory pools are supported in the setup (using the CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED attribute with cuDeviceGetAttribute)?

Akshay-Venkatesh · 2024-10-01T17:12:49Z

> Have you checked that memory pools are supported in the setup (using the CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED attribute with cuDeviceGetAttribute)?

Hi @bosilca, I have not. Are you suggesting we check the attribute and always return 0 from cuda_check_mpool if mpools are unsupported? If so, device attribute queries are more expensive, and I would say that the pointer attribute query we have effectively does the same thing with less overhead. But if we assume homogeneity of GPUs, then I agree that your suggestion would eliminate even the pointer query, provided we have a static variable initialized to reflect mpool support. The question then is whether we can assume homogeneity. What are your thoughts on this?

bosilca · 2024-10-01T17:48:05Z

We can determine all that up front. Check for mpool support once for each visible device, generate a warning if they differ, and activate the most generic checks for the pointers. If the devices are homogeneous, we can remove some of the checks.

I wonder how many heterogeneous setups are really used for anything other than toys?
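A minimal sketch of the up-front probe suggested here, assuming a hypothetical callback in place of the driver query. `mpool_query_fn`, `probe_mpool_support`, and the two stub queries are illustrative names only; in real code the callback could wrap cuDeviceGetAttribute with CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED.

```c
#include <stdio.h>

/* Hypothetical callback: returns 1 if `device` supports memory pools, else 0.
 * Real code could wrap cuDeviceGetAttribute with
 * CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED here. */
typedef int (*mpool_query_fn)(int device);

enum mpool_support { MPOOL_NONE, MPOOL_ALL, MPOOL_MIXED };

/* Query every visible device once and classify the setup. */
static enum mpool_support probe_mpool_support(int device_count, mpool_query_fn query)
{
    int supported = 0;

    for (int dev = 0; dev < device_count; dev++) {
        if (query(dev)) {
            supported++;
        }
    }
    if (supported == 0) {
        return MPOOL_NONE;  /* also covers device_count == 0 */
    }
    if (supported == device_count) {
        return MPOOL_ALL;   /* homogeneous: some per-pointer checks could be skipped */
    }
    fprintf(stderr, "warning: memory pool support differs across devices\n");
    return MPOOL_MIXED;     /* heterogeneous: keep the most generic pointer checks */
}

/* Stub queries used only to exercise the sketch. */
static int all_supported(int device) { (void) device; return 1; }
static int only_device0(int device) { return device == 0; }
```

With these stubs, `probe_mpool_support(4, all_supported)` classifies the setup as homogeneous, while `probe_mpool_support(2, only_device0)` prints the warning and reports a mixed setup.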

Akshay-Venkatesh · 2024-10-01T18:03:51Z

> I wonder how many heterogeneous setups are really used for anything other than toys?

Indeed, this will not be common.

bosilca · 2024-10-03T17:44:02Z

opal/mca/accelerator/cuda/accelerator_cuda.c

+    CUmemAccess_flags flags;
+    CUmemLocation location;
+
+    if (device_count == -1) {


The logic here is a little too lax for the case with 0 devices. If I understand the code correctly, without a device (i.e. without a valid device 0) the call to cuDeviceGetAttribute will never change mpool_supported, so the cuDeviceGetAttribute check will be executed every time. You should bail out of this function if device_count is 0.
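One way to read this suggestion, sketched with counting stubs instead of the driver. Only `device_count` and `mpool_supported` come from the review; `fake_get_device_count`, `fake_get_mpool_attr`, and `mpool_checks_enabled` are hypothetical, and real code would call cuDeviceGetCount and cuDeviceGetAttribute.

```c
/* Hypothetical driver hooks with call counters. Real code would call
 * cuDeviceGetCount and cuDeviceGetAttribute instead. */
static int fake_devices = 0; /* number of devices the fake driver reports */
static int count_calls = 0;
static int attr_calls = 0;

static int fake_get_device_count(void) { count_calls++; return fake_devices; }
static int fake_get_mpool_attr(int device) { (void) device; attr_calls++; return 1; }

static int device_count = -1;    /* -1: not queried yet */
static int mpool_supported = -1; /* -1: not queried yet */

/* Returns 1 if memory pool checks are worth doing, 0 otherwise. */
static int mpool_checks_enabled(void)
{
    if (device_count == -1) {
        device_count = fake_get_device_count(); /* queried once, then cached */
    }
    if (device_count == 0) {
        return 0; /* bail out: no device, so never query the attribute */
    }
    if (mpool_supported == -1) {
        mpool_supported = fake_get_mpool_attr(0); /* queried once, then cached */
    }
    return mpool_supported;
}
```

With zero devices, every call returns early after the single cached device-count query, so the attribute query is never repeated.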

Akshay-Venkatesh requested review from bosilca and janjust September 30, 2024 22:34

github-actions bot added the Target: main label Sep 30, 2024

Akshay-Venkatesh force-pushed the topic/handle-masync-assign-ctx branch from a251ab3 to 419e9b7 Compare September 30, 2024 22:40

bosilca reviewed Oct 1, 2024

Akshay-Venkatesh force-pushed the topic/handle-masync-assign-ctx branch from 419e9b7 to 5328616 Compare October 2, 2024 21:53

Akshay-Venkatesh requested a review from bosilca October 2, 2024 21:53

Akshay-Venkatesh mentioned this pull request Oct 3, 2024

opal/cuda: Handle stream-ordered allocations and assign primary device context #12841

Merged

bosilca reviewed Oct 3, 2024

Akshay-Venkatesh force-pushed the topic/handle-masync-assign-ctx branch from 5328616 to 6516714 Compare October 3, 2024 18:36

janjust approved these changes Oct 3, 2024

opal/cuda: Handle stream-ordered allocations and assign primary devic…

cafcce9

…e context Signed-off-by: Akshay Venkatesh <[email protected]>

Akshay-Venkatesh force-pushed the topic/handle-masync-assign-ctx branch from 6516714 to cafcce9 Compare October 3, 2024 20:53

bosilca approved these changes Oct 3, 2024

Akshay-Venkatesh mentioned this pull request Oct 3, 2024

5.0.x/opal/cuda: Handle stream-ordered allocations and assign primary device #12843

Open

janjust approved these changes Oct 4, 2024

janjust merged commit 041a904 into open-mpi:main Oct 4, 2024
14 checks passed
