Relax test for async memory pool IPC handle support #1130
Merged
Description
Replaces NVIDIA/nvidia-docker#1121. This PR resolves an inconsistency in the Python tests for IPC handle support. The issue has existed for at least a few months, but the failure went unnoticed in our nightly CI until we switched to GitHub Actions. The core problem was that we checked driver versions to determine support for IPC handles, which turned out to be unreliable. Instead, we should check for feature support directly with `cudart.cudaDeviceGetAttribute(cudart.cudaDeviceAttr.cudaDevAttrMemoryPoolSupportedHandleTypes, device_id)`. This is handled by the C++ code in rmm, and we now defer to that logic in the Python tests.

After some collaborative debugging, we found that our CI runners with driver 450 (CUDA 11.0) and newer CUDA toolkit versions such as 11.4 and 11.5 were reporting the driver version incorrectly, returning a value equal to the container's runtime version. This appears to stem from the same issue reported in NVIDIA/nvidia-container-toolkit#291 and NVIDIA/libnvidia-container#138, and seems to be due to how the `cuda-compat` package injects (forward?) compatibility support into containers. Both the driver and runtime claimed to be new enough to support IPC handles, which require CUDA 11.3, despite the driver being older than 11.3. As a result, the attempt to use an IPC handle was rejected by the C++ code at runtime and the Python test failed.

The Python code no longer attempts to determine whether IPC should be supported from driver/runtime versions, because this is not valid for all the configurations in CI. Creating a valid check for IPC handle support in the Python layer is complicated by the Docker driver version issues mentioned above, so we just ensure the test fails in a predictable way if the driver/runtime do not support IPC.
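
For illustration, the direct feature query mentioned in the description could look roughly like the sketch below. It is not the code in this PR: the `HandleType` enum and both helper functions are hypothetical names introduced here, the flag values are assumed to mirror CUDA's `cudaMemAllocationHandleType`, and the attribute is assumed to be returned as a bitmask of supported handle types. The query itself needs the `cuda-python` package and a visible GPU, so the bitmask decoding is kept as a separate pure function.

```python
from enum import IntFlag


class HandleType(IntFlag):
    """Hypothetical mirror of CUDA's cudaMemAllocationHandleType values."""
    NONE = 0x0
    POSIX_FILE_DESCRIPTOR = 0x1
    WIN32 = 0x2
    WIN32_KMT = 0x4


def supports_handle_type(supported_mask: int, handle_type: HandleType) -> bool:
    """Return True if the bit for handle_type is set in supported_mask."""
    return bool(int(supported_mask) & int(handle_type))


def query_supported_handle_types(device_id: int = 0) -> int:
    """Ask the CUDA runtime which memory pool handle types the device supports.

    Assumes cuda-python is installed and a GPU is available; the import is
    deferred so the rest of the module works without it.
    """
    from cuda import cudart

    err, mask = cudart.cudaDeviceGetAttribute(
        cudart.cudaDeviceAttr.cudaDevAttrMemoryPoolSupportedHandleTypes,
        device_id,
    )
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"cudaDeviceGetAttribute failed: {err}")
    return mask
```

On a machine with a GPU, `supports_handle_type(query_supported_handle_types(0), HandleType.POSIX_FILE_DESCRIPTOR)` would report whether file-descriptor IPC handles are available, without consulting driver or runtime version numbers.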