Relax test for async memory pool IPC handle support #1130

bdice · 2022-10-13T22:00:38Z

Description

Replaces #1121. This PR resolves an inconsistency in the Python tests for IPC handle support. The issue has existed for at least a few months, but the failure went unnoticed in our nightly CI until we switched to GitHub Actions. The core problem was that we checked driver versions to determine support for IPC handles, which turned out to be unreliable. Instead, we should check for feature support directly, with cudart.cudaDeviceGetAttribute(cudart.cudaDeviceAttr.cudaDevAttrMemoryPoolSupportedHandleTypes, device_id). The C++ code in rmm already does this, and the Python tests now defer to that logic.
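For reference, the direct feature check could look roughly like the sketch below. It is not the code in rmm. The helper names are hypothetical, the handle type constants are copied from the CUDA runtime's `cudaMemAllocationHandleType` enum, and the query function assumes cuda-python is installed with the `from cuda import cudart` import path.

```python
# Values of the CUDA runtime enum cudaMemAllocationHandleType.
CUDA_MEM_HANDLE_TYPE_NONE = 0x0
CUDA_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR = 0x1
CUDA_MEM_HANDLE_TYPE_WIN32 = 0x2
CUDA_MEM_HANDLE_TYPE_WIN32_KMT = 0x4


def handle_type_supported(supported_mask, handle_type):
    """Return True if every bit of handle_type is set in supported_mask.

    Hypothetical helper: requesting no export handle type is always fine.
    """
    if handle_type == CUDA_MEM_HANDLE_TYPE_NONE:
        return True
    return (supported_mask & handle_type) == handle_type


def query_supported_handle_types(device_id=0):
    """Ask the CUDA runtime which memory pool handle types a device supports.

    Assumes cuda-python is installed. The import is lazy so that
    handle_type_supported stays usable without it.
    """
    from cuda import cudart

    err, mask = cudart.cudaDeviceGetAttribute(
        cudart.cudaDeviceAttr.cudaDevAttrMemoryPoolSupportedHandleTypes,
        device_id,
    )
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"cudaDeviceGetAttribute failed: {err}")
    return mask
```

On a machine with a GPU, `handle_type_supported(query_supported_handle_types(0), CUDA_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR)` would answer the question without consulting any version numbers.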

After some collaborative debugging, we found that our CI runners with driver 450 (CUDA 11.0) and newer CUDA toolkit versions such as 11.4 and 11.5 reported the driver version incorrectly, returning a value equal to the container's runtime version. This appears to stem from the same issue reported in NVIDIA/nvidia-container-toolkit#291 and NVIDIA/libnvidia-container#138, and seems to be due to how the cuda-compat package injects (forward?) compatibility support into containers. Both the driver and runtime claimed to be new enough to support IPC handles, which require CUDA 11.3, even though the actual driver was older than 11.3. As a result, the C++ code rejected the attempt to use an IPC handle at runtime and the Python test failed. The Python code no longer tries to infer IPC support from driver/runtime versions, because that inference is not valid for all the configurations in CI. Writing a valid check for IPC handle support in the Python layer is complicated by the Docker driver version issues mentioned above, so we just ensure the test fails in a predictable way if the driver/runtime do not support IPC.
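The "fail in a predictable way" pattern can be sketched as follows. This is an illustration, not the test in this PR. `ipc_resource_or_expected_error` and its arguments are hypothetical names, and the error suffix is supplied by the caller because the exact message raised by the C++ layer is not reproduced here.

```python
def ipc_resource_or_expected_error(make_resource, expected_suffix):
    """Try to build an IPC-enabled resource; tolerate one known failure.

    make_resource: zero-argument callable that constructs the resource
        (a stand-in for whatever creates the async memory resource).
    expected_suffix: the tail of the error message that signals
        "IPC handles are not supported on this driver/runtime".

    Returns the resource on success, or None if construction failed with
    the expected error. Any other RuntimeError message fails the assertion.
    """
    try:
        resource = make_resource()
    except RuntimeError as e:
        # Unsupported configurations must fail with the known message.
        assert str(e).endswith(expected_suffix), f"unexpected error: {e}"
        return None
    return resource
```

A test would call this helper, exercise allocations only when it returns a resource, and otherwise pass, so the outcome does not depend on the driver version reported inside the container.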

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

bdice · 2022-10-14T22:33:43Z

@gpucibot merge

bdice added 2 commits October 13, 2022 16:53

Fix typo.

b53bf04

Rewrite test to accept more error cases.

4e49986

github-actions bot added the Python Related to RMM Python API label Oct 13, 2022

bdice added 2 commits October 14, 2022 11:11

Remove Python side error check.

497d8b0

Catch correct error type.

13c026e

bdice added bug Something isn't working non-breaking Non-breaking change labels Oct 14, 2022

bdice self-assigned this Oct 14, 2022

Add TODO.

cc3bebf

bdice marked this pull request as ready for review October 14, 2022 16:46

bdice requested a review from a team as a code owner October 14, 2022 16:46

Update string assertion.

2eff538

ajschmidt8 mentioned this pull request Oct 14, 2022

Fix determination of IPC support in rmm Python and test suite #1121

Closed

3 tasks

shwina approved these changes Oct 14, 2022


rapids-bot bot merged commit 5a6d7a6 into rapidsai:branch-22.12 Oct 14, 2022

bdice mentioned this pull request Oct 25, 2022

HOTFIX: Update cuda-python dependency to 11.7.1 #1136

Merged
