Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relax test for async memory pool IPC handle support #1130

Merged
merged 6 commits into from
Oct 14, 2022

Conversation

bdice
Copy link
Contributor

@bdice bdice commented Oct 13, 2022

Description

Replaces NVIDIA/nvidia-docker#1121. This PR resolves an inconsistency in the Python tests for IPC handle support. This issue has existed for at least a few months but the failure was not noticed in our nightly CI until switching to GitHub Actions. The core problem was that we were checking driver versions to determine support for IPC handles, which turned out to be a bad course of action. Instead, we should rely on checking for the feature support directly, with cudart.cudaDeviceGetAttribute(cudart.cudaDeviceAttr.cudaDevAttrMemoryPoolSupportedHandleTypes, device_id). This is handled by the C++ code in rmm, and we now defer to that logic in Python tests.

After some collaborative debugging, we found that our CI runners with driver 450 (CUDA 11.0) and newer CUDA toolkit versions like 11.4 and 11.5 were reporting the driver version incorrectly, and returned a value equal to the container's runtime version. This appears to stem from the same issue reported in NVIDIA/nvidia-container-toolkit#291 and NVIDIA/libnvidia-container#138. This seems to be due to how the cuda-compat package injects (forward?) compatibility support into containers. Both the driver and runtime claimed to be new enough to support IPC handles, which require CUDA 11.3, despite the driver being older than 11.3. This meant that the attempt to use an IPC handle was rejected by the C++ code at runtime and the Python test failed. The Python code no longer attempts to determine if IPC should be supported according to driver/runtime versions, because this is not valid for all the configurations in CI. Creating a valid check for IPC handle support in the Python layer is complicated, due to Docker driver version issues mentioned above, so we just ensure the test fails in a predictable way if the driver/runtime do not support IPC.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the Python Related to RMM Python API label Oct 13, 2022
@bdice bdice added bug Something isn't working non-breaking Non-breaking change labels Oct 14, 2022
@bdice bdice self-assigned this Oct 14, 2022
@bdice bdice marked this pull request as ready for review October 14, 2022 16:46
@bdice bdice requested a review from a team as a code owner October 14, 2022 16:46
@bdice
Copy link
Contributor Author

bdice commented Oct 14, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 5a6d7a6 into rapidsai:branch-22.12 Oct 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change Python Related to RMM Python API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants