Replace use of custom CUDA bindings with CUDA-Python #930

Merged
merged 18 commits into from
Jan 19, 2022

Conversation

shwina
Contributor

@shwina shwina commented Dec 3, 2021

This PR removes many of the custom CUDA bindings we wrote in RMM to support calls to the driver/runtime APIs from Python in downstream libraries (cudf, cuml, cugraph). We should now use CUDA Python instead.

However, the module rmm._cuda.gpu is not being removed. It has been converted from an extension module (.pyx) to a regular .py module. This module contains high-level wrappers around raw CUDA bindings, with some niceties like converting errors to exceptions with the appropriate error message. Reimplementing that functionality in each downstream library would be a bad idea. When CUDA Python rolls its own higher-level API, we can remove the gpu module as well.
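A minimal sketch of the wrapper pattern described above, assuming a raw binding that returns a `(status, value)` tuple instead of raising. Every name here (`Status`, `BindingError`, `fake_get_device_attribute`, `get_device_attribute`) is a hypothetical stand-in so the example runs without a GPU; it is not the actual contents of rmm._cuda.gpu.

```python
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical stand-in for a CUDA status enum."""

    SUCCESS = 0
    INVALID_DEVICE = 101


class BindingError(RuntimeError):
    """Hypothetical exception that carries the failing status code."""

    def __init__(self, status):
        self.status = status
        super().__init__(f"{status.name} ({int(status)})")


def fake_get_device_attribute(attr, device):
    """Stand-in for a raw binding: returns (status, value), never raises."""
    if device != 0:
        return Status.INVALID_DEVICE, None
    return Status.SUCCESS, 1024


def get_device_attribute(attr, device):
    """High-level wrapper: unpack the status and raise on failure."""
    status, value = fake_get_device_attribute(attr, device)
    if status != Status.SUCCESS:
        raise BindingError(status)
    return value
```

Keeping this unpack-and-raise step in one place means downstream libraries call `get_device_attribute(...)` and handle ordinary Python exceptions rather than checking status codes themselves.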

One API change worth mentioning here is to the function rmm._cuda.gpu.getDeviceAttribute. Previously, the API accepted a cudaDeviceAttr, a type defined as part of RMM's custom CUDA bindings. The API has now changed to accept a cudaDeviceAttr defined in CUDA-Python. This requires changes in downstream libraries that use this API.

I am marking this PR non-breaking as it does not affect the user-facing API. It does cause breakages in downstream libraries that are currently relying on internal APIs (from the rmm._cuda module).

@shwina shwina requested a review from a team as a code owner December 3, 2021 15:41
@github-actions github-actions bot added the Python Related to RMM Python API label Dec 3, 2021
@shwina shwina marked this pull request as draft December 3, 2021 16:06
@jakirkham
Member

Seeing all of the code dropped here is great! Thanks for working on this Ashwin! 😄

Contributor

@bdice bdice left a comment

This is awesome @shwina! Thanks for doing this. I reviewed this primarily to learn more about RMM's Cython API, which I haven't looked at much before. I have one suggestion and one question attached below.

python/rmm/_cuda/gpu.py Show resolved Hide resolved
python/rmm/_cuda/gpu.py Outdated Show resolved Hide resolved
@harrism
Member

harrism commented Dec 6, 2021

This is so cool. RMM is getting leaner!

@harrism harrism added the improvement Improvement / enhancement to an existing function label Dec 6, 2021
@harrism
Member

harrism commented Dec 6, 2021

Is this breaking or non-breaking?

@shwina
Contributor Author

shwina commented Dec 6, 2021

I'm marking it non-breaking, as the only changes are to non-public APIs. However, this will break cudf as it consumes non-public RMM APIs.

As part of this PR, I'm going to remove those APIs entirely from RMM (rmm._cuda.gpu), and instead have cudf also use CUDA Python directly.

@jakirkham
Member

jakirkham commented Dec 6, 2021

Also should we be adding cuda-python to install_requires in setup.py as well as in the Conda recipe and environments?

@shwina
Contributor Author

shwina commented Dec 6, 2021

Thanks @jakirkham - I just did that. Could you please double check that the constraints are appropriate?

@shwina shwina added the non-breaking Non-breaking change label Dec 6, 2021
@bdice
Contributor

bdice commented Dec 6, 2021

Also should we be adding cuda-python to install_requires in setup.py as well as in the Conda recipe and environments?

Also python/dev_requirements.txt. Also, the install_requires are currently unpinned but the setup_requires is pinned, which strikes me as unexpected.

Is it intentional that we have pinnings in pyproject.toml and setup.py's setup_requires for Cython? I think we only need the pyproject.toml build system but am not sure.

@jakirkham
Member

Thanks @jakirkham - I just did that. Could you please double check that the constraints are appropriate?

Hmm...I'm not seeing them. Did they get pushed?

@github-actions github-actions bot added the conda label Dec 7, 2021
conda/environments/rmm_dev_cuda10.1.yml Outdated Show resolved Hide resolved
conda/recipes/rmm/meta.yaml Outdated Show resolved Hide resolved
python/setup.py Show resolved Hide resolved
python/rmm/_cuda/gpu.py Outdated Show resolved Hide resolved
@shwina shwina removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Jan 14, 2022
Comment on lines 10 to 12
err, name = cudart.cudaGetErrorName(status)
if err != cudart.cudaError_t.cudaSuccess:
    raise CUDARuntimeError(err.value)
Contributor

@bdice bdice Jan 14, 2022

It's awkward that we might raise a CUDARuntimeError from within the constructor of CUDARuntimeError... which got me to looking a little closer.

I saw the suggestion from @jakirkham in #930 (comment) / 8be6c0b but I'm pretty sure this is a spurious error check that illuminates some slightly awkward design in cuda-python (a meaningless err value).

For cudaGetErrorName, the err value is hardcoded as cudaError_t.cudaSuccess. It's not likely that this will ever change, because the corresponding runtime API does not generate a CUDA error: cudaGetErrorName returns a const char* with a special value of "unrecognized error code" rather than setting an error code if the parameters are invalid.
https://github.com/NVIDIA/cuda-python/blob/746b773c91e1ede708fe9a584b8cdb1c0f32b51d/cuda/cudart.pyx#L8084

Similarly for cudaGetErrorString... sort of:
https://github.com/NVIDIA/cuda-python/blob/746b773c91e1ede708fe9a584b8cdb1c0f32b51d/cuda/cudart.pyx#L8115
This function actually can set an error code via its call to the driver API cuGetErrorString, but that error is not returned -- the cudaGetErrorString in cuda-python also has err hard-coded as a success and the value "unrecognized error code" is returned. https://github.com/NVIDIA/cuda-python/blob/746b773c91e1ede708fe9a584b8cdb1c0f32b51d/cuda/_lib/ccudart/ccudart.pyx#L522-L531

In summary, I think it would be safe to revert 8be6c0b and use the previous snippet with a throwaway variable _, name = ... for both cudaGetErrorName and cudaGetErrorString because the err is always a success. As a consequence, we don't need to handle this awkward possibility of raising an error class in its own constructor.
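A minimal sketch of the constructor shape proposed above, assuming lookup functions that always report success in the first tuple element. `Status`, `fake_get_error_name`, `fake_get_error_string`, and `CUDARuntimeErrorSketch` are hypothetical stand-ins, not the RMM or cuda-python code; the real bindings may also return bytes rather than str.

```python
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical stand-in for a CUDA status enum."""

    SUCCESS = 0
    MEMORY_ALLOCATION = 2


_UNRECOGNIZED = "unrecognized error code"
_ERROR_TABLE = {
    Status.MEMORY_ALLOCATION: ("cudaErrorMemoryAllocation", "out of memory"),
}


def fake_get_error_name(status):
    """Stand-in lookup: the first element is always SUCCESS."""
    name, _ = _ERROR_TABLE.get(status, (_UNRECOGNIZED, _UNRECOGNIZED))
    return Status.SUCCESS, name


def fake_get_error_string(status):
    """Stand-in lookup: the first element is always SUCCESS."""
    _, message = _ERROR_TABLE.get(status, (_UNRECOGNIZED, _UNRECOGNIZED))
    return Status.SUCCESS, message


class CUDARuntimeErrorSketch(RuntimeError):
    """Builds its message from the lookups and discards their status."""

    def __init__(self, status):
        self.status = status
        # The returned status is always SUCCESS here, so it is thrown away.
        _, name = fake_get_error_name(status)
        _, message = fake_get_error_string(status)
        super().__init__(f"{name}: {message}")
```

Because the constructor never inspects the discarded status, it has no path that raises its own class while being constructed.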

Member

I agree that it is awkward. I would hope this error can't come up, but it might if the error code is new, that is, one that exists only in later CUDA versions and not in earlier ones. It could also show up if the error code is malformed.

Maybe a middle ground would be to raise a RuntimeError so we don't have a recursive CUDARuntimeError definition. Thoughts?

Contributor

If a new or malformed error code is provided, the returned string from cudaGetErrorName or cudaGetErrorString will say "unrecognized error code" but no CUDA error will be set. Thus we do not need to check for a CUDA error from those functions. The cuda-python decision to hard-code the err value as "success" strikes me as odd because no err should be returned at all.

If we were to need to raise an error, the recursive use of CUDARuntimeError is not problematic/incorrect -- just awkward.

However, I don't have strong enough feelings on this to hold back on merging.

Member

Perhaps we should check for that string and then error? Having suppressed errors doesn't sound good.

Agree with your assessment CUDA-Python should be handling this better, but maybe we can have that discussion in a new CUDA-Python issue?

Contributor

@bdice bdice Jan 15, 2022

That sounds like a good idea! Raise a RuntimeError if we get the unrecognized error code string back, but don’t check the value of err:

Suggested change
- err, name = cudart.cudaGetErrorName(status)
- if err != cudart.cudaError_t.cudaSuccess:
-     raise CUDARuntimeError(err.value)
+ _, name = cudart.cudaGetErrorName(status)
+ if name == "unrecognized error code":
+     raise RuntimeError(name)

(and similarly for cudaGetErrorString)

Contributor Author

Fix'd!

Contributor Author

@shwina shwina Jan 18, 2022

After a brief sync offline with Bradley, we decided to just not worry about handling an unrecognized error code, because we couldn't figure out how such an error code could be constructed to begin with. cudaError_t is an enum that cannot be constructed from arbitrary integers.
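A small illustration of that last point, assuming cudaError_t behaves like a standard Python `IntEnum`. `Status` and `to_status` are hypothetical stand-ins; the example only shows the general enum behavior, not the cuda-python type itself.

```python
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical stand-in for a CUDA status enum."""

    SUCCESS = 0
    MEMORY_ALLOCATION = 2


def to_status(code):
    """Return the matching member, or None if the integer is not a member."""
    try:
        return Status(code)
    except ValueError:
        # Enum lookup by value rejects integers that are not defined members.
        return None
```

Under that assumption, an integer with no matching member fails at the enum lookup, so it never reaches the error-name or error-string call as a status value.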

python/rmm/_cuda/gpu.py Outdated Show resolved Hide resolved
python/rmm/_cuda/gpu.py Outdated Show resolved Hide resolved
Contributor

@bdice bdice left a comment

LGTM! Thanks @shwina!

@jakirkham
Member

rerun tests

Member

@harrism harrism left a comment

It's great to simplify. I will leave the Python review up to Python experts -- what would you like me to review?

python/rmm/_cuda/gpu.py Show resolved Hide resolved
@jakirkham
Member

I think you had some review comments above. So was curious if those were addressed from your perspective

If anything else pops out, please let us know :)

@harrism
Member

harrism commented Jan 19, 2022

I think you had some review comments above. So was curious if those were addressed from your perspective

If anything else pops out, please let us know :)

Just the one comment about why we are keeping some cudart functions in RMM.

@shwina
Contributor Author

shwina commented Jan 19, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit d94bdfd into rapidsai:branch-22.02 Jan 19, 2022
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Jan 20, 2022
… CUDA Python bindings (#10008)

This PR replaces custom CUDA bindings that are provided by RMM with official CUDA Python bindings. This PR should be merged after the RMM PR rapidsai/rmm#930

Authors:
  - Ashwin Srinath (https://github.com/shwina)

Approvers:
  - Jordan Jacobelli (https://github.com/Ethyling)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Bradley Dice (https://github.com/bdice)

URL: #10008
rapids-bot bot pushed a commit to rapidsai/raft that referenced this pull request Jan 20, 2022
…451)

As a follow-up to rapidsai/rmm#930, fix RAFT to rely on CUDA Python directly rather than the custom CUDA bindings that RMM provided.

Authors:
  - Ashwin Srinath (https://github.com/shwina)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Jordan Jacobelli (https://github.com/Ethyling)

URL: #451
rapids-bot bot pushed a commit to rapidsai/cugraph that referenced this pull request Jan 20, 2022
Please do not merge until rapidsai/rmm#930 is merged.

For the reasons described in that PR, this API has changed to accept a `cuda.cudart.cudaDeviceAttr` object, `cuda` being the official CUDA Python bindings package.

Authors:
  - Ashwin Srinath (https://github.com/shwina)

Approvers:
  - Brad Rees (https://github.com/BradReesWork)
  - Rick Ratzel (https://github.com/rlratzel)
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #2008
Labels
conda improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Related to RMM Python API