-
Notifications
You must be signed in to change notification settings - Fork 532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] cuML's estimator Base class for preprocessing models #3270
[REVIEW] cuML's estimator Base class for preprocessing models #3270
Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had one question about the relationship with PR #3257
return X | ||
|
||
def _more_tags(self): | ||
return {'allow_nan': True} | ||
return {'X_types_gpu': ['2darray', 'sparse'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viclafargue we're in the process of making the tags system static in #3257, so depending on timing that PR will affect this one or the other way around. Do you foresee many issues arising from that change for these classes in _thirdparty/sklearn
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem, I'll wait for your PR. Should be fairly simple, the models inherit from the Base
class. I'll just have to make _more_tags
methods static everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viclafargue I think were interested in knowing if you have any tags that are dynamic and will change from one instance to the other depending on the properties of the class. Or can all of the tags be determined in a static method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't knew these tags could be instance specific. From what I could see, all of them seems to be class specific (static) for preprocessing. After closer look, there seems to be at least one occurrence of instance-specific tag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which tag would be instance specific?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah I see it, let me think on on it for a second, have a couple of ideas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is using the AllowNaNTagMixin
already, instead of defining _more_tags, you can just add
cuml/python/cuml/common/mixins.py
Line 306 in c4c4068
class SparseInputTagMixin: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this PR I think the use of CumlArrayDescriptor looks pretty good but I have some concerns about the use of decorators and the class inheritance. In order to work seamlessly with the descriptors/decorators added in 0.17, this will need some significant changes to the architecture (the ESTIMATOR_GUIDE.md might be helpful).
Before approving this PR or make any suggestions, I would prefer to discuss the design decisions with Victor to understand the motivation first and then do another review.
return X | ||
|
||
def _more_tags(self): | ||
return {'allow_nan': True} | ||
return {'X_types_gpu': ['2darray', 'sparse'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
7023c85
to
0e467f8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay on my review @viclafargue
return X | ||
|
||
def _more_tags(self): | ||
return {'allow_nan': True} | ||
return {'X_types_gpu': ['2darray', 'sparse'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is using the AllowNaNTagMixin
already, instead of defining _more_tags, you can just add
cuml/python/cuml/common/mixins.py
Line 306 in c4c4068
class SparseInputTagMixin: |
python/cuml/_thirdparty/sklearn/preprocessing/_discretization.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Will give feedback on the rest tomorrow - just need to think on it a bit but wanted to add this comment first)
I see the TODOs about preserving order. Is this something that should really be a generic feature of the base class output conversion? Maybe some other transform-type models need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for being so slow! I REALLY debated whether there was another approach that would reduce the delta in the _thirdparty_dependencies section, but in the end I believe you found the best solution so we should move forward with this PR.
My one caveat overall is that we should make sure we're testing the various different sparse matrix formats that are supported... in a couple of cases, I think we may not be testing them all (noted in comments)
870a942
to
fd5e73c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good from my review pov
Looks great! |
Outdated review from several months ago.
Codecov Report
@@ Coverage Diff @@
## branch-0.19 #3270 +/- ##
===============================================
+ Coverage 80.87% 82.23% +1.36%
===============================================
Files 228 226 -2
Lines 17630 17480 -150
===============================================
+ Hits 14258 14375 +117
+ Misses 3372 3105 -267
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
@gpucibot merge |
Answers #3201 .
This PR makes preprocessing models fully compliant with cuML's estimator Base class and the tagging system.
Preprocessing models were decorated with
cuml_estimator
and preprocessing functions withcuml_function
to make use of features offered by cuML's estimator Base class. The return type offit
andtransform
method were specified.CumlArrayDescriptor
attributes were created and theget_param_names
and_more_tags
methods were added when necessary.As the
SparseCumlArray
class can only handle CSR matrices for now, preprocessing models will only return this type as sparse outputs.