Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Exposing model_selection in a similar way to scikit-learn #3329

Merged
merged 12 commits into from
Jan 13, 2021

Conversation

ptartan21
Copy link
Contributor

Resolving #3267. It seems that model_selection is already properly exposed through cuml.preprocessing.model_selection A small test suite for train_test_split is included in this PR to demonstrate that it works as desired.

@ptartan21 ptartan21 requested a review from a team as a code owner December 28, 2020 23:33
@GPUtester
Copy link
Contributor

Can one of the admins verify this patch?

1 similar comment
@GPUtester
Copy link
Contributor

Can one of the admins verify this patch?

@wphicks
Copy link
Contributor

wphicks commented Dec 29, 2020

I believe the issue raised in #3267 is that this feature is exposed in a different place from sklearn. A proper test for resolution of this issue should include the line from cuml.model_selection import train_test_split.

@ptartan21
Copy link
Contributor Author

I believe the issue raised in #3267 is that this feature is exposed in a different place from sklearn. A proper test for resolution of this issue should include the line from cuml.model_selection import train_test_split.

I misunderstood the issue - it should be good now. Thanks!

@ptartan21 ptartan21 changed the title [REVIEW] Test model_selection exposure [REVIEW] Exposing model_selection in a similar way to scikit-learn Dec 29, 2020
Copy link
Contributor

@wphicks wphicks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some inline comments. Also, was there a particular reason model_selection.py was renamed and moved into a model_selection directory as part of this PR? I'm not constitutionally opposed to it, but it seems like an unrelated change that might better be performed as part of a PR adding other model selection features.

@@ -0,0 +1,4 @@
from ._split import train_test_split
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's switch this to an absolute import as recommended by PEP8.

@@ -13,7 +13,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
from cuml.preprocessing.model_selection import train_test_split
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to remove this exposure, we need to go through a deprecation process. This will break at least one of our demos, so I recommend splitting it off into a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be appropriate to replace this with from cuml.model_selection import train_test_split for now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in general it's better to have __init__ files import from the base source file rather than going through an additional layer of indirection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense, but is there a way to do this without duplicating code? If we do not move model_selection.py out from preprocessing, train_test_split could be imported from both cuml.preprocessing.model_selection and cuml.model_selection which may not be desirable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll want it to be importable from both locations until we've gone through a deprecation process. We might start the deprecation process with this PR, though. We'll want a warning if users try to import from the old location, and then after a release cycle, we can eliminate it entirely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also get our examples updated accordingly.

@@ -24,13 +24,15 @@
PolynomialFeatures as cuPolynomialFeatures, \
SimpleImputer as cuSimpleImputer, \
RobustScaler as cuRobustScaler, \
KBinsDiscretizer as cuKBinsDiscretizer
KBinsDiscretizer as cuKBinsDiscretizer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra space at the end; our linters should pick it up, but mentioning it since I noticed it.

@ptartan21
Copy link
Contributor Author

Left some inline comments. Also, was there a particular reason model_selection.py was renamed and moved into a model_selection directory as part of this PR? I'm not constitutionally opposed to it, but it seems like an unrelated change that might better be performed as part of a PR adding other model selection features.

model_selection.py was moved to model_selection and renamed in order to mimic scikit-learn's API/module structure (https://github.com/scikit-learn/scikit-learn/tree/0.24.X/sklearn/model_selection) for train_test_split and other data splitting functions that could be added in the future.

@wphicks
Copy link
Contributor

wphicks commented Dec 29, 2020

All right! I think that's okay then.

Copy link
Contributor

@wphicks wphicks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you might have accidentally resurrected the old model_selection.py. Let's eliminate the duplicate code and make sure the warnings end up in the proper location.

@ptartan21
Copy link
Contributor Author

Looks like you might have accidentally resurrected the old model_selection.py. Let's eliminate the duplicate code and make sure the warnings end up in the proper location.

At first I wasn't sure how to deprecate without duplicating the code which was why I reincluded the old model_selection.py. The solution that I have now is to just raise a warning and call from cuml.model_selection

@wphicks
Copy link
Contributor

wphicks commented Jan 1, 2021

At first I wasn't sure how to deprecate without duplicating the code which was why I reincluded the old model_selection.py. The solution that I have now is to just raise a warning and call from cuml.model_selection

Ah, I see what you were getting at, but that shouldn't be necessary. model_selection.py can now just be:

from cuml.model_selection._split import train_test_split
# INSERT DEPRECATION WARNING HERE

No need to introduce the _new_* functions, and having them there will make the deprecation process more annoying than simply removing model_selection.py in a later commit. I guess technically to be absolutely certain we don't break anything, we should import all symbols formerly exposed in model_selection.py, but train_test_split is the most important one.

@ptartan21
Copy link
Contributor Author

At first I wasn't sure how to deprecate without duplicating the code which was why I reincluded the old model_selection.py. The solution that I have now is to just raise a warning and call from cuml.model_selection

Ah, I see what you were getting at, but that shouldn't be necessary. model_selection.py can now just be:

from cuml.model_selection._split import train_test_split
# INSERT DEPRECATION WARNING HERE

No need to introduce the _new_* functions, and having them there will make the deprecation process more annoying than simply removing model_selection.py in a later commit. I guess technically to be absolutely certain we don't break anything, we should import all symbols formerly exposed in model_selection.py, but train_test_split is the most important one.

This makes a lot of sense. Thank you!

Copy link
Contributor

@wphicks wphicks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great! I think we're almost there. One question and one tweak, and then I think we can move this thing forward. Thanks for sticking with it!

python/cuml/model_selection/__init__.py Outdated Show resolved Hide resolved
python/cuml/preprocessing/model_selection.py Outdated Show resolved Hide resolved
Copy link
Contributor

@wphicks wphicks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks again for the work on this.

@wphicks wphicks linked an issue Jan 4, 2021 that may be closed by this pull request
@wphicks wphicks added feature request New feature or request non-breaking Non-breaking change labels Jan 4, 2021
@wphicks wphicks added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Jan 4, 2021
@JohnZed
Copy link
Contributor

JohnZed commented Jan 5, 2021

Ok to test

@wphicks
Copy link
Contributor

wphicks commented Jan 6, 2021

Build errors are unrelated to this PR. Once #3316 is merged, we should merge mainline into this branch and test from there.

@codecov-io
Copy link

Codecov Report

Merging #3329 (9167de9) into branch-0.18 (550121b) will increase coverage by 0.04%.
The diff coverage is 84.34%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.18    #3329      +/-   ##
===============================================
+ Coverage        71.48%   71.53%   +0.04%     
===============================================
  Files              207      208       +1     
  Lines            16748    16816      +68     
===============================================
+ Hits             11973    12029      +56     
- Misses            4775     4787      +12     
Impacted Files Coverage Δ
python/cuml/decomposition/incremental_pca.py 94.70% <ø> (ø)
python/cuml/dask/ensemble/base.py 19.69% <30.43%> (+0.36%) ⬆️
python/cuml/ensemble/randomforestregressor.pyx 70.83% <44.44%> (ø)
...ython/cuml/dask/ensemble/randomforestclassifier.py 30.00% <50.00%> (+0.51%) ⬆️
python/cuml/dask/ensemble/randomforestregressor.py 35.08% <50.00%> (+0.54%) ⬆️
python/cuml/fil/fil.pyx 91.87% <60.00%> (-1.88%) ⬇️
python/cuml/ensemble/randomforestclassifier.pyx 73.72% <66.66%> (ø)
python/cuml/model_selection/_split.py 90.35% <90.35%> (ø)
python/cuml/manifold/t_sne.pyx 79.42% <98.30%> (+3.34%) ⬆️
python/cuml/__init__.py 100.00% <100.00%> (ø)
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9ad71f0...9167de9. Read the comment docs.

@rapids-bot rapids-bot bot merged commit ecd508c into rapidsai:branch-0.18 Jan 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Expose model_selection in a similar way to Scikit-learn
5 participants