[REVIEW] Allow saving Dask RandomForest models immediately after training (fixes #3331) #3388

Merged
5 commits merged into rapidsai:branch-0.18 on Feb 1, 2021

Conversation

jameslamb
Member

This attempts to fix #3331. See that issue for a lot more details.

Today, .get_combined_model() on the Dask RandomForest model objects returns None if it's called immediately after training, even though that is the pattern recommended in "Distributed Model Pickling". Without this support, there is no way to save a Dask RandomForest model using only public methods and attributes of those classes.

Per #3331 (comment), this PR proposes populating the internal model object whenever get_combined_model() is called.
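
To make the proposal concrete, here is a minimal sketch of the "populate on access" idea. The classes FakeWorkerModel and LazyCombinedForest and the helper _concat_worker_models are hypothetical stand-ins written for illustration; they are not cuML classes and do not reflect how cuML actually combines per-worker forests.

```python
class FakeWorkerModel:
    """Stand-in for a per-worker forest (hypothetical, not a cuML class)."""

    def __init__(self, trees):
        self.trees = list(trees)


class LazyCombinedForest:
    """Hypothetical distributed estimator that builds its combined model on demand."""

    def __init__(self):
        self._worker_models = []     # filled in by fit()
        self._internal_model = None  # combined single-process model

    def fit(self, partitions):
        # Pretend each partition trains one worker-local forest.
        self._worker_models = [FakeWorkerModel(p) for p in partitions]
        self._internal_model = None  # drop any stale combined model
        return self

    def _concat_worker_models(self):
        # Assumed helper: merge all worker-local trees into one model.
        combined = FakeWorkerModel([])
        for model in self._worker_models:
            combined.trees.extend(model.trees)
        return combined

    def get_combined_model(self):
        # Nothing to combine if the model has not been trained yet.
        if not self._worker_models:
            return None
        # Populate the internal model the first time it is requested.
        if self._internal_model is None:
            self._internal_model = self._concat_worker_models()
        return self._internal_model
```

With this shape, calling get_combined_model() right after fit() returns a usable object instead of None, and repeated calls reuse the same combined model.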

Notes for Reviewers

  • I have not tested this locally. I spent about 3 hours trying to build cuml from source following https://github.com/rapidsai/cuml/blob/main/BUILD.md, and was not successful. If there is a containerized setup for developing cuml, I'd greatly appreciate a pointer to it and would be happy to try it out. I've added a unit test for this change, so I hope that will be enough to confirm that this works and that CI will catch any mistakes I've made.

Thanks for your time and consideration.

jameslamb requested a review from a team as a code owner on January 20, 2021
github-actions bot added the Cython / Python label on Jan 20, 2021
Contributor

hcho3 commented Jan 20, 2021

@jameslamb I'm sorry to hear that you had trouble building cuML from the source. Here is a page for a Dockerized setup. Also, feel free to ping me if you'd like more help with building cuML (I do it at least several times a week).

Contributor

hcho3 commented Jan 21, 2021

rerun tests

hcho3 added the Dask / cuml.dask, feature request, non-breaking, and 3 - Ready for Review labels on Jan 21, 2021
@codecov-io

Codecov Report

Merging #3388 (eb67bd1) into branch-0.18 (550121b) will increase coverage by 0.12%.
The diff coverage is 84.78%.

@@               Coverage Diff               @@
##           branch-0.18    #3388      +/-   ##
===============================================
+ Coverage        71.48%   71.60%   +0.12%     
===============================================
  Files              207      210       +3     
  Lines            16748    16932     +184     
===============================================
+ Hits             11973    12125     +152     
- Misses            4775     4807      +32     
| Impacted Files | Coverage Δ |
|---|---|
| python/cuml/decomposition/incremental_pca.py | 94.70% <ø> (ø) |
| python/cuml/dask/ensemble/base.py | 19.69% <30.43%> (+0.36%) ⬆️ |
| ...ython/cuml/dask/ensemble/randomforestclassifier.py | 29.76% <40.00%> (+0.27%) ⬆️ |
| python/cuml/dask/ensemble/randomforestregressor.py | 34.42% <40.00%> (-0.12%) ⬇️ |
| python/cuml/ensemble/randomforestregressor.pyx | 70.83% <44.44%> (ø) |
| python/cuml/fil/fil.pyx | 91.87% <60.00%> (-1.88%) ⬇️ |
| python/cuml/ensemble/randomforestclassifier.pyx | 73.72% <66.66%> (ø) |
| python/cuml/multiclass/multiclass.py | 84.21% <84.21%> (ø) |
| python/cuml/model_selection/_split.py | 90.35% <90.35%> (ø) |
| python/cuml/svm/svm_base.pyx | 94.27% <91.30%> (-0.63%) ⬇️ |

... and 17 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d72c54a...eb67bd1. Read the comment docs.

Contributor

@JohnZed JohnZed left a comment

Looks good, and thank you for the contribution! I have two small requests.
Also, one test is failing, but that is due to an unrelated issue currently being fixed in #3391.

yet been trained.
"""

# set internal model if it hasn't been accessed before

Contributor

Can we move these to dask.ensemble.base instead of the two versions?

Member Author

Oh sure, I can try that. I didn't think that would work because cuml.dask.ensemble.base.BaseRandomForestModel inherits from object and doesn't have self._get_internal_model() defined.

But I see now that it references that method (if self._get_internal_model() is None:), so I guess that class isn't intended to be used by itself and assumes it's only used as a mixin together with cuml.dask.common.base.BaseEstimator.
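
For readers less familiar with this pattern, here is a small sketch of a mixin that calls a method it does not define. All three classes (EstimatorBase, ForestMixin, ForestClassifier) and the method _build_combined are hypothetical names chosen for illustration; they only mirror the relationship described above and are not cuML's actual class hierarchy.

```python
class EstimatorBase:
    """Hypothetical base that owns the internal model and its accessors."""

    def __init__(self):
        self._internal_model = None

    def _get_internal_model(self):
        return self._internal_model

    def _set_internal_model(self, model):
        self._internal_model = model


class ForestMixin:
    """Hypothetical mixin: relies on accessors that another base must supply."""

    def get_combined_model(self):
        # _get_internal_model() is not defined here; it is resolved through
        # the method resolution order of the concrete class.
        if self._get_internal_model() is None:
            self._set_internal_model(self._build_combined())
        return self._get_internal_model()


class ForestClassifier(EstimatorBase, ForestMixin):
    """Concrete class that combines both bases."""

    def _build_combined(self):
        # Placeholder for building the combined model.
        return {"kind": "classifier"}
```

Instantiating ForestMixin on its own and calling get_combined_model() raises AttributeError, which is why shared logic in such a class only works when it is combined with a base that provides the missing methods.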

Member Author

moved in 83f10b7

    )
    X = X.astype(np.float32)
    if estimator_type == 'classification':
        cu_rf_mg = cuRFC_mg(

Contributor

Many of the params here are either defaults or close to defaults. You could omit them to shrink the test and make it easier to maintain.

Member Author

sure, no problem. I copied this exactly from the existing dask random forest tests:

    cu_rf_mg = cuRFC_mg(max_features=1.0, max_samples=1.0,
                        n_bins=16, split_algo=0, split_criterion=0,
                        min_samples_leaf=2, seed=23707, n_streams=1,
                        n_estimators=n_estimators, max_leaves=-1,
                        max_depth=max_depth)
    y = y.astype(np.int32)
elif estimator_type == 'regression':
    cu_rf_mg = cuRFR_mg(max_features=1.0, max_samples=1.0,
                        n_bins=16, split_algo=0,
                        min_samples_leaf=2, seed=23707, n_streams=1,
                        n_estimators=n_estimators, max_leaves=-1,
                        max_depth=max_depth)
    y = y.astype(np.float32)

Member Author

updated in 83f10b7. I tried to keep a few that would have the most impact on the runtime of training, to keep the tests quick
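
As a rough illustration of the shape of such a test, the sketch below runs a pickle round trip with only a few parameters set. It uses plain-Python stand-ins (train_stub, predict, check_pickle_roundtrip are hypothetical names, and the "model" is just a dict) so that it runs without a GPU or Dask cluster; it is not the test added in this PR.

```python
import pickle
import random


def train_stub(estimator_type, n_estimators=5, max_depth=2, seed=23707):
    """Hypothetical trainer: returns a plain-dict stand-in for a combined model."""
    rng = random.Random(seed)
    return {
        "type": estimator_type,
        "max_depth": max_depth,
        "trees": [rng.random() for _ in range(n_estimators)],
    }


def predict(model, x):
    # Toy prediction: scale the input by the mean of the stored tree values.
    return x * sum(model["trees"]) / len(model["trees"])


def check_pickle_roundtrip(estimator_type):
    # In a real test, this would be the object returned by
    # get_combined_model() right after training.
    model = train_stub(estimator_type)
    assert model is not None

    # Serialize and restore the model in memory.
    restored = pickle.loads(pickle.dumps(model))

    # The restored model should give the same prediction as the original.
    return predict(restored, 3.0) == predict(model, 3.0)
```

Keeping only the parameters that bound training cost (here n_estimators and max_depth) keeps the check quick while still exercising both estimator types.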

Member Author

added a fix in 6a1f2d2, sorry. I got confused by the different levels of inheritance

Contributor

JohnZed commented Jan 28, 2021

rerun tests

Contributor

@JohnZed JohnZed left a comment

Looks good, @jameslamb ! Sorry about the delay, I wanted to also try this out locally. Pickles smoothly now.

Contributor

JohnZed commented Feb 1, 2021

@gpucibot merge

rapids-bot merged commit df67553 into rapidsai:branch-0.18 on Feb 1, 2021
Member Author

No problem! Thanks for the reviews, and to @hcho3 for the pointer to a dockerized setup for building cuml from source. I'll definitely make use of that in the future.

Labels
3 - Ready for Review (Ready for review by team), Cython / Python (Cython or Python issue), Dask / cuml.dask (Issue/PR related to Python level dask or cuml.dask features), feature request (New feature or request), non-breaking (Non-breaking change)
Development

Successfully merging this pull request may close these issues.

[BUG] Dask RandomForestClassifier get_combined_model() and .internal_model return None
4 participants