Support slicing tree model #6302
Conversation
Currently I treat the out-of-bound error specially and raise an exception.
Codecov Report
@@            Coverage Diff             @@
##           master    #6302      +/-   ##
==========================================
+ Coverage   80.75%   81.32%   +0.56%
==========================================
  Files          12       12
  Lines        3372     3421      +49
==========================================
+ Hits         2723     2782      +59
+ Misses        649      639      -10
Pasting the offline conversation with @hcho3 here. The trees in xgboost can be considered a 3-dimensional tensor: the first dimension is the number of boosting rounds, the second is the number of classes, and the last is the size of the forest (num_parallel_tree). This PR supports slicing only the first dimension (boosted rounds); it is possible to support slicing the other dimensions, but to us that seems like over-engineering.
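For illustration, here is a minimal Python sketch of that first-dimension slice, assuming the interface this PR adds exposes Python slice syntax on the Booster; the dataset and parameters are made up for the example.

import numpy as np
import xgboost as xgb

# Toy 3-class dataset.
X = np.random.randn(256, 10)
y = np.random.randint(0, 3, size=256)
dtrain = xgb.DMatrix(X, label=y)

# Each boosting round ("layer") fits num_class * num_parallel_tree trees.
booster = xgb.train(
    {"objective": "multi:softprob", "num_class": 3, "num_parallel_tree": 4},
    dtrain,
    num_boost_round=16,
)

# Slicing addresses only the first dimension (boosting rounds); every
# sliced layer keeps all of its classes and parallel trees.
middle = booster[3:7]        # rounds 3, 4, 5, 6
every_other = booster[::2]   # rounds 0, 2, 4, ...
preds = middle.predict(dtrain)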
This is a great new feature. Any plans to deprecate ntree_limit throughout the code base in favour of the new terminology?
There are other language bindings out there. I need to go over them to deprecate the parameter.
doc/python/model.rst
Outdated
dtrain = xgb.DMatrix(data=X, label=y)
num_parallel_tree = 4
num_boost_round = 16
total_trees = num_parallel_tree * num_classes * num_boost_round
This variable is not used anywhere in the code snippet.
total_trees = num_parallel_tree * num_classes * num_boost_round
Converted into a comment.
include/xgboost/c_api.h
Outdated
 * \brief Slice a model according to layers.
 *
 * \param handle Booster to be sliced.
 * \param begin_layer start of the slice
 * \param end_layer end of the slice
 * \param step step size of the slice
 * \param out Sliced booster.
Suggested change:

 * \brief Slice a model using boosting index. The slice m:n indicates taking all trees
 *        that were fit during the boosting rounds m, (m+1), (m+2), ..., (n-1).
 *
 * \param handle Booster to be sliced.
 * \param begin_layer start of the slice
 * \param end_layer end of the slice; end_layer=0 is equivalent to
 *                  end_layer=num_boost_round
 * \param step step size of the slice
 * \param out Sliced booster.
Added comments.
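To make the suggested wording concrete, here is a hedged Python-side sketch of the same semantics. It assumes a slice of k layers carries k * num_class * num_parallel_tree trees and that an open-ended slice corresponds to end_layer=0, i.e. num_boost_round; the counts via get_dump() are an illustration, not the test shipped with this PR.

import numpy as np
import xgboost as xgb

X = np.random.randn(256, 10)
y = np.random.randint(0, 3, size=256)
dtrain = xgb.DMatrix(X, label=y)

num_class, num_parallel_tree, num_boost_round = 3, 4, 16
booster = xgb.train(
    {"objective": "multi:softprob", "num_class": num_class,
     "num_parallel_tree": num_parallel_tree},
    dtrain,
    num_boost_round=num_boost_round,
)

# The slice m:n keeps the trees fit during rounds m, m+1, ..., n-1, so a
# slice of 5 layers should expose 5 * num_class * num_parallel_tree trees.
sliced = booster[5:10]
assert len(sliced.get_dump()) == 5 * num_class * num_parallel_tree

# An open end plays the role of end_layer=0 in the C API: slice up to
# num_boost_round.
tail = booster[5:]
assert len(tail.get_dump()) == (num_boost_round - 5) * num_class * num_parallel_tree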
include/xgboost/gbm.h
Outdated
/*!
 * \brief Slice the model.
 * \param layer_begin Beginning of boosted tree layer used for prediction.
 * \param layer_end End of booster layer. 0 means do not limit trees.
 * \param out Output gradient booster
 */
Suggested change:

/*!
 * \brief Slice a model using boosting index. The slice m:n indicates taking all trees
 *        that were fit during the boosting rounds m, (m+1), (m+2), ..., (n-1).
 * \param layer_begin Beginning of boosted tree layer used for prediction.
 * \param layer_end End of booster layer. 0 means do not limit trees.
 * \param out Output gradient booster
 */
Added comments.
tests/python/test_basic_models.py
Outdated
def test_slice(self):
    self.run_slice('gbtree')
    self.run_slice('dart')
Can we use @pytest.mark.parametrize instead?

@pytest.mark.parametrize(booster, ['gbtree', 'dart'])
def test_slice(self, booster):
    # Body of test

See the examples at Parametrizing tests in the pytest documentation.
It doesn't seem to be compatible with class methods:
TypeError: test_slice() missing 1 required positional argument: 'booster'
Try @pytest.mark.parametrize('booster', ['gbtree', 'dart']) (note the quotes around booster).
@trivialfis Also, TestModels should not be a subclass of unittest.TestCase (see https://stackoverflow.com/a/35562401). Try making the class a subclass of object:

class TestModels(object):
Thanks for the suggestion, done.
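For reference, a sketch of what the test could look like with both suggestions applied (a plain class plus pytest.mark.parametrize). run_slice below only stands in for the existing helper and is assumed, not copied from the PR.

import pytest


class TestModels:  # a plain class, not a unittest.TestCase subclass
    def run_slice(self, booster):
        # Placeholder for the existing slicing checks (assumed helper).
        assert booster in ('gbtree', 'dart')

    @pytest.mark.parametrize('booster', ['gbtree', 'dart'])
    def test_slice(self, booster):
        self.run_slice(booster)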
layer_begin, layer_end, step, this->model_, tparam_, layer_trees,
[&](auto const &in_it, auto const &out_it) {
  auto new_tree =
      std::make_unique<RegTree>(*this->model_.trees.at(in_it));
Do we have assurance that the implicitly generated copy constructor RegTree(const RegTree&) behaves correctly?
Added tests with prediction.
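A minimal sketch of such a prediction-based check, assuming a slice that keeps every layer copies every tree and therefore reproduces the original booster's output; the actual test added in the PR may differ.

import numpy as np
import xgboost as xgb

X = np.random.randn(256, 10)
y = np.random.randint(0, 2, size=256)
dtrain = xgb.DMatrix(X, label=y)

num_boost_round = 8
booster = xgb.train({"objective": "binary:logistic"}, dtrain,
                    num_boost_round=num_boost_round)

# If the copied RegTree objects are intact, a full slice should predict
# exactly the same values as the original model.
full_copy = booster[0:num_boost_round]
np.testing.assert_allclose(full_copy.predict(dtrain), booster.predict(dtrain))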
The parametrization looks quite nice and has clear benefits.
I'd like to submit a follow-up PR to introduce more test parametrization where it's appropriate. For example, the following snippet can be made more compact using test parametrization: xgboost/tests/python/test_with_dask.py, lines 431 to 444 (at 29745c6).
This PR is meant to end the confusion around best_ntree_limit and unify model slicing. With multi-class models and random forests, asking users to understand how to set ntree_limit is difficult and error prone. Close #5531, close #4052.
Related: the save_best option in early stopping.
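A hedged sketch of how slicing could support a save_best-style workflow, assuming early stopping sets best_iteration on the booster and that keeping rounds 0 through best_iteration is the intended behaviour; the dataset and parameters are illustrative.

import numpy as np
import xgboost as xgb

X = np.random.randn(512, 10)
y = np.random.randint(0, 2, size=512)
dtrain = xgb.DMatrix(X[:400], label=y[:400])
dvalid = xgb.DMatrix(X[400:], label=y[400:])

booster = xgb.train(
    {"objective": "binary:logistic"},
    dtrain,
    num_boost_round=200,
    evals=[(dvalid, "validation")],
    early_stopping_rounds=5,
)

# Instead of carrying best_ntree_limit around at prediction time, slice
# the model once so only the rounds up to the best iteration are kept.
best = booster[: booster.best_iteration + 1]
preds = best.predict(dvalid)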