
[Feature Request] Add score_tree_interval during early stopping #5090

Open
kylejn27 opened this issue Dec 4, 2019 · 9 comments
@kylejn27
Contributor
kylejn27 commented Dec 4, 2019

Request:

Add a score_tree_interval option so that, when training on very large data, the model is not evaluated on every tree.

Similar to this:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/score_tree_interval.html

Purpose:

With large data, scoring every iteration on the validation set is extremely costly. Currently, with early_stopping_rounds, the model is scored on the validation set on every round.
https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor.fit.
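For context, whenever an eval_set is passed, the validation set is scored on every boosting round. A minimal sketch of the current usage (the data variables are placeholders):

import xgboost as xgb

model = xgb.XGBRegressor(n_estimators=1000)
model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],  # scored on every boosting round
    early_stopping_rounds=5,        # stop after 5 rounds without improvement
)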

It would be nice to be able to space out the early stopping rounds, as H2O does; see H2O's stopping_rounds parameter for an example. You can tell H2O to score every 20 trees, and if the model hasn't improved in 5 scoring iterations (i.e., 100 trees), training stops.

It would also be nice if you could enable different early stopping behavior at different points in training. For example, suppose you wanted to not score on the eval_set until the 1000th tree, and then score on every tree after that. This would make training more efficient if you knew beforehand (say, from prior modeling runs) roughly how many trees the model needed before converging.
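To make the requested schedule concrete, here is a minimal sketch of the decision of whether to score a given iteration (should_score, start_eval, and eval_interval are hypothetical names, not existing XGBoost parameters):

def should_score(i, start_eval=0, eval_interval=1):
    """Return True if boosting iteration i should be scored on the eval_set."""
    # skip scoring entirely before start_eval, then score every
    # eval_interval-th tree after that
    return i >= start_eval and (i - start_eval) % eval_interval == 0

# e.g. score trees 100, 105, 110, ...
assert [i for i in range(115) if should_score(i, 100, 5)] == [100, 105, 110]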

I'm primarily concerned with the Python implementation of this library; I don't think this has been implemented elsewhere already.

@trivialfis
Member

Hm, I wanted to implement staged_predict for XGBoost.

@kylejn27
Contributor Author

kylejn27 commented Dec 10, 2019

Happy to take a stab at it. Due to certain restrictions I won't be able to open a PR into this repository until I get approval, but I can collaborate here in the meantime.

@trivialfis
Member

Sorry for the late reply. I will take a look at this option today.

@kylejn27
Contributor Author

No problem, it's not super high priority for me. I'm mostly interested in getting familiar with the internals of this library, and I think attempting to implement this kind of thing would teach me a lot.

@trivialfis
Member

Give it a go, and feel free to reach out to me if you need any help. One piece of personal advice: be careful of the prediction cache, it bites.

@kylejn27
Contributor Author

Great, thanks!

@kylejn27
Contributor Author

kylejn27 commented Jan 2, 2020

@trivialfis Just getting some time to start looking at this today. I think there was a misunderstanding about what I'd be implementing. I was intending to score only certain iterations of the model during model fit, to improve fit performance (speed, not accuracy); staged_predict is on the prediction side.

I don't know of any parameter in the scikit-learn API that enables this kind of thing, so it would have to be an XGBoost-specific parameter.

The logic for this kind of thing should be fairly simple. At a quick glance, all that would need to be done is to skip the eval_set call in this loop if the current iteration is not one of the iterations to score.

Just off the top of my head, it would probably look something like this:

# start scoring on the 100th tree, then score only every 5th tree after that;
# these would be variables given as input by the user
start_eval = 100
eval_interval = 5

...
...

# i is the current iteration of the bst.update(...) loop
if evals and i >= start_eval and (i - start_eval) % eval_interval == 0:
    bst_eval_set = bst.eval_set(evals, i, feval)
    if isinstance(bst_eval_set, STRING_TYPES):
        msg = bst_eval_set
    else:
        msg = bst_eval_set.decode()
    # msg looks like "[0]\teval-rmse:0.123 ..."; res[0] is the iteration tag
    res = [x.split(':') for x in msg.split()]
    evaluation_result_list = [(k, float(v)) for k, v in res[1:]]

...
...

EDIT:

Looks like my initial idea was a bit naive; there would need to be modifications to the early stopping callback as well.
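A rough sketch of what that callback change might look like, assuming the patience is counted in scoring events rather than boosting iterations (all names here are hypothetical, not the actual callback API):

class IntervalEarlyStopping:
    """Track the best score across scoring events and stop after
    stopping_rounds consecutive events without improvement."""

    def __init__(self, stopping_rounds, maximize=False):
        self.stopping_rounds = stopping_rounds
        self.maximize = maximize
        self.best_score = None
        self.bad_rounds = 0  # consecutive scoring events without improvement

    def after_score(self, score):
        # called once per *scoring* iteration, not per boosting round;
        # with eval_interval=20 and stopping_rounds=5, training would
        # stop after 100 trees without improvement
        improved = self.best_score is None or (
            score > self.best_score if self.maximize else score < self.best_score
        )
        if improved:
            self.best_score = score
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.stopping_rounds  # True => stop training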

@trivialfis
Member

Oops, sorry for missing the ping. Will look into it. Might be slow to respond this week.

@kylejn27
Contributor Author

No problem at all! Take your time

@trivialfis trivialfis self-assigned this Dec 16, 2020
@trivialfis trivialfis mentioned this issue Feb 8, 2021