
[Feature Request] Add score_tree_interval during early stopping #5090

Open
kylejn27 opened this issue Dec 4, 2019 · 9 comments
@kylejn27
Contributor
kylejn27 commented Dec 4, 2019

Request:

Add a score_tree_interval option so that, when training on very large data, the model is not evaluated on every tree.

Similar to this:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/score_tree_interval.html

Purpose:

With large data, scoring every iteration on the validation set is extremely costly. Currently, with early_stopping_rounds, the model is scored on the validation set on every round.
https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor.fit.
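For context, whenever an eval_set is passed, the validation set is scored on every boosting round. A minimal sketch of the current usage (the data variables are placeholders):

import xgboost as xgb

model = xgb.XGBRegressor(n_estimators=1000)
model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],  # scored on every boosting round
    early_stopping_rounds=5,        # stop after 5 rounds without improvement
)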

It would be nice to be able to space out the early stopping rounds, as H2O does; see H2O's stopping_rounds parameter for an example. You can tell H2O to score every 20 trees, and if the model hasn't improved in 5 scoring iterations (i.e., 100 trees), training stops.

It would also be nice if you could enable different early stopping behavior at different points in training. For example, suppose you wanted to not score on the eval_set until the 1000th tree, and then score on every tree after that. This would make training more efficient if you knew beforehand (say, from prior modeling runs) roughly how many trees the model needed before converging.
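To make the requested schedule concrete, here is a minimal sketch of the decision of whether to score a given iteration (should_score, start_eval, and eval_interval are hypothetical names, not existing XGBoost parameters):

def should_score(i, start_eval=0, eval_interval=1):
    """Return True if boosting iteration i should be scored on the eval_set."""
    # skip scoring entirely before start_eval, then score every
    # eval_interval-th tree after that
    return i >= start_eval and (i - start_eval) % eval_interval == 0

# e.g. score trees 100, 105, 110, ...
assert [i for i in range(115) if should_score(i, 100, 5)] == [100, 105, 110]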

I'm primarily concerned with the Python implementation of this library; I don't think this has been implemented elsewhere already.

@trivialfis
Member

Hm, I wanted to implement staged_predict for XGBoost.

@kylejn27
Contributor Author

kylejn27 commented Dec 10, 2019

Happy to take a stab at it. Due to certain restrictions I won't be able to open a PR into this repository until I get approval, but I can collaborate here in the meantime.

@trivialfis
Member

Sorry for the late reply. I will take a look at this option today.

@kylejn27
Contributor Author

No problem, it's not super high priority for me. I'm mostly interested in getting familiar with the internals of this library, and I think attempting to implement this kind of thing would teach me a lot.

@trivialfis
Member

Give it a go, and feel free to reach out to me if you need any help. One piece of personal advice: be careful of the prediction cache, it bites.

@kylejn27
Contributor Author

Great, thanks!

@kylejn27
Contributor Author

kylejn27 commented Jan 2, 2020

@trivialfis Just getting some time to start looking at this today. I think there was a misunderstanding about what I'd be implementing. I was intending to score only certain iterations of the model during model fit, to improve fit performance (speed, not accuracy); staged_predict is on the prediction side.

I don't know of any parameter in the scikit-learn API that enables this kind of thing, so it would have to be an XGBoost-specific parameter.

The logic for this kind of thing should be fairly simple. At a quick glance, all that would need to be done is to skip the eval_set call in this loop if the current iteration is not one of the iterations to score.

Just off the top of my head, it would probably look something like this:

# start scoring on the 100th tree, then score only every 5th tree after that;
# these would be variables given as input by the user
start_eval = 100
eval_interval = 5

...
...

# i is the current iteration of the bst.update(...) loop
if evals and i >= start_eval and (i - start_eval) % eval_interval == 0:
    bst_eval_set = bst.eval_set(evals, i, feval)
    if isinstance(bst_eval_set, STRING_TYPES):
        msg = bst_eval_set
    else:
        msg = bst_eval_set.decode()
    # msg looks like "[0]\teval-rmse:0.123 ..."; res[0] is the iteration tag
    res = [x.split(':') for x in msg.split()]
    evaluation_result_list = [(k, float(v)) for k, v in res[1:]]

...
...

EDIT:

Looks like my initial idea was a bit naive; there would need to be modifications to the early stopping callback as well.
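A rough sketch of what that callback change might look like, assuming the patience is counted in scoring events rather than boosting iterations (all names here are hypothetical, not the actual callback API):

class IntervalEarlyStopping:
    """Track the best score across scoring events and stop after
    stopping_rounds consecutive events without improvement."""

    def __init__(self, stopping_rounds, maximize=False):
        self.stopping_rounds = stopping_rounds
        self.maximize = maximize
        self.best_score = None
        self.bad_rounds = 0  # consecutive scoring events without improvement

    def after_score(self, score):
        # called once per *scoring* iteration, not per boosting round;
        # with eval_interval=20 and stopping_rounds=5, training would
        # stop after 100 trees without improvement
        improved = self.best_score is None or (
            score > self.best_score if self.maximize else score < self.best_score
        )
        if improved:
            self.best_score = score
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.stopping_rounds  # True => stop training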

@trivialfis
Member

Oops, sorry for missing the ping. Will look into it. Might be slow to respond this week.

@kylejn27
Contributor Author

No problem at all! Take your time

@trivialfis trivialfis self-assigned this Dec 16, 2020
@trivialfis trivialfis mentioned this issue Feb 8, 2021