
[tune] PBT exploits trials that have no checkpoint, yet #27170

Closed
Tracked by #27165
krfricke opened this issue Jul 28, 2022 · 6 comments
Labels: P1 (Issue that should be fixed within a few weeks) · ray-team-created (Ray Team created) · tune (Tune-related issues)

@krfricke (Contributor) commented Jul 28, 2022:

It seems that PBT sometimes tries to exploit trials that are currently being restarted, or that have just been restarted, using stale statistics.

PBT also tries to exploit trials that have never checkpointed. While copying the hyperparameters could be acceptable, there is no trained state to copy, so training will not progress.

We should investigate to confirm this is actually the case.

On a different note, PBT pauses function trainable trials that have never checkpointed and tries to resume them later. This doesn't work, as there is no checkpoint to resume from.
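The pause/resume problem above can be illustrated with a minimal sketch in plain Python (this is a hypothetical `Trial` class for illustration, not Ray Tune code): resuming can only restore whatever the last checkpoint captured, so a trial that never checkpointed restarts from step 0.

```python
# Hypothetical illustration, not the actual Ray Tune implementation.

class Trial:
    def __init__(self):
        self.step = 0
        self.checkpoint = None  # step recorded at the last checkpoint, if any

    def train(self, steps, checkpoint_every=None):
        for _ in range(steps):
            self.step += 1
            if checkpoint_every and self.step % checkpoint_every == 0:
                self.checkpoint = self.step

    def pause_and_resume(self):
        # Resuming can only restore state from the last checkpoint;
        # with no checkpoint, the trial starts over from scratch.
        self.step = self.checkpoint if self.checkpoint is not None else 0

t = Trial()
t.train(5)                       # never checkpointed
t.pause_and_resume()
print(t.step)                    # 0 -- all progress lost

t2 = Trial()
t2.train(5, checkpoint_every=1)  # "checkpoint every step"
t2.pause_and_resume()
print(t2.step)                   # 5 -- progress preserved
```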

@krfricke krfricke added P1 Issue that should be fixed within a few weeks tune Tune-related issues labels Jul 28, 2022
@xwjiang2010 (Contributor) commented:

I don't quite get this one, can you elaborate?

@krfricke krfricke changed the title [tune] PBT potentially exploits trials that are currently being restarted [tune] PBT exploits trials that have no checkpoint, yet Jul 28, 2022
@krfricke (Contributor, Author) commented:

I've looked into this and updated the issue description and title.

I think one of the main problems is that PBT both exploits and tries to pause/resume trials for which it has not received a checkpoint. These trials will always run from scratch.

The solution for the Horovod test is to either checkpoint on every step or report only once at the end of the iteration. I'll open a PR for that.

@krfricke (Contributor, Author) commented:

@justinvyu has this been resolved by #28509?

@justinvyu (Contributor) commented:

I believe that PR doesn't fix this issue. My understanding is that this happens when checkpoint_interval > perturbation_interval, which is part of the user-defined configuration. I don't have a good way of fixing the case where the user doesn't checkpoint in time for the first perturbation. I have added suggestions in updated examples/guides that recommend checkpointing on every perturbation, or more often than perturbations occur (e.g. this guide). Perhaps we should log a warning and recommend a different checkpointing interval. However, one problem with function checkpointing is that the interval is not controlled by Tune.
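The interval mismatch can be sketched in a few lines of plain Python (hypothetical helper functions, not Ray Tune API): when checkpoint_interval is larger than perturbation_interval, the first perturbation fires before any checkpoint exists, so an exploited trial has nothing to restore from.

```python
# Hypothetical illustration of the checkpoint_interval vs.
# perturbation_interval mismatch; not Ray Tune code.

def checkpoints_available(step: int, checkpoint_interval: int) -> list:
    """Steps at which a checkpoint exists by the time `step` is reached."""
    return [s for s in range(1, step + 1) if s % checkpoint_interval == 0]

def first_perturbation_has_checkpoint(checkpoint_interval: int,
                                      perturbation_interval: int) -> bool:
    """Does any checkpoint exist when the first perturbation fires?"""
    return bool(checkpoints_available(perturbation_interval,
                                      checkpoint_interval))

# checkpoint_interval=4 > perturbation_interval=2: the first perturbation
# at step 2 finds no checkpoint, so an exploited trial restarts from scratch.
print(first_perturbation_has_checkpoint(4, 2))  # False
print(first_perturbation_has_checkpoint(2, 2))  # True (checkpoint at step 2)
print(first_perturbation_has_checkpoint(1, 2))  # True (checkpoint every step)
```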

I do have a PR that I'll post today fixing a related issue: PBT only pulls the checkpoints of trials in the upper quantile, but it's possible for a lower-quantile trial to move into the upper quantile later. My fix was to always save a checkpoint for every trial, since there is no guarantee of which trial will need to be exploited in the next perturbation round.
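Why checkpointing only the upper quantile is insufficient can be shown with a small sketch (plain Python with hypothetical scores, not the actual PBT implementation): quantile membership is recomputed each round, so a trial outside the top quantile in one round can be the exploit target in the next.

```python
# Hypothetical illustration; not the actual PBT quantile logic.

def quantiles(scores, frac=0.25):
    """Split trial ids into (bottom, top) fractions by score."""
    ranked = sorted(scores, key=scores.get)  # ascending by score
    k = max(1, int(len(ranked) * frac))
    return set(ranked[:k]), set(ranked[-k:])

round1 = {"a": 0.9, "b": 0.5, "c": 0.4, "d": 0.1}
round2 = {"a": 0.3, "b": 0.6, "c": 0.9, "d": 0.2}

_, top1 = quantiles(round1)
_, top2 = quantiles(round2)
print(top1)  # {'a'}
print(top2)  # {'c'} -- 'c' was outside the top quantile in round 1, so if
             # only round-1 top trials had checkpoints, 'c' would have none
```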

@richardliaw richardliaw added the ray-team-created Ray Team created label Dec 21, 2022
@krfricke (Contributor, Author) commented:

@justinvyu with your improvements to PBT, this should be resolved, right?

@krfricke (Contributor, Author) commented:

Closing this for now
