[tune] PBT exploits trials that have no checkpoint, yet #27170

krfricke · 2022-07-28T13:28:50Z

~~It seems that PBT sometimes tries to exploit trials that are currently being restarted or have just been restarted with old statistics.~~

PBT tries to exploit trials that never checkpointed. While this could be acceptable for hyperparameter exploitation, it will not progress training.

We should investigate this to see if this is actually the case.

On a different note, PBT pauses function trainable trials that never checkpointed and tries to resume them later. This doesn't work as there is no checkpoint to resume from.

xwjiang2010 · 2022-07-28T14:33:46Z

I don't quite get this one, can you elaborate?

krfricke · 2022-07-28T14:43:57Z

I've looked into this and updated the issue description and title.

I think one of the main problems is that PBT both exploits and tries to pause/resume trials for which it has not received a checkpoint. These trials will always run from scratch.

The solution for the Horovod test is to either checkpoint every step or only report once at the end of the iteration. I'll open a PR for that.
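To illustrate the failure mode described above, here is a minimal sketch of a guard that refuses to exploit a trial that never checkpointed. It is not Ray Tune's implementation: `TrialState`, `pick_exploit_source`, and the choice of the best-scoring candidate (rather than a random one from the upper quantile) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TrialState:
    """Hypothetical stand-in for the per-trial bookkeeping a PBT scheduler keeps."""

    trial_id: str
    score: float
    checkpoint: Optional[str] = None  # None means the trial never checkpointed


def pick_exploit_source(
    trials: List[TrialState], quantile: float = 0.25
) -> Optional[TrialState]:
    """Return the best upper-quantile trial that has a checkpoint, or None."""
    ranked = sorted(trials, key=lambda t: t.score, reverse=True)
    num_top = max(1, int(len(ranked) * quantile))
    for candidate in ranked[:num_top]:
        # Cloning a trial without a checkpoint would restart training from scratch.
        if candidate.checkpoint is not None:
            return candidate
    return None  # nothing restorable in the upper quantile: skip this round


trials = [
    TrialState("a", 0.9),            # best score, but never checkpointed
    TrialState("b", 0.8, "ckpt_b"),  # restorable
    TrialState("c", 0.1),
    TrialState("d", 0.2),
]
source = pick_exploit_source(trials, quantile=0.5)
```

With these inputs, trial `a` is skipped because it has no checkpoint and trial `b` is selected instead; if no upper-quantile trial is restorable, the function returns `None` so the caller can skip exploitation for that round.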

krfricke · 2022-10-28T18:00:48Z

@justinvyu has this been resolved by #28509?

justinvyu · 2022-10-28T18:41:03Z

I believe that PR doesn't fix this issue. My understanding is that this issue occurs when checkpoint_interval > perturbation_interval, which is part of the user-defined configuration. I don't have a good way of fixing the case where the user doesn't checkpoint in time for the first perturbation. I have added suggestions to the updated examples/guides that recommend checkpointing on every perturbation, or more often than perturbations (ex: this guide). Perhaps we should log a warning and recommend a different checkpointing interval. However, one problem with function checkpointing is that the interval is not controlled by Tune.

I do have a PR that I'll post today that fixes a related issue, where PBT only pulls the checkpoint of trials in the upper quantile, but it's possible for a lower-quantile trial to get upgraded. In this case, my fix was to always save the checkpoint of each trial, since there is no guarantee as to which trial will need to be exploited in the next perturbation round.
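The warning suggested above could look roughly like the following sketch. It assumes both intervals are known and measured in the same unit (for example, training iterations), which, as noted, is not generally true for function trainables where the user controls checkpointing. The function name `check_intervals` is hypothetical and not part of Ray Tune.

```python
import warnings


def check_intervals(checkpoint_interval: int, perturbation_interval: int) -> bool:
    """Return True if a checkpoint is expected before the first perturbation.

    Both arguments are assumed to be positive and in the same unit.
    """
    if checkpoint_interval <= 0 or perturbation_interval <= 0:
        raise ValueError("Both intervals must be positive.")
    if checkpoint_interval > perturbation_interval:
        warnings.warn(
            f"checkpoint_interval ({checkpoint_interval}) is larger than "
            f"perturbation_interval ({perturbation_interval}): trials may be "
            "exploited or paused before they have a checkpoint. Consider "
            "checkpointing at least once per perturbation.",
            UserWarning,
        )
        return False
    return True


# Checkpointing every 2 iterations with perturbations every 4 is fine.
ok = check_intervals(checkpoint_interval=2, perturbation_interval=4)
```

A check like this only covers the configured intervals; it cannot detect a function trainable that reports results without ever saving a checkpoint.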

krfricke · 2023-01-20T00:26:12Z

@justinvyu with your improvements to PBT, this should be resolved, right?

krfricke · 2023-04-13T12:47:45Z

Closing this for now

krfricke added the P1 (Issue that should be fixed within a few weeks) and tune (Tune-related issues) labels Jul 28, 2022

krfricke assigned krfricke and xwjiang2010 Jul 28, 2022

krfricke mentioned this issue Jul 28, 2022 in: [tune] long_running_horovod_tune_test errors with file not found error #27165 (Closed)

krfricke changed the title ~~[tune] PBT potentially exploits trials that are currently being restarted~~ [tune] PBT exploits trials that have no checkpoint, yet Jul 28, 2022

krfricke assigned justinvyu Oct 28, 2022

richardliaw added the ray-team-created (Ray Team created) label Dec 21, 2022

krfricke closed this as completed Apr 13, 2023
