[tune] PBT exploits trials that have no checkpoint yet #27170
Comments
I don't quite get this one, can you elaborate?
I've looked into this and updated the issue description and title. I think one of the main problems is that PBT both exploits and tries to pause/resume trials for which it has not received a checkpoint. These trials will always run from scratch. The solution for the horovod test is to either checkpoint every step or only report once at the end of the iteration. I'll open a PR for that.
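To illustrate the two workarounds mentioned above, here is a minimal, Ray-free sketch. It assumes the problematic pattern is a trial that reports results more often than it checkpoints, so the scheduler sees results with no checkpoint behind them. The function `reports_with_checkpoint` and its parameters are hypothetical and only simulate the ordering of reports and checkpoints; this is not Tune's API.

```python
def reports_with_checkpoint(num_steps, report_steps, checkpoint_steps):
    """Simulate one trial and return (step, has_checkpoint) for each report.

    Assumption: a checkpoint taken at a step is saved before that step's report.
    """
    seen = []
    has_checkpoint = False
    for step in range(1, num_steps + 1):
        if step in checkpoint_steps:
            has_checkpoint = True
        if step in report_steps:
            # This is what a scheduler would observe when the result arrives.
            seen.append((step, has_checkpoint))
    return seen


every_step = {1, 2, 3}

# Reports every step, checkpoints only at the end: early reports have no checkpoint.
print(reports_with_checkpoint(3, every_step, {3}))
# Workaround 1: checkpoint every step.
print(reports_with_checkpoint(3, every_step, every_step))
# Workaround 2: report only once, at the end of the iteration.
print(reports_with_checkpoint(3, {3}, {3}))
```

In the first case the scheduler could decide to exploit or pause the trial after step 1 or 2, when nothing has been saved yet. In both workarounds every result it sees is backed by a checkpoint.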
@justinvyu has this been resolved by #28509?
I believe that PR doesn't fix this issue. My understanding of this issue is that I do have a PR that I'll post today that fixes a related issue, where PBT only pulls the checkpoint of trials in the upper quantile, but it's possible for a lower-quantile trial to get upgraded. In this case, my fix was to always save the checkpoint of each trial, since there are no guarantees about which trial will need to be exploited in the next perturbation round.
@justinvyu with your improvements to PBT, this should be resolved, right?
Closing this for now.
It seems that PBT sometimes tries to exploit trials that are currently being restarted or have just been restarted with old statistics. PBT tries to exploit trials that never checkpointed. While this could be acceptable for hyperparameter exploitation, it will not progress training.
We should investigate this to see if this is actually the case.
On a different note, PBT pauses function trainable trials that never checkpointed and tries to resume them later. This doesn't work as there is no checkpoint to resume from.
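One way to picture a guard against both problems is the sketch below. It is a self-contained illustration, not Ray's implementation: the `Trial` dataclass, `pick_donor`, and `can_pause` are hypothetical names, and the quantile logic is a simplified assumption about how PBT ranks trials.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    # Hypothetical stand-in for a Tune trial; not Ray's actual Trial class.
    name: str
    score: float
    checkpoint: Optional[str] = None  # None means the trial never checkpointed


def pick_donor(trials: List[Trial], quantile_fraction: float = 0.25) -> Optional[Trial]:
    """Return the best upper-quantile trial that has a checkpoint, else None."""
    ranked = sorted(trials, key=lambda t: t.score, reverse=True)
    num_top = max(1, int(len(ranked) * quantile_fraction))
    for trial in ranked[:num_top]:
        if trial.checkpoint is not None:
            return trial
    return None


def can_pause(trial: Trial) -> bool:
    """A trial without a checkpoint would restart from scratch on resume."""
    return trial.checkpoint is not None


trials = [
    Trial("a", score=0.9),                       # best score, never checkpointed
    Trial("b", score=0.8, checkpoint="ckpt_b"),
    Trial("c", score=0.3, checkpoint="ckpt_c"),
    Trial("d", score=0.1),
]

donor = pick_donor(trials, quantile_fraction=0.5)
print(donor.name if donor else None)  # b
print(can_pause(trials[0]))           # False
```

Here trial `a` has the best score but is skipped as a donor because cloning it would copy hyperparameters without any training state, and it is also not eligible to be paused.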