-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tune] deflake pbt. #21366
[tune] deflake pbt. #21366
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great discovery, thanks!
self._cur_order += 1 | ||
checkpoint.order = self._cur_order | ||
self.on_checkpoint_internal(checkpoint) | ||
|
||
def on_checkpoint_internal(self, checkpoint): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the on_checkpoint_internal
function here? Seems like we never call it anywhere else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh right, making it private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of splitting out on_checkpoint_internal
into a separate method altogether?
What logic should be contained in the external method vs. the internal method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, would also vote to remove the function completely and just keep on_checkpoint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. combining the two.
To fix the build errors, merge current master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, merging after tests pass
Why are these changes needed?
We use
trial.checkpoint
to restore a perturbed trial. Currently trial.checkpoint is looking at both in-memory and persistent checkpoints to find the most recent one. The definition of "the most recent one" is based on iteration. This may no longer be a valid assumption in PBT case, consideringtrial_low_quantile
may have an iter=2_persistent_checkpoint as well as a iter=1_in_memory_checkpoint (perturbed fromtrial_upper_quantile
).BTW, I think there are a few other issues to clean up for PBT case, depending on how important we think PBT is. For example,
Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.