[tune/rllib] Fix tune cloud tests for function and rllib trainables #20536
Conversation
```diff
@@ -1127,7 +1129,8 @@ def _process_trial_save(self, trial):
                     trial=trial,
                     checkpoint=trial.saving_to)
                 trial.on_checkpoint(trial.saving_to)
-                self.trial_executor.try_checkpoint_metadata(trial)
+                if trial.checkpoint.storage != Checkpoint.MEMORY:
+                    self.trial_executor.mark_trial_to_checkpoint(trial)
```
why is this functionality not part of runner?
I agree, that should be moved into the runner eventually
Remaining flakiness seems to come from sync problems.
I'll kick off another run here, after which we should be able to merge. The sync problem should probably be tackled in a separate PR.
Will this happen for people's production workloads as well?
```diff
+    def testTrialNoCheckpointSave(self):
+        """Check that non-checkpointing trials *are* saved."""
```
@richardliaw quick question: In this PR, we change the behavior so that even non-checkpointing trials are saved. The reason is that we often got out-of-sync checkpoints (the FS sync had already synchronized new checkpoints to the driver, but the trial metadata had not yet been updated on the driver side).
Also, I don't see why we wouldn't want to save the intermediate state of non-checkpointing trials. It seems to me like the main reason would be that we can't restore these trials. However, when early-exiting an experiment via keyboard interrupt, we may still want to analyze the results reported up to that point.
Thus, I quickly wanted to confirm whether there is any other reason why we specifically want trials not to be saved when they are not checkpointing.
What does it mean to save an intermediate trial without checkpoints? Do you mean just to move the trial folder back to the local host?
"Saving" in that sense is to store its metadata (e.g. last results) in the experiment checkpoint (trial runner checkpoint)
That sounds good, though I'm just worried that there may be some confusion internally.
# Conflicts:
#   python/ray/tune/trial_runner.py
Why are these changes needed?
We currently enforce strict checkpoint checking in the cloud tests. However, we sometimes run into a race condition when trials are interrupted: a remote trial might already have progressed to a new checkpoint (and deleted old ones), but the trial scheduler has not yet received the latest result and thus has not added this information to the experiment checkpoint.
Additionally, there was a bug in the trial runner where trial metadata was not updated upon receiving a new result. This also led to problems with interrupted training runs.
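To make the race concrete, here is a hedged sketch (the helper functions, directory layout, and iteration numbers below are made up for illustration, not the actual test code): the experiment checkpoint may still reference an older trial checkpoint while only a newer one exists in the synced trial directory, so a robust check also accepts "too new" checkpoints:

```python
import os
import re
import tempfile


def latest_checkpoint_iteration(trial_dir):
    """Return the highest checkpoint iteration found in a trial directory."""
    iterations = [
        int(m.group(1))
        for name in os.listdir(trial_dir)
        if (m := re.fullmatch(r"checkpoint_(\d+)", name))
    ]
    return max(iterations, default=None)


def checkpoint_is_acceptable(expected_iteration, trial_dir):
    """Strict check: the recorded checkpoint must exist.
    Relaxed check: a *newer* checkpoint is also fine, because the remote
    trial may have progressed (and deleted old checkpoints) before the
    driver updated the experiment checkpoint."""
    latest = latest_checkpoint_iteration(trial_dir)
    return latest is not None and latest >= expected_iteration


if __name__ == "__main__":
    trial_dir = tempfile.mkdtemp()
    # The experiment checkpoint recorded iteration 8 ...
    expected = 8
    # ... but only iteration 9 was synced back (8 was already deleted).
    os.makedirs(os.path.join(trial_dir, "checkpoint_000009"))
    assert checkpoint_is_acceptable(expected, trial_dir)
```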
This PR relaxes the checkpoint requirement to also consider "too new" checkpoints if necessary, and removes the outdated `try_checkpoint_metadata` method, replacing it with the trial cache set.
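As a rough sketch of the trial cache set idea (the classes below are simplified stand-ins, not the actual Ray Tune implementation; only `mark_trial_to_checkpoint` mirrors the name in the diff): trials with persistent checkpoints are marked, and their metadata is written out with the next experiment checkpoint rather than eagerly per trial:

```python
class TrialExecutorSketch:
    """Simplified stand-in for the executor-side trial cache set."""

    def __init__(self):
        # Trials whose metadata should be written with the next
        # experiment checkpoint.
        self._trials_to_cache = set()

    def mark_trial_to_checkpoint(self, trial):
        self._trials_to_cache.add(trial)

    def drain_trials_to_checkpoint(self):
        """Return the marked trials and clear the cache set."""
        trials, self._trials_to_cache = self._trials_to_cache, set()
        return trials


class TrialRunnerSketch:
    """Simplified stand-in for the runner writing the experiment checkpoint."""

    def __init__(self, executor):
        self.trial_executor = executor

    def checkpoint(self):
        # Persist metadata for every trial marked since the last experiment
        # checkpoint, regardless of whether a new trial-level checkpoint
        # was created in the meantime.
        for trial in self.trial_executor.drain_trials_to_checkpoint():
            print(f"persisting metadata for {trial}")


if __name__ == "__main__":
    executor = TrialExecutorSketch()
    runner = TrialRunnerSketch(executor)
    executor.mark_trial_to_checkpoint("trial_1")
    runner.checkpoint()
```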
Related issue number
Checks
I've run `scripts/format.sh` to lint the changes in this PR.