[tune] Only sync down from cloud if needed #26725
Conversation
Signed-off-by: Kai Fricke <[email protected]>
This reverts commit 005b187.
Signed-off-by: Kai Fricke <[email protected]>
```diff
@@ -487,7 +490,8 @@ def save(self, checkpoint_dir: Optional[str] = None) -> str:
         TrainableUtil.write_metadata(checkpoint_dir, metadata)

         # Maybe sync to cloud
-        self._maybe_save_to_cloud(checkpoint_dir)
+        if not prevent_upload:
+            self._maybe_save_to_cloud(checkpoint_dir)
```
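For context on how the new `prevent_upload` flag is meant to be used, here is a minimal, hypothetical sketch of a trainable whose temporary (object-based) checkpoints skip the cloud upload. The class and method bodies are illustrative assumptions for this discussion, not Ray's actual implementation; only the guard around `_maybe_save_to_cloud()` mirrors the diff above.

```python
import os
import shutil
import tempfile
from typing import Optional


class TrainableSketch:
    """Hypothetical stand-in for tune.Trainable, only to illustrate the guard."""

    def __init__(self, logdir: str):
        self.logdir = logdir

    def _maybe_save_to_cloud(self, checkpoint_dir: str) -> None:
        # Placeholder for the real cloud sync (e.g. upload to S3/GCS).
        print(f"Uploading {checkpoint_dir} to cloud storage")

    def save(
        self, checkpoint_dir: Optional[str] = None, prevent_upload: bool = False
    ) -> str:
        checkpoint_dir = checkpoint_dir or os.path.join(self.logdir, "checkpoint")
        os.makedirs(checkpoint_dir, exist_ok=True)
        # ... write checkpoint data and metadata here ...

        # Only persistent checkpoints are uploaded; temporary ones stay local.
        if not prevent_upload:
            self._maybe_save_to_cloud(checkpoint_dir)
        return checkpoint_dir

    def save_to_object(self) -> bytes:
        # Temporary checkpoints (e.g. created during PBT's exploit step) are
        # serialized into an in-memory object, so uploading the intermediate
        # directory would only leave dangling cloud paths.
        tmpdir = tempfile.mkdtemp(prefix="save_to_object_", dir=self.logdir)
        self.save(tmpdir, prevent_upload=True)
        data = b"..."  # the real code serializes the checkpoint directory here
        shutil.rmtree(tmpdir)
        return data
```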
How about syncing to driver?
I recall seeing these tmp folders on driver...
As discussed, out of scope for this PR :-)
(I'll address that in a different PR. The bug this PR fixes actually makes PBT fail, whereas the syncing-to-driver issue is not great but at least lets the experiment continue to run.)
Let's revisit this whole PBT/syncing business after the release push. Ideally I would also like to add an e2e test using the user's repro script, if that makes sense for us.
Signed-off-by: Kai Fricke [email protected]
Why are these changes needed?
Currently, trainables will try to sync temporary checkpoints up to and down from cloud storage, leading to errors. These errors come up e.g. with PBT, which heavily uses saving/restoring from objects.
Instead, we should not sync these temporary checkpoints up at all, and we should generally not sync down if a local checkpoint directory already exists. This also prevents us from trying to sync down non-existent temporary checkpoint directories.
See #26714
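To make the sync-down half of this concrete, here is a minimal sketch of the intended check, under the assumption that some callable performs the actual download. The helper name and its `sync_from_cloud` argument are illustrative, not Ray's actual API.

```python
import os
from typing import Callable


def maybe_sync_down_checkpoint(
    checkpoint_path: str, sync_from_cloud: Callable[[str], None]
) -> None:
    """Only download a checkpoint from cloud storage if it is missing locally.

    Hypothetical helper; `sync_from_cloud` is assumed to download the remote
    checkpoint directory to `checkpoint_path`.
    """
    if os.path.exists(checkpoint_path):
        # The checkpoint already exists on this node (for example a temporary
        # checkpoint that was never uploaded), so downloading is unnecessary,
        # and for temporary checkpoints the remote copy does not even exist.
        return
    sync_from_cloud(checkpoint_path)
```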
Related issue number
Closes #26714
Related PR for 1.13.1: #26717
Checks
I've run scripts/format.sh to lint the changes in this PR.