[RLlib; Tune] Fix WandB metric overlap after restore from checkpoint. #46897
Conversation
Makes sense to me!

By the way, this will only fix the issue for the `setup_wandb` flow, where users call that directly in their code. This does not solve the `WandbLoggerCallback` case -- should we fix it there too?

Nevermind, this is the exact line of code that was changed.

Yeah, that's what I thought, too. Thanks for clarifying and for your review @justinvyu! :)
LGTM. Great improvement for users imo. Checkpointing is so important and should work as cleanly as possible for a good user experience.
```diff
@@ -15,6 +15,7 @@
 from ray._private.storage import _load_class
 from ray.air import session
 from ray.air._internal import usage as air_usage
+from ray.air.constants import TRAINING_ITERATION
```
If `air` goes away anyway, shouldn't this class maybe move over to `train` or `tune`?
```diff
@@ -415,7 +416,7 @@ def run(self):
 log, config_update = self._handle_result(item_content)
 try:
     self._wandb.config.update(config_update, allow_val_change=True)
-    self._wandb.log(log)
+    self._wandb.log(log, step=log.get(TRAINING_ITERATION))
```
Sweet.
Fix WandB metric overlap after restore from checkpoint.
This PR fixes the following problem:

- A user runs an experiment via `Tuner.fit()`, checkpoints every 100 iterations, and uses the WandB integration.
- The user then uses `Tuner.resume()` to resume the experiment from its last checkpoint (which would have happened at iteration 100).

After the restore, the metrics sent to WandB overlap with those already logged before the interruption, because WandB's auto-incremented step no longer matches the actual training iteration.

To fix this, we'll make sure to always send the explicit `step` along with each `wandb.log()` call.

Note that upon restoring from a checkpoint, RLlib already properly sets `Trainable.training_iteration` to the value stored in the last checkpoint. Other `Trainable` subclasses might have to implement similar logic in their checkpointing behavior, or we'll eventually add this to the base `Trainable` class.

Note also that this PR does NOT affect the behavior of users who use the `setup_wandb()` utility and then log their WandB metrics manually (for example in their Ray Train training functions).
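For that manual `setup_wandb()` flow, users can get the same alignment by passing `step` explicitly in their own logging calls. Below is a minimal sketch, not part of this PR; the function name `train_fn`, the project name, and the config keys are placeholders:

```python
from ray.air.integrations.wandb import setup_wandb


def train_fn(config):
    # setup_wandb() returns the initialized WandB run for this trial.
    run = setup_wandb(config, project="my-project")  # project name is a placeholder

    # Assume `start_iteration` was restored from a checkpoint by the user
    # (0 on a fresh run); how that restore happens is up to the user's code.
    start_iteration = config.get("start_iteration", 0)

    for i in range(start_iteration, config.get("num_iterations", 200)):
        loss = 1.0 / (i + 1)  # dummy metric for illustration
        # Passing `step` explicitly keeps the WandB x-axis aligned with the
        # training iteration, even across a restore.
        run.log({"loss": loss}, step=i)
```

This mirrors what the logging actor behind `WandbLoggerCallback` now does by passing `step=log.get(TRAINING_ITERATION)`.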
Why are these changes needed?

Related issue number

Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.