-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tune] Optional forcible trial cleanup, return default autofilled metrics even if Trainable doesn't report at least once #19144
Conversation
Thanks for picking this up. Can you add a screenshot of what the autofilled metrics would look like? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good so far!
@@ -123,15 +130,27 @@ def cleanup(self, partial: bool = True): | |||
If partial=False, all futures are expected to return. If a future | |||
does not return within the timeout period, the cleanup terminates. | |||
""" | |||
# At this point, self._cleanup_map holds the last references |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@krfricke Do you know why we return_pg
before calling trainable.stop()
?
If pg is gone, it doesn't make sense to stop the actor anymore?
Also if we can rely on GC to always clean up remote actor, why do we bother with self._cleanup_map
anyways?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also if we can rely on GC to always clean up remote actor, why do we bother with self._cleanup_map anyways?
We want to have a way to keep a reference for the actor during graceful termination so that the cleanup method may finish
LGTM. Just some questions for my own understanding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks!
Why are these changes needed?
This PR adds two features:
TUNE_FORCE_TRIAL_CLEANUP
.Related issue number
Closes #18745
Checks
scripts/format.sh
to lint the changes in this PR.