-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[train] Fix regression where large Trainer attributes get serialized along with actor class #43234
[train] Fix regression where large Trainer attributes get serialized along with actor class #43234
Conversation
Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! One question: What's the change that caused the regression?
|
||
# Evaluate datasets if they are wrapped in a factory. | ||
trainer.datasets = { | ||
k: d() if callable(d) else d for k, d in self.datasets.items() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the self reference that caused serialization of Huge BaseTrainer object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep that's it!
Signed-off-by: Justin Yu <[email protected]>
…trainer_trainable_size_regression
Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
Running release test to verify fix: https://buildkite.com/ray-project/release/builds/8691 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these changes needed?
#43146 re-introduced a problem that was originally fixed by #28826.
The problem:
Trainer
's arguments, theTrainer
instance on the driver needs to be sure not to package itself (self
) when registering the actor class that will be run remotely.self
gets packaged along with the actor class during serialization, and so do these large objects.Trainer
instance (which serves as the dist. training coordinator), and because Ray Tune uses the Ray GCS redis key value store to save a shared copy of the serialized actor class, a huge actor class causes a gRPC error such as the one below:The fix is to remove the reference to
self
in the train coordinator function trainable. This PR also restructures the code a bit to make it harder to make such a mistake in the future.Future considerations
Related issue number
Closes #43191
Closes #43192
Closes #43193
Closes #43194
Closes #43195
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.