[AIR] Support large checkpoints and other arguments #28826
Conversation
Signed-off-by: Amog Kamsetty <[email protected]>
This is great! I am a bit worried this will make subclassing difficult. Could we perhaps apply the `tune.with_parameters` call through an overridable method?

```python
def _with_parameters(self, trainable_cls, **config):
    return tune.with_parameters(trainable_cls, **config)

def as_trainable(self):
    ...
    return self._with_parameters(Trainable, **config)
```

That way, if a subclass needs to change how the parameters are applied, it can override `_with_parameters`.
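For illustration, a minimal sketch of how a subclass could use that seam (the class name and logging behavior here are hypothetical, not part of this PR):

```python
from ray import tune

class LoggingTrainerMixin:
    """Hypothetical mixin overriding the proposed _with_parameters hook."""

    def _with_parameters(self, trainable_cls, **config):
        # Log which arguments are routed through the object store
        # before delegating to the default tune.with_parameters call.
        print(f"Storing {sorted(config)} in the Ray object store")
        return tune.with_parameters(trainable_cls, **config)
```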
Big fan of this change, but it seems it breaks some tests.
…support-large-checkpoints
Hmm @Yard1, I don't think …
I think we should make …
Is your suggestion specifically needed for HuggingFaceTrainer @Yard1?
Signed-off-by: Amog Kamsetty <[email protected]>
Signed-off-by: Amog Kamsetty <[email protected]>
Looks great!
Signed-off-by: Amog Kamsetty <[email protected]>
```diff
@@ -339,6 +339,11 @@ def setup(self, config):
                setup_kwargs[k] = parameter_registry.get(prefix + k)
            super(_Inner, self).setup(config, **setup_kwargs)

+        # Workaround for actor name not being logged correctly
+        # if __repr__ is not directly defined in a class.
+        def __repr__(self):
```
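The hunk cuts off before the method body. Assuming the comment describes the full intent, a plausible completion simply delegates to the parent class, so that `__repr__` is defined directly on the wrapper class:

```python
# Sketch of the likely method body (not visible in the hunk above).
# Defining __repr__ directly on the class works around the actor-name
# logging issue while keeping the parent class's representation.
def __repr__(self):
    return super().__repr__()
```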
ah is this always an issue then?
It still hasn't been fixed in core afaik
thanks a ton for the change!
LGTM
Signed-off-by: Amog Kamsetty <[email protected]>
Signed-off-by: Amog Kamsetty <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Resolves https://discuss.ray.io/t/resuming-training-from-big-models-in-ray-train-leads-to-grcp-error/7652.
Previously, the arguments passed to the Trainer would be captured in the Trainable context. For arguments that are very large in size, this would prevent the Trainable from being registered due to gRPC resource limits.

Instead, we now always use `tune.with_parameters` to save the Trainer arguments in the object store rather than capturing them in the context.

The test fails before these changes, but passes afterwards.
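As a minimal sketch of the pattern (the function and variable names below are illustrative, not taken from this PR), `tune.with_parameters` places large arguments in the Ray object store instead of serializing them with the trainable itself:

```python
import numpy as np
from ray import tune

# Hypothetical large argument, e.g. a big model checkpoint (~800 MB of
# float64 data), that would exceed gRPC limits if captured in the
# trainable's closure.
big_checkpoint = np.zeros((100, 1024, 1024))

def train_fn(config, checkpoint=None):
    # `checkpoint` is fetched from the object store at runtime rather
    # than being serialized as part of the registered trainable.
    ...

# Stores `big_checkpoint` via ray.put and injects it into train_fn.
trainable = tune.with_parameters(train_fn, checkpoint=big_checkpoint)
tune.run(trainable)
```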
Failing test before the changes:
Why are these changes needed?

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.