[AIR] Add Scaling Config validation #23889

Merged · 20 commits · Apr 19, 2022

Conversation

@Yard1 (Member) commented Apr 13, 2022

Why are these changes needed?

Adds a ScalingConfigDataClass.validate_config classmethod that provides a generic way of validating ScalingConfigs by permitting only certain keys.
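
Roughly, the validation boils down to comparing the keys the user actually set against an allow-list and raising on anything else. Below is a minimal, self-contained sketch of that pattern; the field names and the allowed_keys parameter are illustrative, not the exact Ray AIR API.

from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class ScalingConfigDataClass:
    # Illustrative fields only; the real dataclass defines more.
    num_workers: int = 1
    use_gpu: bool = False
    trainer_resources: Optional[Dict] = None

    @classmethod
    def validate_config(cls, config: Dict, allowed_keys: List[str]) -> None:
        # Reject any key that the caller did not explicitly allow.
        bad_keys = [key for key in config if key not in allowed_keys]
        if bad_keys:
            raise ValueError(
                f"Unsupported scaling config keys: {bad_keys}. "
                f"Allowed keys: {allowed_keys}"
            )

# Example: a trainer that only supports setting the worker count and GPU usage.
ScalingConfigDataClass.validate_config(
    {"num_workers": 2, "use_gpu": True}, allowed_keys=["num_workers", "use_gpu"]
)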

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Comment on lines 134 to 137
scaling_config_arg_name: Name of the ScalingConfig argument to be used
in the exception message.
exc_obj_name: Name of the object calling this method to be used
in the exception message.
Contributor

As mentioned in the previous PR, let's remove these arguments. The enclosing method can re-raise the exception so that the affected key names are in the stack trace, but we should avoid passing strings just for raising error messages.

@Yard1 (Member, Author) Apr 13, 2022

I wonder if we can return a list of bad keys instead, so that the enclosing method can raise an exception with a nice message. I think a single exception would be better from the user perspective than a long stack trace.

@krfricke (Contributor) Apr 14, 2022

I would be open to raising a custom exception here that stores the keys as an attribute, or a hint on how to resolve things.

ray.air.exceptions.ConfigError or so
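
For illustration, such an exception might simply subclass ValueError and carry the offending keys plus an optional resolution hint. This is a hypothetical sketch: ray.air.exceptions.ConfigError is only proposed in this thread, not an existing class.

class ConfigError(ValueError):
    """Hypothetical error that records the offending config keys and a resolution hint."""

    def __init__(self, bad_keys, hint=None):
        self.bad_keys = list(bad_keys)
        self.hint = hint
        message = f"Unsupported scaling config keys: {self.bad_keys}."
        if hint:
            message = f"{message} {hint}"
        super().__init__(message)

# The enclosing method can then raise one readable error instead of a long stack trace:
# raise ConfigError(["trainer_resources"], hint="This trainer only supports num_workers and use_gpu.")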

@Yard1 requested a review from krfricke on April 13, 2022 16:56
@Yard1 (Member, Author) commented Apr 13, 2022

@krfricke updated, PTAL

(Resolved review comments on python/ray/ml/config.py)
@amogkam (Contributor) left a comment

Thanks @Yard1, this looks great so far! I think we should also do some more refactoring to make the allowed keys explicit for each Trainer.

(Resolved review comments on python/ray/ml/config.py)
@ericl (Contributor) left a comment

+1 on @amogkam's suggestions. I think you want this end state, right?

class Trainer:
    _scaling_config_supported_keys = []  # Not scalable by default

    @classmethod
    def validate_config(cls, scaling_config):
        ...  # default validator

class XGBoostTrainer(Trainer):
    _scaling_config_supported_keys = ["num_workers", "use_gpu"]

I'd also like to see this PR show the whole integration path with Trainer, rather than adding a utility function without showing how it works with other classes. This will make it easier to review the interfaces rather than the implementation.
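
A minimal sketch of the default validator in that end state, assuming the scaling config arrives as a plain dict (the method name and error text are illustrative, not the code merged in this PR):

class Trainer:
    _scaling_config_supported_keys = []  # Not scalable by default

    @classmethod
    def _validate_scaling_config(cls, scaling_config: dict) -> dict:
        # Reject keys the concrete trainer does not declare as supported.
        unsupported = [
            key for key in scaling_config
            if key not in cls._scaling_config_supported_keys
        ]
        if unsupported:
            raise ValueError(
                f"{cls.__name__} does not support scaling config keys: {unsupported}"
            )
        return scaling_config

class XGBoostTrainer(Trainer):
    _scaling_config_supported_keys = ["num_workers", "use_gpu"]

# XGBoostTrainer._validate_scaling_config({"num_workers": 4}) passes, while
# Trainer._validate_scaling_config({"num_workers": 4}) raises ValueError.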

@ericl added the @author-action-required label ("The PR author is responsible for the next step. Remove tag to send back to the reviewer.") on Apr 13, 2022
(Resolved review comments on python/ray/ml/config.py)
@krfricke (Contributor) left a comment

I've updated the PR to address the comments from the last feedback round:

  1. The validation method has been split into a dict part and a dataclass part.
  2. It has been moved to a utility function and generalized to arbitrary dataclass and dict objects.
  3. The Trainer now has a private method that returns a validated scaling config dataclass.
  4. The code is now used in the DP/GBDT trainer classes.

I'll also update the sklearn trainer to utilize this. PTAL
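
As a rough sketch of what the generic utility in points 1 and 2 above could look like (the function names and the way defaults are compared are illustrative, not the actual code in this PR):

import dataclasses
from typing import Container

def ensure_only_allowed_dict_keys_set(data: dict, allowed_keys: Container[str]) -> None:
    # Dict part: every provided key must be in the allowed set.
    bad_keys = [key for key in data if key not in allowed_keys]
    if bad_keys:
        raise ValueError(f"Key(s) {bad_keys} are not allowed. Allowed keys: {allowed_keys}")

def ensure_only_allowed_dataclass_fields_set(instance, allowed_fields: Container[str]) -> None:
    # Dataclass part: every field that differs from its default must be allowed.
    for field in dataclasses.fields(instance):
        default = field.default
        if default is dataclasses.MISSING and field.default_factory is not dataclasses.MISSING:
            default = field.default_factory()
        if getattr(instance, field.name) != default and field.name not in allowed_fields:
            raise ValueError(
                f"Field '{field.name}' was set but is not among the allowed fields {allowed_fields}"
            )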

@Yard1 changed the title from "[AIR] Add ScalingConfigDataClass.validate_config" to "[AIR] Add Scaling Config validation" on Apr 18, 2022
@Yard1 (Member, Author) left a comment

Looks great, thanks @krfricke!

(Resolved review comment on python/ray/ml/trainer.py)
Comment on lines +261 to +263
scaling_config_dataclass = self._validate_and_get_scaling_config_data_class(
self.scaling_config
)
Contributor

Should this check be in Trainer itself?

@Yard1 (Member, Author)

I think it's fine; we already call it in Trainer as well (in as_trainable). This is more of a way to get the dataclass inside the training loop.

@Yard1 requested reviews from amogkam and krfricke on April 19, 2022 08:29
@amogkam (Contributor) left a comment

LGTM! Thanks for making the changes!

@amogkam (Contributor) commented Apr 19, 2022

Needs an approval from @krfricke before we can merge.

@Yard1 requested a review from matthewdeng on April 19, 2022 19:41
@amogkam merged commit 1fc6db3 into ray-project:master on Apr 19, 2022
@Yard1 deleted the scaling_config_validate_config branch on April 19, 2022 20:05
amogkam pushed a commit that referenced this pull request Apr 19, 2022
Implements `SklearnTrainer` and `SklearnPredictor`. Full parallelism with joblib + support for GPU-enabled estimators like cuML.

The interface has been modified slightly by the addition of several arguments, which were required for full functionality.

I haven't tested cuML yet, will do it later.

Depends on #23889

Co-authored-by: Kai Fricke <[email protected]>
Labels: @author-action-required
7 participants