
[FIX] Enable preprocessing in reg_cocktails #369

Merged: 8 commits merged into reg_cocktails from fix_preprocessing on Feb 9, 2022

Conversation

ravinkohli (Contributor)

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Note that a Pull Request should only contain one of refactoring, new features or documentation changes.
Please separate these changes and send us individual PRs for each.
For more information on how to create a good pull request, please refer to The anatomy of a perfect pull request.

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?
  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

Description

Following the discussion in last week's meeting, we would like preprocessing to be part of the pipeline in the regularisation cocktails. Therefore, this PR makes the necessary changes to achieve that: it uncomments the various preprocessor transforms as well as the code for learning embeddings. Moreover, it removes the is_small_preprocess attribute from BaseDataset, as we now always preprocess early in the EarlyPreprocessing node.
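For context on the embedding part, here is a minimal, hypothetical sketch (the names and structure are illustrative, not the PR's actual network embedding module) of how learned embeddings for tabular data typically work: each ordinally encoded categorical column goes through its own torch.nn.Embedding, and the results are concatenated with the numerical features before the network body.

    # Hypothetical sketch of per-column categorical embeddings; the real
    # autoPyTorch embedding module is more involved.
    import torch
    import torch.nn as nn

    class TabularEmbeddingSketch(nn.Module):
        def __init__(self, num_categories_per_col, embed_dim=4):
            super().__init__()
            # one embedding table per categorical column
            self.embeddings = nn.ModuleList(
                [nn.Embedding(n, embed_dim) for n in num_categories_per_col]
            )

        def forward(self, x_cat, x_num):
            # x_cat: (batch, n_cat_cols) integer codes; x_num: (batch, n_num_cols) floats
            embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
            return torch.cat(embedded + [x_num], dim=1)

    model = TabularEmbeddingSketch(num_categories_per_col=[3, 5])
    out = model(torch.tensor([[0, 4], [2, 1]]), torch.randn(2, 2))  # shape: (2, 4 + 4 + 2)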

Motivation and Context

This is required to allow searching over preprocessing hyperparameters in the reg_cocktails branch. Also, to avoid redundantly preprocessing the dataset at each epoch, we would like to preprocess the whole dataset once up front.
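As a rough illustration of the preprocess-once idea (a hedged sketch using scikit-learn transforms, not the actual EarlyPreprocessing node), the transforms are fitted and applied a single time over the full dataset, so each training epoch iterates over already-transformed arrays:

    # Hypothetical sketch: fit and apply all preprocessor transforms once,
    # before training, instead of inside the per-epoch data path.
    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    class EarlyPreprocessingSketch:
        def __init__(self):
            # stand-ins for the pipeline's preprocessor transforms
            self.transforms = [SimpleImputer(strategy='mean'), StandardScaler()]

        def fit_transform(self, X):
            for transform in self.transforms:
                X = transform.fit_transform(X)
            return X

    X = np.array([[1.0, np.nan], [2.0, 3.0], [3.0, 4.0]])
    X_ready = EarlyPreprocessingSketch().fit_transform(X)  # computed once, reused every epoch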

How has this been tested?

The previous tests for preprocessing and early preprocessing have been re-enabled. The tests for the embedding module have also been enabled, to check that embeddings are compatible with the preprocessing approach.

@nabenabe0928 (Contributor) left a comment:

Thanks for the PR. I reviewed all the files except the two files related to tabular_feature_validator.py.

@nabenabe0928 (Contributor) left a comment:

I added some comments on tabular_feature_validator.

@nabenabe0928 (Contributor) left a comment:

I checked the tabular_feature_validator.

        # columns are shifted to the left
        list(range(len(cat)))
        for cat in encoded_categories
    ]

    # differently to categorical_columns and numerical_columns,
    # this saves the index of the column.
Contributor:

The lines below would look better written like this (len(enc_columns) > 0 holds for data containing categoricals, right?):

            num_numericals, num_categoricals = self.feat_type.count('numerical'), self.feat_type.count('categorical')
            if num_numericals + num_categoricals != len(self.feat_type):
                raise ValueError("Elements of feat_type must be either ['numerical', 'categorical']")

            self.categorical_columns = list(range(num_categoricals))
            self.numerical_columns = list(range(num_categoricals, num_categoricals + num_numericals))
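To see what this computes (a standalone toy run with hypothetical variable names, not the validator's actual attributes): with three features, two categorical and one numerical, the counts yield index lists that match the encoder output, where categorical columns are shifted to the left.

    # Toy run of the suggested counting logic
    feat_type = ['categorical', 'numerical', 'categorical']

    num_numericals = feat_type.count('numerical')
    num_categoricals = feat_type.count('categorical')
    if num_numericals + num_categoricals != len(feat_type):
        raise ValueError("Elements of feat_type must be either ['numerical', 'categorical']")

    # the encoded output puts categoricals first ("shifted to the left")
    categorical_columns = list(range(num_categoricals))                                   # [0, 1]
    numerical_columns = list(range(num_categoricals, num_categoricals + num_numericals))  # [2]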

Comment on lines -402 to -403:

    if self.all_nan_columns is not None and column in self.all_nan_columns:
        continue
Contributor:

Why don't we need this anymore?

Comment on lines 528 to 538:

    if len(self.dtypes) != 0:
        # when train data has no object dtype, but test does
        # we prioritise the datatype given in training data
        for column, data_type in zip(X.columns, self.dtypes):
            X[column] = X[column].astype(data_type)
    else:
        # Calling for the first time to infer the categories
        X = X.infer_objects()
        for column, data_type in zip(X.columns, X.dtypes):
            if not is_numeric_dtype(data_type):
                X[column] = X[column].astype('category')
Contributor:

Suggested change:

    -     if len(self.dtypes) != 0:
    -         # when train data has no object dtype, but test does
    -         # we prioritise the datatype given in training data
    -         for column, data_type in zip(X.columns, self.dtypes):
    -             X[column] = X[column].astype(data_type)
    -     else:
    -         # Calling for the first time to infer the categories
    -         X = X.infer_objects()
    -         for column, data_type in zip(X.columns, X.dtypes):
    -             if not is_numeric_dtype(data_type):
    -                 X[column] = X[column].astype('category')
    + elif len(self.dtypes) != 0:  # when train data has no object dtype, but test does
    +     # we prioritise the datatype given in training data
    +     for column, data_type in zip(X.columns, self.dtypes):
    +         X[column] = X[column].astype(data_type)
    + else:  # Calling for the first time to infer the categories
    +     X = X.infer_objects()
    +     for column, data_type in zip(X.columns, X.dtypes):
    +         if not is_numeric_dtype(data_type):
    +             X[column] = X[column].astype('category')

Contributor (Author):
I think these are just preferences on where to start the comment.

Contributor:
No no, it actually removed an indent level.

Contributor (Author):

Actually, if you look closely, we are also saving the dtypes in self.object_dtype_mapping, which should be done for both of the two conditions you moved back an indent level. So I think it's fine the way it is.

@nabenabe0928 (Contributor), Feb 3, 2022:

Oh yeah, I had not noticed that. But I had also not noticed that we still have the same issue (which shows up when we have a huge number of features) in this method.
Could you fix it?

        if hasattr(self, 'object_dtype_mapping'):
            # Mypy does not process the hasattr. This dict is defined below
            try:
                X = X.astype(self.object_dtype_mapping)
            except Exception as e:
                self.logger.warning(f'Casting test data to the data types in train data caused the exception {e}')
            return X

        if len(self.dtypes) != 0:
            # when train data has no object dtype, but test does. Prioritise the datatype given in training data
            dtype_dict = {col: dtype for col, dtype in zip(X.columns, self.dtypes)}
            X = X.astype(dtype_dict)
        else:
            # Calling for the first time to infer the categories
            X = X.infer_objects()
            dtype_dict = {col: 'category' for col, dtype in zip(X.columns, X.dtypes) if not is_numeric_dtype(dtype)}
            X = X.astype(dtype_dict)

        # only numerical attributes and categories
        self.object_dtype_mapping = {col: dtype for col, dtype in zip(X.columns, X.dtypes)}
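The point of the dict-based astype here is speed: one vectorised call instead of a Python-level pandas assignment per column, which matters when a frame has a huge number of features. A minimal timing sketch (not from the PR) illustrating the difference:

    # Hypothetical timing sketch: per-column casting vs a single astype(dict) call
    import time

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.integers(0, 10, size=(100, 3000))).astype('float64')

    start = time.perf_counter()
    Y = X.copy()
    for column in Y.columns:  # one pandas assignment per feature
        Y[column] = Y[column].astype('int64')
    loop_time = time.perf_counter() - start

    start = time.perf_counter()
    Z = X.astype({col: 'int64' for col in X.columns})  # single vectorised call
    dict_time = time.perf_counter() - start

    print(f'per-column loop: {loop_time:.3f}s, single astype: {dict_time:.3f}s')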

@nabenabe0928 (Contributor):

Hey, thanks for your effort.
I think we can merge once you've addressed this.

@ravinkohli added the enhancement (New feature or request) label on Feb 3, 2022
@nabenabe0928 (Contributor):

Hi, thanks for the changes and sorry for the late response.
I will approve the changes.

@nabenabe0928 (Contributor):

Looks good to me now:)

@ArlindKadra:

Thanks for the PR :). It looks good to me too now. I only have one last minor question.

@ravinkohli merged commit cca08d5 into reg_cocktails on Feb 9, 2022
ravinkohli added a commit that referenced this pull request on Feb 28, 2022:
* enable preprocessing and remove is_small_preprocess

* address comments from shuhei and fix precommit checks

* fix tests

* fix precommit checks

* add suggestions from shuhei for astype use

* address speed issue when using object_dtype_mapping

* make code more readable

* improve documentation for base network embedding
ravinkohli added a commit that referenced this pull request on Feb 28, 2022, with the same commit message as above.
ravinkohli added a commit that referenced this pull request on Mar 9, 2022, with the same commit message as above.
ravinkohli added a commit to ravinkohli/Auto-PyTorch that referenced this pull request on Apr 12, 2022, with the same commit message as above.
@ravinkohli deleted the fix_preprocessing branch on June 15, 2022 at 13:15
ravinkohli added a commit that referenced this pull request on Jul 26, 2022, with the same commit message as above.
Labels: enhancement (New feature or request)

3 participants