[FIX] Remove redundant categorical imputation #375
Changes from 2 commits
**`TabularColumnTransformer.fit`**

```diff
@@ -48,13 +48,16 @@ def fit(self, X: Dict[str, Any], y: Any = None) -> "TabularColumnTransformer":
             "TabularColumnTransformer": an instance of self
         """
         self.check_requirements(X, y)
-        numerical_pipeline = 'drop'
-        categorical_pipeline = 'drop'
+        # in case the preprocessing steps are disabled
+        # i.e, NoEncoder for categorical, we want to
+        # let the data in categorical columns pass through
+        numerical_pipeline = 'passthrough'
+        categorical_pipeline = 'passthrough'

         preprocessors = get_tabular_preprocessers(X)
-        if len(X['dataset_properties']['numerical_columns']):
+        if len(X['dataset_properties']['numerical_columns']) and len(preprocessors['numerical']):
             numerical_pipeline = make_pipeline(*preprocessors['numerical'])
-        if len(X['dataset_properties']['categorical_columns']):
+        if len(X['dataset_properties']['categorical_columns']) and len(preprocessors['categorical']):
             categorical_pipeline = make_pipeline(*preprocessors['categorical'])

         self.preprocessor = ColumnTransformer([
```

**Reviewer:** These lines could be made more clear.

**Author:** Yeah, I agree. Actually, we don't need to check for the length of the columns, as we will only have preprocessors if the length of the columns is > 0. So I'll change these conditions.
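For context on why the default matters: in scikit-learn's `ColumnTransformer`, a `'drop'` entry silently removes its columns from the output, while `'passthrough'` copies them through unchanged. A minimal standalone sketch (not the project's code) of the two behaviours:

```python
import numpy as np
from sklearn.compose import ColumnTransformer

X = np.array([[0.5, 2.0],
              [1.5, 3.0]])

# 'drop': the named columns are removed from the transformed output,
# which is what made untouched categorical columns disappear.
dropped = ColumnTransformer([('num', 'drop', [0]),
                             ('cat', 'drop', [1])]).fit_transform(X)
print(dropped.shape)   # (2, 0)

# 'passthrough': columns without an active preprocessor survive as-is.
passed = ColumnTransformer([('num', 'passthrough', [0]),
                            ('cat', 'passthrough', [1])]).fit_transform(X)
print(passed.shape)    # (2, 2)
```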
**`SimpleImputer`**

```diff
@@ -13,70 +13,42 @@

 class SimpleImputer(BaseImputer):
-    """An imputer for categorical and numerical columns
-
-    Impute missing values for categorical columns with 'constant_!missing!'
-
-    Note:
-        In case of numpy data, the constant value is set to -1, under the assumption
-        that categorical data is fit with an Ordinal Scaler.
-    """
+    """
+    An imputer for numerical columns
+
+    Attributes:
+        random_state (Optional[np.random.RandomState]):
+            The random state to use for the imputer.
+        numerical_strategy (str: default='mean'):
+            The strategy to use for imputing numerical columns.
+            Can be one of ['mean', 'median', 'most_frequent', 'constant_zero']
+    """

     def __init__(
         self,
         random_state: Optional[np.random.RandomState] = None,
         numerical_strategy: str = 'mean',
-        categorical_strategy: str = 'most_frequent'
     ):
-        """
-        Note:
-            'constant' as numerical_strategy uses 0 as the default fill_value while
-            'constant_!missing!' uses a fill_value of -1.
-            This behaviour should probably be fixed.
-        """
         super().__init__()
         self.random_state = random_state
         self.numerical_strategy = numerical_strategy
-        self.categorical_strategy = categorical_strategy

     def fit(self, X: Dict[str, Any], y: Optional[Any] = None) -> BaseImputer:
-        """ Fits the underlying model and returns the transformed array.
+        """
+        Builds the preprocessor based on the given fit dictionary 'X'.

         Args:
-            X (np.ndarray):
-                The input features to fit on
-            y (Optional[np.ndarray]):
-                The labels for the input features `X`
+            X (Dict[str, Any]):
+                The fit dictionary
+            y (Optional[Any]):
+                Not Used -- to comply with API

         Returns:
-            SimpleImputer:
-                returns self
+            self:
+                returns an instance of self.
         """
         self.check_requirements(X, y)

-        # Choose an imputer for any categorical columns
-        categorical_columns = X['dataset_properties']['categorical_columns']
-
-        if isinstance(categorical_columns, List) and len(categorical_columns) != 0:
-            if self.categorical_strategy == 'constant_!missing!':
-                # Train data is numpy as of this point, where an Ordinal Encoding is used
-                # for categoricals. Only Numbers are allowed for `fill_value`
-                imputer = SklearnSimpleImputer(strategy='constant', fill_value=-1, copy=False)
-                self.preprocessor['categorical'] = imputer
-            else:
-                imputer = SklearnSimpleImputer(strategy=self.categorical_strategy, copy=False)
-                self.preprocessor['categorical'] = imputer
-
         # Choose an imputer for any numerical columns
         numerical_columns = X['dataset_properties']['numerical_columns']
```

**Reviewer** (on `The fit dictionary`): Might be good to give a little example dict in the documentation.

**Reviewer** (on `Not Used -- to comply with API`): I like this change.

**Author:** Yeah, there is outdated documentation throughout the pipeline components. I'll raise an issue and fix them in a different PR.
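Picking up the reviewer's suggestion, a fit dictionary could be illustrated roughly as follows. This is a hypothetical, minimal sketch: only `dataset_properties` with its two column lists is taken from the diff above, and real fit dictionaries in the pipeline carry additional keys.

```python
# Illustrative fit dictionary 'X' as consumed by fit() above (assumed shape,
# not the project's documented schema).
X = {
    'dataset_properties': {
        'numerical_columns': [0, 2],   # indices of numerical columns
        'categorical_columns': [1],    # indices of categorical columns
    },
}

numerical_columns = X['dataset_properties']['numerical_columns']
print(len(numerical_columns) != 0)  # True: a numerical imputer would be chosen
```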
```diff
@@ -98,11 +70,6 @@ def get_hyperparameter_search_space(
             value_range=("mean", "median", "most_frequent", "constant_zero"),
             default_value="mean",
         ),
-        categorical_strategy: HyperparameterSearchSpace = HyperparameterSearchSpace(
-            hyperparameter='categorical_strategy',
-            value_range=("most_frequent", "constant_!missing!"),
-            default_value="most_frequent"
-        )
     ) -> ConfigurationSpace:
         """Get the hyperparameter search space for the SimpleImputer
```

```diff
@@ -112,8 +79,6 @@ def get_hyperparameter_search_space(
                 Note: Not actually Optional, just adhering to its supertype
             numerical_strategy (HyperparameterSearchSpace: default = ...)
                 The strategy to use for numerical imputation
-            caterogical_strategy (HyperparameterSearchSpace: default = ...)
-                The strategy to use for categorical imputation

         Returns:
             ConfigurationSpace
```

```diff
@@ -132,12 +97,6 @@ def get_hyperparameter_search_space(
         ):
             add_hyperparameter(cs, numerical_strategy, CategoricalHyperparameter)

-        if (
-            isinstance(dataset_properties['categorical_columns'], List)
-            and len(dataset_properties['categorical_columns'])
-        ):
-            add_hyperparameter(cs, categorical_strategy, CategoricalHyperparameter)
-
         return cs

     @staticmethod
```
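The remaining numerical strategies map onto scikit-learn's `SimpleImputer`; `'constant_zero'` presumably becomes `strategy='constant'` with `fill_value=0` (the strategy names are taken from the value range above, the sklearn calls are standard API, and the mapping is an assumption). A quick sketch of what each does:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [4.0]])

# 'mean' / 'median' / 'most_frequent' are plain sklearn strategies.
print(SimpleImputer(strategy='mean').fit_transform(X).ravel())
# → [1.  2.5 4. ]

# 'constant_zero' assumed to mean a constant fill of 0.
print(SimpleImputer(strategy='constant', fill_value=0).fit_transform(X).ravel())
# → [1. 0. 4.]
```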
**`BaseImputer`**

```diff
@@ -14,8 +14,7 @@ class BaseImputer(autoPyTorchTabularPreprocessingComponent):
     def __init__(self) -> None:
         super().__init__()
         self.add_fit_requirements([
-            FitRequirement('numerical_columns', (List,), user_defined=True, dataset_property=True),
-            FitRequirement('categorical_columns', (List,), user_defined=True, dataset_property=True)])
+            FitRequirement('numerical_columns', (List,), user_defined=True, dataset_property=True)])

     def transform(self, X: Dict[str, Any]) -> Dict[str, Any]:
```

```diff
@@ -26,7 +25,7 @@ def transform(self, X: Dict[str, Any]) -> Dict[str, Any]:
         Returns:
             (Dict[str, Any]): the updated 'X' dictionary
         """
-        if self.preprocessor['numerical'] is None and self.preprocessor['categorical'] is None:
+        if self.preprocessor['numerical'] is None and len(X["dataset_properties"]["numerical_columns"]) != 0:
             raise ValueError("cant call transform on {} without fitting first."
                              .format(self.__class__.__name__))
         X.update({'imputer': self.preprocessor})
```

**Reviewer:** "can not" is two words :)

**Author:** I did not know that "can not" exists, but "cannot" is preferred for formal settings?
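The intent of the new guard can be sketched in isolation (a hypothetical standalone function, not the project's API): raise only when numerical columns exist but no imputer was fitted, since with zero numerical columns there is nothing to impute.

```python
def ensure_fitted(preprocessor, dataset_properties):
    # Complain only when there are numerical columns to impute but no
    # fitted imputer is available.
    if preprocessor['numerical'] is None and len(dataset_properties['numerical_columns']) != 0:
        raise ValueError("cannot call transform without fitting first.")

# No numerical columns: transform is a no-op, so no error is raised.
ensure_fitted({'numerical': None}, {'numerical_columns': []})

# Numerical columns present but fit() never ran: error.
try:
    ensure_fitted({'numerical': None}, {'numerical_columns': [0, 3]})
except ValueError as err:
    print('raised:', err)
```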
**Reviewer:** This way makes more sense, but it means there is no longer an attribute `self.initial_configurations` if `portfolio_selection` is `None`. While unlikely to matter, this means code like `if self.initial_configurations:` happening elsewhere will now break, as the attribute doesn't exist. I'd keep the default that it is `None` and give it a value `if portfolio_selection is not None`.

**Author:** Yeah, thanks for pointing it out. I'll fix this.
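The reviewer's point can be sketched as follows. The names `initial_configurations` and `portfolio_selection` come from the comment; the surrounding class and the placeholder loading step are invented for illustration.

```python
class Searcher:
    """Hypothetical sketch: always define the attribute, defaulting to None."""

    def __init__(self, portfolio_selection=None):
        # Define the attribute unconditionally, so code elsewhere can safely
        # test `if self.initial_configurations:` without an AttributeError.
        self.initial_configurations = None
        if portfolio_selection is not None:
            # Placeholder for loading configurations from the portfolio.
            self.initial_configurations = [f'config from {portfolio_selection}']

print(Searcher().initial_configurations is None)   # attribute exists, just empty
print(Searcher('greedy').initial_configurations)   # ['config from greedy']
```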