
Don't error when DropFeatures gets empty list #567

Closed
sTomerG opened this issue Nov 23, 2022 · 7 comments
sTomerG commented Nov 23, 2022

Is your feature request related to a problem? Please describe.
To more easily grid search over which features to drop using DropFeatures, it would be helpful if DropFeatures didn't raise a ValueError when len(features_to_drop) == 0. That way you can also choose to drop no features in the grid search by passing an empty list as a value.

Describe the solution you'd like
Remove `or len(features_to_drop) == 0` from line 58 in drop_features.py.

Describe alternatives you've considered
Forking the library and removing the line of code myself (which I already did), but I think this improvement could help other people as well.

@solegalli
Collaborator

Hi @sTomerG

Thanks for the suggestion. This is a use of DropFeatures that I had not thought of :)

Going back to my thinking when I designed this class: it was intended for users to remove features based on some domain knowledge. Thus, calling this class and removing no feature (empty list) made no sense. Hence, the error check.

Of course I did not see the potential use in combination with GridSearch.

To try and understand how common this would be, could you explain a bit more of the scenario in which you use it together with a gridsearch? I guess you try different feature combinations, plus the empty list. Why would you enter different feature combinations manually?

Re the code change: I'd say, if you already created a solution (removing the check + test), then go ahead and make a PR.

@sTomerG
Author

sTomerG commented Nov 24, 2022

Thanks for your quick reply @solegalli :). Now that I've thought about it some more, this actually applies to more classes in feature-engine that have a parameter specifying which variables the transformation should be applied to.

With categorical encoding, you might want to grid search whether the model learns more from a feature through ordinal encoding or through mean encoding. This would be possible if the parameter `variables` could be set to an empty list without raising an error (currently it raises `ValueError: list of variables is empty.`), so that you could e.g. create a grid search with:

param_grid = [
    {
        # (...)
        "(...)__meanencoder__variables": [["country"]],
        "(...)__ordinalencoder__variables": [[]],
        # (...)
    },
    {
        # (...)
        "(...)__meanencoder__variables": [[]],
        "(...)__ordinalencoder__variables": [["country"]],
        # (...)
    },
]

This would result in `variables` accepting:
- `None` -> transform automatically chosen variables based on their data type
- `str | List[str]` (etc.) -> transform the specified variables
- `[]` -> don't transform any variables

To me this makes sense, because `variables` is documented as "The list of categorical variables that will be encoded", hence: empty list, no variables encoded.

@solegalli
Collaborator

I see the point for automation. But I also think that removing the "safety net" of alerting the user when no variables will be transformed by the encoder is too dangerous, i.e., the user may think that their variables are being transformed, but they are not.

There could be options with Optuna to sample the parameter space with no variables. Optuna lets you define parameters with its define-by-run API. If you have a real code example, we could try to figure out how to do it, and then maybe add an example to the documentation for others to look at.

@sTomerG
Author

sTomerG commented Nov 29, 2022

Hi!

I've found a solution; I've created a meta-estimator which can turn other estimators/transformers in a Pipeline on or off :)

from typing import Any, Optional, Union

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class EstimatorSwitch(TransformerMixin, BaseEstimator):
    """A meta-estimator that can turn an estimator in a scikit-learn
    Pipeline on or off.

    Parameters
    ----------
    estimator : Any
        The estimator to turn on or off.
    apply : bool
        Whether to apply the estimator, by default True.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.pipeline import make_pipeline
    >>> from sklearn.impute import SimpleImputer
    >>> from extra_ds_tools.ml.sklearn.meta_estimators import EstimatorSwitch
    >>> X = np.array([np.nan, 10] * 2).reshape(-1, 1)
    >>> # With apply=False the SimpleImputer is skipped, so the nans remain
    >>> # instead of being imputed with the mean (10).
    >>> pipeline = make_pipeline(EstimatorSwitch(SimpleImputer(), apply=False))
    >>> print(pipeline.fit_transform(X))
    [[nan]
     [10.]
     [nan]
     [10.]]
    """

    def __init__(self, estimator: Any, *, apply: bool = True):
        self.estimator = estimator
        self.apply = apply

    def fit(
        self,
        X: Union[pd.DataFrame, pd.Series, np.ndarray],
        y: Optional[Union[pd.Series, np.ndarray]] = None,
        **fit_params,
    ):
        """Fit the estimator on the data if self.apply is True.

        Parameters
        ----------
        X : Union[pd.DataFrame, pd.Series, np.ndarray]
            Train data.
        y : Optional[Union[pd.Series, np.ndarray]], optional
            Target data, by default None.

        Returns
        -------
        EstimatorSwitch
            EstimatorSwitch with a fitted estimator if self.apply is True.
        """
        if self.apply:
            self.estimator.fit(X, y, **fit_params)
        # Always return self: scikit-learn expects fit() to return the
        # transformer itself, not the wrapped estimator.
        return self

    def transform(self, X: Union[pd.DataFrame, pd.Series, np.ndarray]):
        """Return the transformed X if self.apply is True, else X unchanged.

        Parameters
        ----------
        X : Union[pd.DataFrame, pd.Series, np.ndarray]
            Train data.

        Returns
        -------
        Union[pd.DataFrame, pd.Series, np.ndarray]
            Transformed X, or X untouched when self.apply is False.
        """
        if self.apply:
            return self.estimator.transform(X)
        return X
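For completeness, a minimal usage sketch of the toggle (the class is re-declared in compact form here so the snippet is self-contained; in a grid search you would put `"imputer__apply": [True, False]` in the param grid instead of calling `set_params` by hand):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class EstimatorSwitch(TransformerMixin, BaseEstimator):
    # Compact version of the class above: skip the wrapped estimator
    # entirely when apply is False.
    def __init__(self, estimator, *, apply=True):
        self.estimator = estimator
        self.apply = apply

    def fit(self, X, y=None, **fit_params):
        if self.apply:
            self.estimator.fit(X, y, **fit_params)
        return self

    def transform(self, X):
        return self.estimator.transform(X) if self.apply else X


X = np.array([np.nan, 10.0] * 3).reshape(-1, 1)
pipe = Pipeline([("imputer", EstimatorSwitch(SimpleImputer()))])

imputed = pipe.fit_transform(X)        # nans replaced by the column mean (10.0)
pipe.set_params(imputer__apply=False)  # toggle the step off
passed_through = pipe.fit_transform(X) # X returned untouched, nans intact
```

Worth noting: scikit-learn Pipelines also support a similar toggle natively, since a step can be replaced by the string `"passthrough"` via `set_params` or a param grid (e.g. `{"imputer": ["passthrough", SimpleImputer()]}`).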

@solegalli
Collaborator

I guess this is resolved now, so I am closing this issue. Thanks for sharing your implementation and solution @sTomerG

@sTomerG
Author

sTomerG commented Dec 7, 2022

Yes, closing is fine! @solegalli if you're interested, in this article I've explained strategies for using feature-engine's transformers in a grid search :)

@solegalli
Collaborator

Thank you @sTomerG ! I shared on social :p
