
Don't error when DropFeatures gets empty list #567

Closed
sTomerG opened this issue Nov 23, 2022 · 7 comments
sTomerG commented Nov 23, 2022

Is your feature request related to a problem? Please describe.
To more easily grid search over which features to drop using DropFeatures, it would be helpful if DropFeatures didn't raise a ValueError when len(features_to_drop) == 0. That way you can also choose to drop no features in the grid search by passing an empty list as a value.

Describe the solution you'd like
Remove `or len(features_to_drop) == 0` from line 58 in drop_features.py.

Describe alternatives you've considered
Forking the library and removing the line of code myself (which I already did), but I think this improvement could help other people as well.

@solegalli
Collaborator

Hi @sTomerG

Thanks for the suggestion. This is a use of DropFeatures that I had not thought of :)

Going back to my thinking when I designed this class: it was intended for users to remove features based on some domain knowledge. Thus, calling this class and removing no feature (empty list) made no sense. Hence, the error check.

Of course I did not see the potential use in combination with GridSearch.

To try and understand how common this would be, could you explain a bit more of the scenario in which you use it together with a gridsearch? I guess you try different feature combinations, plus the empty list. Why would you enter different feature combinations manually?

Re the code change: I'd say, if you already created a solution (removing the check + test), then go ahead and make a PR.

@sTomerG
Author

sTomerG commented Nov 24, 2022

Thanks for your quick reply @solegalli :). Now that I've thought about it some more, this actually applies to more classes in feature-engine that have a parameter specifying which variables the transformation should be applied to.

With categorical encoding, you might want to grid search whether the model learns more from a feature through ordinal encoding or through mean encoding. This would be possible if the parameter `variables` could be set to an empty list without raising an error (currently it raises `ValueError: list of variables is empty.`), so that you could e.g. create a grid search with:

param_grid = [
    {
        # (...)
        "(...)__meanencoder__variables": [["country"]],
        "(...)__ordinalencoder__variables": [[]],
        # (...)
    },
    {
        # (...)
        "(...)__meanencoder__variables": [[]],
        "(...)__ordinalencoder__variables": [["country"]],
        # (...)
    },
]

This would result in `variables` accepting:
- `None` -> transform automatically chosen variables based on their data type
- `str | List[str]` (etc.) -> transform the specified variables
- `[]` -> don't transform any variables

To me this makes sense, because `variables` is documented as "The list of categorical variables that will be encoded", hence: empty list, no variables encoded.

@solegalli
Collaborator

I see the point for automation. But I also think that removing the "safety net" of alerting the user when no variables will be transformed by the encoder is too dangerous, i.e., the user may think that their variables are being transformed, but they are not.

There could be options with Optuna to sample the parameter space with no variables. Optuna lets you define parameters with its define-by-run API. If you have a real code example, we could try to figure out how to do it, and then maybe add an example to the documentation for others to look at.

@sTomerG
Author

sTomerG commented Nov 29, 2022

Hi!

I've found a solution; I've created a meta-estimator which can turn other estimators/transformers in a Pipeline on or off :)

from typing import Any, Optional, Union

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class EstimatorSwitch(TransformerMixin, BaseEstimator):
    """A meta-estimator that can turn an estimator in a scikit-learn
    Pipeline on or off.

    Parameters
    ----------
    estimator : Any
        The estimator to turn on or off.
    apply : bool
        Whether to apply the estimator, by default True.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.pipeline import make_pipeline
    >>> from sklearn.impute import SimpleImputer
    >>> from extra_ds_tools.ml.sklearn.meta_estimators import EstimatorSwitch
    >>> X = np.array([np.nan, 10] * 2).reshape(-1, 1)
    >>> # With apply=False the SimpleImputer is skipped, so the nans remain
    >>> # instead of being imputed with the mean (10).
    >>> pipeline = make_pipeline(EstimatorSwitch(SimpleImputer(), apply=False))
    >>> print(pipeline.fit_transform(X))
    [[nan]
     [10.]
     [nan]
     [10.]]
    """

    def __init__(self, estimator: Any, *, apply: bool = True):
        self.estimator = estimator
        self.apply = apply

    def fit(
        self,
        X: Union[pd.DataFrame, pd.Series, np.ndarray],
        y: Optional[Union[pd.Series, np.ndarray]] = None,
        **fit_params,
    ):
        """Fit the estimator on the data if self.apply is True.

        Parameters
        ----------
        X : Union[pd.DataFrame, pd.Series, np.ndarray]
            Train data.
        y : Optional[Union[pd.Series, np.ndarray]], optional
            Target data, by default None.

        Returns
        -------
        EstimatorSwitch
            EstimatorSwitch with a fitted estimator if self.apply is True.
        """
        if self.apply:
            self.estimator.fit(X, y, **fit_params)
        # Always return self: scikit-learn expects fit() to return the
        # transformer itself, not the wrapped estimator.
        return self

    def transform(self, X: Union[pd.DataFrame, pd.Series, np.ndarray]):
        """Return the transformed X if self.apply is True, else X unchanged.

        Parameters
        ----------
        X : Union[pd.DataFrame, pd.Series, np.ndarray]
            Train data.

        Returns
        -------
        Union[pd.DataFrame, pd.Series, np.ndarray]
            Transformed X, or X untouched when self.apply is False.
        """
        if self.apply:
            return self.estimator.transform(X)
        return X
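For completeness, a minimal usage sketch of the toggle (the class is re-declared in compact form here so the snippet is self-contained; in a grid search you would put `"imputer__apply": [True, False]` in the param grid instead of calling `set_params` by hand):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class EstimatorSwitch(TransformerMixin, BaseEstimator):
    # Compact version of the class above: skip the wrapped estimator
    # entirely when apply is False.
    def __init__(self, estimator, *, apply=True):
        self.estimator = estimator
        self.apply = apply

    def fit(self, X, y=None, **fit_params):
        if self.apply:
            self.estimator.fit(X, y, **fit_params)
        return self

    def transform(self, X):
        return self.estimator.transform(X) if self.apply else X


X = np.array([np.nan, 10.0] * 3).reshape(-1, 1)
pipe = Pipeline([("imputer", EstimatorSwitch(SimpleImputer()))])

imputed = pipe.fit_transform(X)        # nans replaced by the column mean (10.0)
pipe.set_params(imputer__apply=False)  # toggle the step off
passed_through = pipe.fit_transform(X) # X returned untouched, nans intact
```

Worth noting: scikit-learn Pipelines also support a similar toggle natively, since a step can be replaced by the string `"passthrough"` via `set_params` or a param grid (e.g. `{"imputer": ["passthrough", SimpleImputer()]}`).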

@solegalli
Collaborator

I guess this is resolved now, so I am closing this issue. Thanks for sharing your implementation and solution @sTomerG

@sTomerG
Author

sTomerG commented Dec 7, 2022

Yes, closing is fine! @solegalli if you're interested, in this article I've explained strategies for using feature-engine's transformers in a grid search :)

@solegalli
Collaborator

Thank you @sTomerG ! I shared on social :p
