
[ENH] Possibility to fit base_selector with one feature for automated pipelines. #566

Closed
MatheusHam opened this issue Nov 23, 2022 · 5 comments


@MatheusHam

MatheusHam commented Nov 23, 2022

    def _check_variable_number(self) -> None:
        """Check that there are multiple variables for the selectors to work with."""
        if len(self.variables_) < 2:
            raise ValueError(
                "The selector needs at least 2 or more variables to select from. "
                f"Got only 1 variable: {self.variables_}."
            )
When using SuperVectorizer (dirty_cat) to develop an AutoML solution, I ran into a situation where a master table has only one numerical feature, which raises the error shown above. In this case I was using DropDuplicateFeatures specifically and wanted the process to move on with the only feature given.

Proposal:

Add a parameter to base_selector that could prevent such raises for automated pipelines.
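A minimal sketch of what that option could look like. The `errors` parameter and the stripped-down class below are illustrative assumptions, not the actual feature_engine API:

```python
# Hypothetical sketch of the proposed option; the parameter name `errors`
# and this minimal class are illustrative, not feature_engine code.
class BaseSelector:
    def __init__(self, variables, errors="raise"):
        self.variables_ = list(variables)
        self.errors = errors  # "raise" (current behaviour) or "ignore"

    def _check_variable_number(self):
        """Raise only when raising is requested; otherwise pass through."""
        if len(self.variables_) < 2 and self.errors == "raise":
            raise ValueError(
                "The selector needs at least 2 or more variables to select from. "
                f"Got only 1 variable: {self.variables_}."
            )

# With errors="ignore", fitting on a single feature would no longer
# break an automated pipeline:
BaseSelector(["price"], errors="ignore")._check_variable_number()  # no error
```

With the default `errors="raise"`, the current behaviour is preserved, so existing pipelines would be unaffected.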

@solegalli
Collaborator

Hi @MatheusHam

Thanks for your engagement in the project!

I don't understand the issue though. If you want to remove duplicated features from a dataset (which is what DropDuplicateFeatures is designed to do), you need a dataset with at least 2 features. If the dataset has only 1 feature, it certainly won't be duplicated. So why would we use DropDuplicateFeatures on a dataframe with 1 variable?

Maybe there is something in your process that I am missing? Could you explain the end-to-end process a bit more, or link/add some code or examples?

Thank you!

@MatheusHam
Author

Absolutely!

I have been working on an AutoML solution in which we apply a SuperVectorizer together with some feature_engine selectors. The SuperVectorizer (dirty_cat) separates the features of our master table by their types and applies a transformation pipeline to each group, e.g.:

    def get_vectorizer(
        self,
        datetime_pipeline: Encoders = make_pipeline(
            DropConstantFeatures(tol=0.998, missing_values="ignore"),
            DatetimeFeatures(
                missing_values="ignore", features_to_extract=DATETIME_FEATURES
            ),
        ),
        low_card_transformer: Encoders = make_pipeline(
            DropConstantFeatures(tol=0.998, missing_values="ignore"),
            SimpleImputer(
                missing_values=np.nan,
                add_indicator=True,
                strategy="constant",
                fill_value="missing",
            ),
            OrdinalEncoder(
                handle_unknown="use_encoded_value", unknown_value=-1
            ),
        ),
        high_card_transformer: Encoders = make_pipeline(
            DropConstantFeatures(tol=0.998, missing_values="ignore"),
            GapEncoder(hashing=True, random_state=RANDOM_STATE),
        ),
        numerical_transformer: NumTransformer = make_pipeline(
            DropConstantFeatures(tol=0.998, missing_values="ignore"),
            DropDuplicateFeatures(),
            SimpleImputer(
                missing_values=np.nan, add_indicator=True, strategy="median"
            ),
        ),
    ) -> Pipeline:
   
        return SuperVectorizer(
            auto_cast=True,
            n_jobs=N_CORES,
            low_card_cat_transformer=low_card_transformer,
            high_card_cat_transformer=high_card_transformer,
            numerical_transformer=numerical_transformer,
            datetime_transformer=datetime_pipeline,
            impute_missing="force",
            remainder="drop",
        )

Sometimes, a user tries to use our automated pipeline with a master_table that contains only one numerical feature thus raising the error mentioned above.

The idea is that for such automated cases when there is no selection to be done, like the example given above, the pipeline would continue with the single feature instead of raising. The solution could be a boolean parameter to continue execution in such cases.
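To make the requested behaviour concrete, here is a minimal stand-in in plain Python (not feature_engine code) for a duplicate-column dropper that simply passes a single column through instead of raising:

```python
# Minimal stand-in illustrating the requested behaviour (not feature_engine
# code): a duplicate-column dropper that is a no-op on a single column.
def drop_duplicate_columns(table):
    """table: dict mapping column name -> list of values.

    Keeps the first occurrence of each distinct column of values and
    drops exact duplicates. With one column there is nothing to drop,
    so the table is returned unchanged rather than raising.
    """
    seen = set()
    kept = {}
    for name, values in table.items():
        key = tuple(values)
        if key not in seen:  # first occurrence wins
            seen.add(key)
            kept[name] = values
    return kept

# One numerical column: nothing can be duplicated, pipeline moves on.
drop_duplicate_columns({"amount": [1.0, 2.0, 3.0]})
# Two identical columns: the duplicate is dropped.
drop_duplicate_columns({"a": [1, 2], "a_copy": [1, 2]})
```

The point is only that "no possible duplicates" can be treated as "nothing to select" rather than as an error in automated settings.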

Let me know if this makes sense. I appreciate your response!

@solegalli
Collaborator

solegalli commented Nov 24, 2022

Cool example, thank you! Very interesting way of combining transformers :)

My first thought is, if we remove that check in the base_selector, it will affect other transformers as well, so we need to think carefully about the consequences of doing this.

Looking at your pipeline, I guess you pass different datasets (containing datetime vars, low cardinality vars, high cardinality vars and numerical vars) to each one of the params from get_vectorizer(), is this correct?

If this is the case, could you not add a switch to the pipeline to check for the length of the list with numerical variables, and if it is 1, then remove the selector from the pipeline?

Something like this:

    def get_vectorizer(
        self,
        datetime_pipeline: Encoders = make_pipeline(
            DropConstantFeatures(tol=0.998, missing_values="ignore"),
            DatetimeFeatures(
                missing_values="ignore", features_to_extract=DATETIME_FEATURES
            ),
        ),
        low_card_transformer: Encoders = make_pipeline(
            DropConstantFeatures(tol=0.998, missing_values="ignore"),
            SimpleImputer(
                missing_values=np.nan,
                add_indicator=True,
                strategy="constant",
                fill_value="missing",
            ),
            OrdinalEncoder(
                handle_unknown="use_encoded_value", unknown_value=-1
            ),
        ),
        high_card_transformer: Encoders = make_pipeline(
            DropConstantFeatures(tol=0.998, missing_values="ignore"),
            GapEncoder(hashing=True, random_state=RANDOM_STATE),
        ),
        numerical_transformer: NumTransformer = make_pipeline(
            DropConstantFeatures(tol=0.998, missing_values="ignore"),
            DropDuplicateFeatures(),
            SimpleImputer(
                missing_values=np.nan, add_indicator=True, strategy="median"
            ),
        ),
        switch: bool = False,
    ) -> Pipeline:

        # If there is only one numerical feature, rebuild the numerical
        # pipeline without DropDuplicateFeatures.
        if switch is True:
            numerical_transformer = make_pipeline(
                DropConstantFeatures(tol=0.998, missing_values="ignore"),
                SimpleImputer(
                    missing_values=np.nan, add_indicator=True, strategy="median"
                ),
            )

        return SuperVectorizer(
            auto_cast=True,
            n_jobs=N_CORES,
            low_card_cat_transformer=low_card_transformer,
            high_card_cat_transformer=high_card_transformer,
            numerical_transformer=numerical_transformer,
            datetime_transformer=datetime_pipeline,
            impute_missing="force",
            remainder="drop",
        )

@MatheusHam
Author

We do something similar to what you suggested; I appreciate your answer.

Yes, that would definitely require special attention. Some data analysts with little Python experience are using our AutoML tool to experiment on their own; eventually, they may fail to build a good first master table for their use case, and we do not want to impact their experience by raising errors they can't understand.

I could argue that if we are trying to apply specifically DropDuplicateFeatures() and the given DataFrame contains just one column, then it should just move on, because there are no duplicates. Even though I understand why it is reasonable to prevent this attempt, an option for the use case I described does not seem so absurd.

That is the reason I suggested something on the base_selector: not changing the way it works right now, just adding an option to relax checks that break automated pipelines.

If you still think it does not make sense, feel free to close it! I'll be back with contributions soon! 👍

@solegalli
Collaborator

Hi @MatheusHam

Thanks for the input. I really appreciate it.

As it stands now, I get the impression that this is not a widespread enough use of the transformer for us to change the API.

We could pick it up at a later stage if more users would like DropDuplicateFeatures() and/or other selectors not to raise the error on single-column dataframes.

Thank you!
