
[ENH] Possibility to fit base_selector with one feature for automated pipelines. #566

Closed
MatheusHam opened this issue Nov 23, 2022 · 5 comments


@MatheusHam

MatheusHam commented Nov 23, 2022

    def _check_variable_number(self) -> None:
        """Check that there are multiple variables for the selectors to work with."""
        if len(self.variables_) < 2:
            raise ValueError(
                "The selector needs at least 2 or more variables to select from. "
                f"Got only 1 variable: {self.variables_}."
            )
When using SuperVectorizer (dirty_cat) to develop an AutoML solution, I ran into a situation where a master table has only one numerical feature, which raises the error shown above. In this case I was using DropDuplicateFeatures specifically and wanted the process to move on with the only feature given.

Proposal:

Add a parameter to base_selector that could prevent such raises for automated pipelines.
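A minimal sketch of what that option could look like. The `errors` parameter and the stripped-down class below are illustrative assumptions, not the actual feature_engine API:

```python
# Hypothetical sketch of the proposed option; the parameter name `errors`
# and this minimal class are illustrative, not feature_engine code.
class BaseSelector:
    def __init__(self, variables, errors="raise"):
        self.variables_ = list(variables)
        self.errors = errors  # "raise" (current behaviour) or "ignore"

    def _check_variable_number(self):
        """Raise only when raising is requested; otherwise pass through."""
        if len(self.variables_) < 2 and self.errors == "raise":
            raise ValueError(
                "The selector needs at least 2 or more variables to select from. "
                f"Got only 1 variable: {self.variables_}."
            )

# With errors="ignore", fitting on a single feature would no longer
# break an automated pipeline:
BaseSelector(["price"], errors="ignore")._check_variable_number()  # no error
```

With the default `errors="raise"`, the current behaviour is preserved, so existing pipelines would be unaffected.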

@solegalli
Collaborator

Hi @MatheusHam

Thanks for your engagement in the project!

I don't understand the issue though. If you want to remove duplicated features from a dataset (which is what DropDuplicateFeatures is designed to do), you need a dataset with at least 2 features. If the dataset has only 1 feature, it certainly won't be duplicated. So why would we use DropDuplicateFeatures on a dataframe with 1 variable?

Maybe there is something in your process that I am missing? Could you explain the end-to-end process a bit more, or link/add some code or examples?

Thank you!

@MatheusHam
Author

Absolutely!

I have been working on an AutoML solution in which we apply a SuperVectorizer together with some feature_engine selectors. The SuperVectorizer (dirty_cat) separates the features of our master table by their types and applies a transformation pipeline to each group, e.g.:

    def get_vectorizer(
        self,
        datetime_pipeline: Encoders = make_pipeline(
            DropConstantFeatures(tol=0.998, missing_values="ignore"),
            DatetimeFeatures(
                missing_values="ignore", features_to_extract=DATETIME_FEATURES
            ),
        ),
        low_card_transformer: Encoders = make_pipeline(
            DropConstantFeatures(tol=0.998, missing_values="ignore"),
            SimpleImputer(
                missing_values=np.nan,
                add_indicator=True,
                strategy="constant",
                fill_value="missing",
            ),
            OrdinalEncoder(
                handle_unknown="use_encoded_value", unknown_value=-1
            ),
        ),
        high_card_transformer: Encoders = make_pipeline(
            DropConstantFeatures(tol=0.998, missing_values="ignore"),
            GapEncoder(hashing=True, random_state=RANDOM_STATE),
        ),
        numerical_transformer: NumTransformer = make_pipeline(
            DropConstantFeatures(tol=0.998, missing_values="ignore"),
            DropDuplicateFeatures(),
            SimpleImputer(
                missing_values=np.nan, add_indicator=True, strategy="median"
            ),
        ),
    ) -> Pipeline:
   
        return SuperVectorizer(
            auto_cast=True,
            n_jobs=N_CORES,
            low_card_cat_transformer=low_card_transformer,
            high_card_cat_transformer=high_card_transformer,
            numerical_transformer=numerical_transformer,
            datetime_transformer=datetime_pipeline,
            impute_missing="force",
            remainder="drop",
        )

Sometimes, a user tries to use our automated pipeline with a master_table that contains only one numerical feature thus raising the error mentioned above.

The idea is that for such automated cases when there is no selection to be done, like the example given above, the pipeline would continue with the single feature instead of raising. The solution could be a boolean parameter to continue execution in such cases.
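To make the requested behaviour concrete, here is a minimal stand-in in plain Python (not feature_engine code) for a duplicate-column dropper that simply passes a single column through instead of raising:

```python
# Minimal stand-in illustrating the requested behaviour (not feature_engine
# code): a duplicate-column dropper that is a no-op on a single column.
def drop_duplicate_columns(table):
    """table: dict mapping column name -> list of values.

    Keeps the first occurrence of each distinct column of values and
    drops exact duplicates. With one column there is nothing to drop,
    so the table is returned unchanged rather than raising.
    """
    seen = set()
    kept = {}
    for name, values in table.items():
        key = tuple(values)
        if key not in seen:  # first occurrence wins
            seen.add(key)
            kept[name] = values
    return kept

# One numerical column: nothing can be duplicated, pipeline moves on.
drop_duplicate_columns({"amount": [1.0, 2.0, 3.0]})
# Two identical columns: the duplicate is dropped.
drop_duplicate_columns({"a": [1, 2], "a_copy": [1, 2]})
```

The point is only that "no possible duplicates" can be treated as "nothing to select" rather than as an error in automated settings.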

Let me know if this makes sense. I appreciate your response!

@solegalli
Collaborator

solegalli commented Nov 24, 2022

Cool example, thank you! Very interesting way of combining transformers :)

My first thought is, if we remove that check in the base_selector, it will affect other transformers as well, so we need to think carefully about the consequences of doing this.

Looking at your pipeline, I guess you pass different datasets (containing datetime vars, low cardinality vars, high cardinality vars and numerical vars) to each one of the params from get_vectorizer(), is this correct?

If this is the case, could you not add a switch to the pipeline to check for the length of the list with numerical variables, and if it is 1, then remove the selector from the pipeline?

Something like this:

    def get_vectorizer(
        self,
        datetime_pipeline: Encoders = make_pipeline(
            DropConstantFeatures(tol=0.998, missing_values="ignore"),
            DatetimeFeatures(
                missing_values="ignore", features_to_extract=DATETIME_FEATURES
            ),
        ),
        low_card_transformer: Encoders = make_pipeline(
            DropConstantFeatures(tol=0.998, missing_values="ignore"),
            SimpleImputer(
                missing_values=np.nan,
                add_indicator=True,
                strategy="constant",
                fill_value="missing",
            ),
            OrdinalEncoder(
                handle_unknown="use_encoded_value", unknown_value=-1
            ),
        ),
        high_card_transformer: Encoders = make_pipeline(
            DropConstantFeatures(tol=0.998, missing_values="ignore"),
            GapEncoder(hashing=True, random_state=RANDOM_STATE),
        ),
        numerical_transformer: NumTransformer = make_pipeline(
            DropConstantFeatures(tol=0.998, missing_values="ignore"),
            DropDuplicateFeatures(),
            SimpleImputer(
                missing_values=np.nan, add_indicator=True, strategy="median"
            ),
        ),
        switch: bool = False,
    ) -> Pipeline:

        # If there is only one numerical feature, rebuild the numerical
        # pipeline without DropDuplicateFeatures.
        if switch is True:
            numerical_transformer = make_pipeline(
                DropConstantFeatures(tol=0.998, missing_values="ignore"),
                SimpleImputer(
                    missing_values=np.nan, add_indicator=True, strategy="median"
                ),
            )

        return SuperVectorizer(
            auto_cast=True,
            n_jobs=N_CORES,
            low_card_cat_transformer=low_card_transformer,
            high_card_cat_transformer=high_card_transformer,
            numerical_transformer=numerical_transformer,
            datetime_transformer=datetime_pipeline,
            impute_missing="force",
            remainder="drop",
        )

@MatheusHam
Author

We do something similar to what you suggested; I appreciate your answer.

Yes, that would definitely require special attention. Some data analysts with little Python experience are using our AutoML tool to experiment on their own; eventually, they may fail to build a good first master table for their use case, and we do not want to impact their experience by raising errors they can't understand.

I could argue that if we are trying to apply specifically DropDuplicateFeatures() and the given DataFrame contains just one column, then it should just move on, because there are no duplicates. Even though I understand why it is reasonable to prevent this attempt, an option for the use case I described does not seem so absurd.

That is the reason I suggested something on the base_selector: not changing the way it works right now, just adding an option to relax checks that break automated pipelines.

If you still think it does not make sense, feel free to close it! I'll be back with contributions soon! 👍

@solegalli
Collaborator

Hi @MatheusHam

Thanks for the input. I really appreciate it.

As it stands now, I get the impression that this is not a widespread enough use of the transformer for us to change the API.

We could pick it up at a later stage if more users would like DropDuplicateFeatures() and/or other selectors not to raise the error on single-column dataframes.

Thank you!
