Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add missing_only init param to all imputers. missing_only is used in combination w/ the variables param. #698

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

Morgan-Sell
Copy link
Collaborator

Closes #388.

When missing_only=True and variables=None, all numerical variables that do not contain missing values will be omitted from self.variables_.

This functionality does not apply to categorical variables.

@Morgan-Sell
Copy link
Collaborator Author

hi @solegalli,

When missing_only=True and variables=None, how should this impact the transformed dataset after transform() is applied? Should the transform() method only return those variables that were identified as having missing values? Or, should transform() return the complete dataset?

I'm guessing the latter because I can't think of a scenario in which we would want to eliminate variables that do not have missing values.

If my intuition is correct, is the sole purpose of this new feature for computational efficiency? If not, what are the other objectives of this new feature?

@Morgan-Sell
Copy link
Collaborator Author

hola @solegalli, are you on vacation? Just wanted to see if you saw my question. No rush! I'm accustomed to your rapid responses, so I thought I check-in.

@solegalli
Copy link
Collaborator

Hey @Morgan-Sell
I was indeed on vacation. But the most funny thing is that I did see the question before, and I thought I answered it. LOL. Sorry about that.

transform() in feature engine always returns the complete dataset, unless the transformation adds variables, in which case it would return the complete dataframe plus the new variables.

The aim is not to expand the feature space unnecessarily. If you add indicators for all variables, not just those with NA, you will end up with a lot of variables that just contain the value 0. Is this what you mean by efficiency?

Cheers
Sole

@Morgan-Sell
Copy link
Collaborator Author

@solegalli, welcome back! Hope you enjoyed your "holiday" as they say on your side of the pond ;)

Based on your explanation, it seems that the missing_only param solely applies to the AddMissingIndicator class. The other imputers do not add additional variables. They fill in the missing values of the existing variables with the desired value(s).

If this is the case, then should this functionality only reside in the AddMissingIndicator class?

@solegalli
Copy link
Collaborator

Its most meaningful / useful function is indeed for the MissingIndicator class. But we should add it in all imputers. It will play a role when variables =None. At the moment, when variables=None the transformers will learn imputation parameters for all numerical or categorical variables. If we add missing_only=True, they will learn parameters only for those numerical or categorical with NA. More efficient, smaller dictionaries stored, and avoids unexpected imputations when the model is live.

@Morgan-Sell
Copy link
Collaborator Author

@solegalli,

I added missing_value, its functionality, and appropriate tests to the ArbitraryNumberImputer and CategorialEncoder classes. Before implementing the changes in all imputers, will you please validate or nullify what I've done thus far?

@@ -175,6 +185,10 @@ def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
# select all variables or check variables entered by the user
self.variables_ = find_all_variables(X, self.variables)

# identify variables with missing values
if self.missing_only and self.variables is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as previous, we need to combine this with find_all_variables so it doesnt search twice.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my prior response. By implementing the elif approach, we'll encounter a similar issue. This time we only want categorical variables, but if we create a new elif then we'll have both categorical and numerical variables

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Morgan-Sell !

Thank you so much for the changes. This is well ahead in the development :)

I made a few code suggestions and added a few comments, so here I summarize the main things:

  • for numerical imputers, when variables=None and missing=True, it selects all numerical variables with nan
  • for categorical imputers, when variables=None and missing=True, it selects all categorical variables with nan
  • for the transformer that adds missing indicators, or the random sample imputer, when variables=None and missing=True, it selects all all variables with nan

I think the function that you created needs a bit of refinement.

@Morgan-Sell
Copy link
Collaborator Author

hi @solegalli,

I responded to two of your comments. Once we agree on the approach, we will quickly move forward with this PR. It's, more or less, copying/pasting the agreed-upon approach on all transformers.

@Morgan-Sell
Copy link
Collaborator Author

Morgan-Sell commented Nov 7, 2023

hi @solegalli, i created two global functions to identify either categorical or numerical variables that have missing values. These functions incorporate the "check_or_find_all..." functions.

I have one question about the ArbitraryImputer's fit() order of operations. See my above comment.

I also saw that DropMissingData() already incorporates the missing_value param. Should we keep it as is? Or, should we change to match the style of the other imputers, e.g., transformer imports/applies find_all_variables_with_missing_values().

In the meantime, I will implement the new code in the other imputation transformers.

@Morgan-Sell
Copy link
Collaborator Author

hi @solegalli,

AddMissingIndicator() already has the missing_only param. I don't recall coding it ;)

Should I refactor the code so the AddMissingIndicator() code base has a similar structure to the other imputers? Or, keep the code as is?

If you have time, take a look at the other imputers. We're almost there!!!

@Morgan-Sell
Copy link
Collaborator Author

hi @solegalli,

Did you see my above questions regarding the AddMissingIndicator() class?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add missing_only=True to all imputers to use in combination with variables=None
2 participants