Add missing_only init param to all imputers. missing_only is used in combination w/ the variables param. #698

Morgan-Sell · 2023-09-18T13:56:24Z

Closes #388.

When missing_only=True and variables=None, all numerical variables that do not contain missing values will be omitted from self.variables_.

This functionality does not apply to categorical variables.

Morgan-Sell · 2023-09-22T01:49:22Z

hi @solegalli,

When missing_only=True and variables=None, how should this impact the transformed dataset after transform() is applied? Should the transform() method only return those variables that were identified as having missing values? Or, should transform() return the complete dataset?

I'm guessing the latter because I can't think of a scenario in which we would want to eliminate variables that do not have missing values.

If my intuition is correct, is computational efficiency the sole purpose of this new feature? If not, what are its other objectives?

Morgan-Sell · 2023-09-29T19:42:44Z

hola @solegalli, are you on vacation? Just wanted to see if you saw my question. No rush! I'm accustomed to your rapid responses, so I thought I'd check in.

solegalli · 2023-10-10T14:41:19Z

Hey @Morgan-Sell
I was indeed on vacation. But the funniest thing is that I did see the question before, and I thought I had answered it. LOL. Sorry about that.

transform() in Feature-engine always returns the complete dataframe, unless the transformation adds variables, in which case it returns the complete dataframe plus the new variables.

The aim is not to expand the feature space unnecessarily. If you add indicators for all variables, not just those with NA, you will end up with a lot of variables that just contain the value 0. Is this what you mean by efficiency?

Cheers
Sole
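
A minimal pandas sketch of the point about indicators, not Feature-engine's implementation. The helper `add_indicators` and the `_na` suffix are hypothetical names chosen for illustration; the sketch only shows why indicators for variables without NA end up as all-zero columns.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    {
        "age": [25.0, np.nan, 40.0],
        "income": [50.0, 60.0, 70.0],  # no missing values
        "city": ["Rome", None, "Lima"],
    }
)


def add_indicators(X: pd.DataFrame, missing_only: bool) -> pd.DataFrame:
    """Hypothetical helper: append a binary <name>_na column per selected variable."""
    if missing_only:
        # keep only the variables that actually contain NA
        selected = [col for col in X.columns if X[col].isna().any()]
    else:
        selected = list(X.columns)
    out = X.copy()  # the original columns are always returned
    for col in selected:
        out[f"{col}_na"] = X[col].isna().astype(int)
    return out


all_vars = add_indicators(X, missing_only=False)
na_vars = add_indicators(X, missing_only=True)
```

In this sketch `all_vars` gains an `income_na` column that contains only zeros, while `na_vars` adds indicators for `age` and `city` only. Both results still carry every original column.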

Morgan-Sell · 2023-10-10T22:45:09Z

@solegalli, welcome back! Hope you enjoyed your "holiday" as they say on your side of the pond ;)

Based on your explanation, it seems that the missing_only param solely applies to the AddMissingIndicator class. The other imputers do not add additional variables. They fill in the missing values of the existing variables with the desired value(s).

If this is the case, then should this functionality only reside in the AddMissingIndicator class?

solegalli · 2023-10-11T07:29:26Z

Its most meaningful / useful function is indeed for the AddMissingIndicator class. But we should add it to all imputers. It will play a role when variables=None. At the moment, when variables=None, the transformers learn imputation parameters for all numerical or categorical variables. If we add missing_only=True, they will learn parameters only for those numerical or categorical variables with NA. This is more efficient, stores smaller dictionaries, and avoids unexpected imputations when the model is live.
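
The effect on the learned parameters can be sketched in plain pandas. This is an illustration under stated assumptions, not the code in this PR: `learn_means` and `impute` are hypothetical stand-ins for a mean imputer's fit and transform steps.

```python
import numpy as np
import pandas as pd


def learn_means(X, variables=None, missing_only=False):
    """Hypothetical fit step: return {variable: mean} for the selected variables."""
    if variables is None:
        candidates = list(X.select_dtypes(include="number").columns)
        if missing_only:
            # keep only numerical variables that show NA during fit
            candidates = [c for c in candidates if X[c].isna().any()]
    else:
        candidates = list(variables)
    return {c: X[c].mean() for c in candidates}


def impute(X, params):
    """Hypothetical transform step: fill NA and return the complete dataframe."""
    return X.fillna(value=params)


train = pd.DataFrame(
    {
        "age": [25.0, np.nan, 40.0],
        "income": [50.0, 60.0, 70.0],  # complete during training
        "city": ["Rome", None, "Lima"],
    }
)
live = pd.DataFrame({"age": [np.nan], "income": [np.nan], "city": ["Oslo"]})

params_all = learn_means(train)                    # age and income
params_na = learn_means(train, missing_only=True)  # age only

filled_all = impute(live, params_all)  # income is silently imputed
filled_na = impute(live, params_na)    # income stays NA and can be caught
```

With `missing_only=True` the stored dictionary holds one entry instead of two, and a variable that was complete at fit time is not imputed later without anyone noticing.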

Morgan-Sell · 2023-10-12T20:28:31Z

@solegalli,

I added missing_only, its functionality, and appropriate tests to the ArbitraryNumberImputer and CategoricalEncoder classes. Before I implement the changes in all imputers, will you please validate or nullify what I've done thus far?

feature_engine/_docstrings/init_parameters/imputers.py

feature_engine/imputation/__init__.py

feature_engine/imputation/arbitrary_number.py

feature_engine/imputation/base_imputer.py

solegalli · 2023-10-19T11:38:05Z

feature_engine/imputation/categorical.py

@@ -175,6 +185,10 @@ def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
            # select all variables or check variables entered by the user
            self.variables_ = find_all_variables(X, self.variables)

+        # identify variables with missing values
+        if self.missing_only and self.variables is None:


Same as previous, we need to combine this with find_all_variables so it doesn't search twice.

See my prior response. By implementing the elif approach, we'll encounter a similar issue. This time we only want categorical variables, but if we create a new elif then we'll have both categorical and numerical variables.

solegalli

Hey @Morgan-Sell !

Thank you so much for the changes. This is already well advanced :)

I made a few code suggestions and added a few comments, so here I summarize the main things:

for numerical imputers, when variables=None and missing_only=True, it selects all numerical variables with nan
for categorical imputers, when variables=None and missing_only=True, it selects all categorical variables with nan
for the transformer that adds missing indicators, or the random sample imputer, when variables=None and missing_only=True, it selects all variables with nan

I think the function that you created needs a bit of refinement.
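
The three selection rules summarized above could be sketched as one helper that checks for NA in a single pass. The name `find_variables_with_na` and its `kind` argument are hypothetical, and treating every non-numeric column as categorical is a simplifying assumption (datetime columns, for example, would need separate handling).

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def find_variables_with_na(X, kind="all"):
    """Hypothetical helper: names of columns of the given kind that contain NA.

    kind is "numerical", "categorical" or "all".
    """
    if kind not in ("numerical", "categorical", "all"):
        raise ValueError("kind must be 'numerical', 'categorical' or 'all'")
    has_na = X.isna().any()  # NA check computed once for all columns
    selected = []
    for col in X.columns:
        numeric = is_numeric_dtype(X[col])
        if kind == "numerical" and not numeric:
            continue
        if kind == "categorical" and numeric:
            continue
        if has_na[col]:
            selected.append(col)
    return selected


X = pd.DataFrame(
    {
        "age": [25.0, np.nan, 40.0],
        "income": [50.0, 60.0, 70.0],
        "city": ["Rome", None, "Lima"],
        "country": ["IT", "NO", "PE"],
    }
)

numerical = find_variables_with_na(X, kind="numerical")
categorical = find_variables_with_na(X, kind="categorical")
everything = find_variables_with_na(X, kind="all")
```

Filtering by type and by NA in the same loop avoids a second search over the dataframe and keeps numerical and categorical selections from mixing.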

Morgan-Sell · 2023-10-31T21:32:52Z

hi @solegalli,

I responded to two of your comments. Once we agree on the approach, we will quickly move forward with this PR. It's, more or less, copying/pasting the agreed-upon approach on all transformers.

…es_with_missing_values().

…on between 'ignore_format' and 'missing_only'

Morgan-Sell · 2023-11-07T14:03:18Z

hi @solegalli, I created two global functions to identify either categorical or numerical variables that have missing values. These functions incorporate the "check_or_find_all..." functions.

I have one question about the order of operations in ArbitraryNumberImputer's fit(). See my above comment.

I also saw that DropMissingData() already incorporates the missing_only param. Should we keep it as is? Or, should we change it to match the style of the other imputers, e.g., the transformer imports/applies find_all_variables_with_missing_values()?

In the meantime, I will implement the new code in the other imputation transformers.

…ts passed.

…tests passed.

Morgan-Sell · 2023-11-07T23:25:32Z

hi @solegalli,

AddMissingIndicator() already has the missing_only param. I don't recall coding it ;)

Should I refactor the code so the AddMissingIndicator() code base has a similar structure to the other imputers? Or, keep the code as is?

If you have time, take a look at the other imputers. We're almost there!!!

Morgan-Sell · 2023-11-15T14:33:53Z

hi @solegalli,

Did you see my above questions regarding the AddMissingIndicator() class?

Morgan-Sell added 5 commits September 18, 2023 09:51

initial commit. add 'missing_only' to BaseImputer init

16abaaf

create find_variables_with_missing_values() and unit test. unit test passed.

13a554f

create unit test for BaseImputer missing_only and add BaseImputer to imputation __init__.py

2c0d80b

add 'variables' param to find_varibles_with_missing_vlaues and appropriately revise unit test

99a1567

add find_variables_with_missing_values to ArbitraryNumberImputer

25726fb

Morgan-Sell added 4 commits October 11, 2023 14:12

create test_transformation_when_missing_only_is_true. all arbitrary tests pass

d8d8516

add find_variables_with_missing_values and docstring to CategoricalImputer class

495645d

add test_transformation_when_missing_only_is_true to CategoricalEncoder class. Add missing_only param to CategoricalEncoder __init__. All CategoricalEncoder tests pass.

e8939f4

add test to ensure variables without missing values are not saved/transformed

c49c720

add test to ensure variables without missing values are not saved/transformed

39123e5