Fixing issues with imbalanced datasets #197

ArlindKadra · 2021-05-04T17:58:07Z

This PR fixes issue #195

ravinkohli

Hey Arlind, thanks for the PR. It's essential for the corner case where a column has all NaNs in train but not in test, or vice versa. Could you add tests for the validator to ensure we handle these cases properly? It would be great if we could test two cases: 1. all NaNs in train but not in test, and 2. all NaNs in test but not in train.

ArlindKadra · 2021-05-05T09:03:26Z

Yep, I will do just that today. Overall, though, I wanted to get a quick opinion on how the problem is handled now, in case I missed something in the fix.

ravinkohli · 2021-05-07T11:17:33Z

autoPyTorch/data/tabular_feature_validator.py

+            # for the test set, if we have columns with only null values
+            # they will probably have a numeric type. If these columns were not
+            # with only null values in the train set, they should be converted
+            # to the category type.


Suggested change

-            # to the category type.
+            # to the type it had while fitting.

I don't think we should wait for the tests to merge, in case you choose to incorporate this suggestion.

Yep, we can also merge it.

I also changed the comment accordingly.
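For readers following this thread, the behaviour described in the code comment can be sketched as below. This is a minimal illustration rather than the validator's actual code: the frames, the `fitted_dtypes` dictionary, and the `align_to_fitted_dtypes` helper are all hypothetical names introduced here.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_dtype_equal

# Hypothetical data: column "A" is categorical in train,
# but contains only missing values in test.
X_train = pd.DataFrame({
    "A": pd.Series(["a", "b", None], dtype="category"),
    "B": [1.0, 2.0, 3.0],
})
X_test = pd.DataFrame({"A": [np.nan, np.nan], "B": [4.0, 5.0]})

# An all-NaN column built from plain values is inferred as float64.
print(X_test["A"].dtype)

# Assumed bookkeeping: remember each column's dtype at fit time.
fitted_dtypes = X_train.dtypes.to_dict()


def align_to_fitted_dtypes(X, fitted_dtypes):
    """Cast all-null columns back to the dtype they had while fitting."""
    X = X.copy()
    for column, dtype in fitted_dtypes.items():
        all_null = X[column].isna().all()
        if all_null and not is_dtype_equal(X[column].dtype, dtype):
            X[column] = X[column].astype(dtype)
    return X


X_test_aligned = align_to_fitted_dtypes(X_test, fitted_dtypes)
print(X_test_aligned.dtypes)
```

Casting to the fitted dtype, rather than always to `category`, is what the suggested comment wording describes: the column keeps its missing values but regains the type seen during training.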

ravinkohli · 2021-05-07T11:18:45Z

autoPyTorch/data/tabular_feature_validator.py

@@ -414,7 +422,7 @@ def infer_objects(self, X: pd.DataFrame) -> pd.DataFrame:
                    # In the case train data was interpreted as int
                    # and test data was interpreted as float, because of 0.0
                    # for example, honor training data
-                    X[key] = X[key].applymap(np.int64)
+                    X[key] = X[key].astype(np.int64)


Should we remove the if-else statement, since we are treating both branches equally?

Yep, nice catch.
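As a small illustration of the cast in the diff above (a sketch with a made-up column name, not code from this PR), `astype(np.int64)` converts a column that pandas inferred as float back to the integer type seen in training:

```python
import numpy as np
import pandas as pd

# Hypothetical frames: the same column is inferred as int64 in train
# and as float64 in test because the test values were written as 0.0, 4.0, ...
train = pd.DataFrame({"count": [1, 2, 3]})
test = pd.DataFrame({"count": [0.0, 4.0, 5.0]})

# Honor the training dtype by casting the whole column at once.
test["count"] = test["count"].astype(np.int64)
print(test["count"].dtype)
```

Note that `astype` to an integer type truncates any fractional part, so this only preserves the data when the float values are whole numbers, as in the 0.0 case the code comment mentions.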

ravinkohli · 2021-05-07T11:53:51Z

test/test_data/test_feature_validator.py

+        if transformed_X_test[column].isna().all():
+            null_columns.append(column)
+
+    assert null_columns == [1]


Why is null_columns not empty here?

Because in the test set, column 'A' has only null values, yet it still has a numeric type. Column 'C', on the other hand, holds the value 1 throughout the entire column, since the validator correctly detects from the train set that it has 2 categories.

The imputer is not used here for numerical values.

I see. Sorry, I got confused because I expected none of the columns to be null; I don't know why.
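The check discussed in this thread can be reproduced in isolation. The sketch below uses a hypothetical frame with integer column labels in place of the real validator output, and assumes, as stated above, that no imputation is applied to the numerical column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the transformed test set: column 1 is numeric
# and entirely missing, column 0 holds a constant encoded category.
transformed_X_test = pd.DataFrame({
    0: [1, 1, 1],
    1: [np.nan, np.nan, np.nan],
})

# Collect every column whose values are all missing.
null_columns = []
for column in transformed_X_test.columns:
    if transformed_X_test[column].isna().all():
        null_columns.append(column)

print(null_columns)
```

Because the all-null column is numeric and nothing fills it in at this stage, it stays null after the transform, which is why the list is not empty.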

* adding missing method from base_feature_validator
* First try at a fix, removing redundant code
* Fix bug
* Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking
* Fixing flake 8 failing
* Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation
* flake8 fix
* Bug fix
* Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change
* flake8 fix
* Addressing Ravin's comments

ArlindKadra added 5 commits May 4, 2021 19:57

adding missing method from base_feature_validator

b035bb7

First try at a fix, removing redundant code

ef6db27

Fix bug

2d34f11

Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking

098d4dc

Fixing flake 8 failing

24c2f6f

ArlindKadra requested review from franchuterivera and ravinkohli May 5, 2021 00:20

ravinkohli suggested changes May 5, 2021

ArlindKadra added 2 commits May 5, 2021 23:05

Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation

a43a115

flake8 fix

f5c916c

nabenabe0928 self-assigned this May 6, 2021

ArlindKadra added 3 commits May 6, 2021 21:01

Bug fix

da48086

Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change

aa12c21

flake8 fix

979e2f3

ArlindKadra requested a review from ravinkohli May 7, 2021 10:11

ArlindKadra unassigned nabenabe0928 May 7, 2021

ravinkohli reviewed May 7, 2021

Addressing Ravin's comments

95de9b7

ravinkohli reviewed May 7, 2021

ravinkohli approved these changes May 7, 2021

ravinkohli merged commit b0b67ea into refactor_development_regularization_cocktails May 7, 2021

github-actions bot pushed a commit that referenced this pull request May 7, 2021

Arlind Kadra: Fixing issues with imbalanced datasets (#197)

b951cc4

ravinkohli deleted the fix#195 branch August 9, 2022 13:52
