Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing issues with imbalanced datasets #197

Merged
merged 11 commits into from
May 7, 2021

Conversation

ArlindKadra
Copy link

This PR fixes issue #195

Copy link
Contributor

@ravinkohli ravinkohli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Arlind, thanks for the PR. Its essential in the corner case when we have all nans in train but not in test and vice versa. But, could you add tests for the validator to ensure we can handle the case properly. It would be great if we could test the two cases- 1. where we have all NaNs in train and not in test and 2. where we have all NaNs in test but not in train

@ArlindKadra
Copy link
Author

Hey Arlind, thanks for the PR. Its essential in the corner case when we have all nans in train but not in test and vice versa. But, could you add tests for the validator to ensure we can handle the case properly. It would be great if we could test the two cases- 1. where we have all NaNs in train and not in test and 2. where we have all NaNs in test but not in train

Yep, I will do just that today. But overall, I wanted to get a quick idea on how the problem is handled now, in case I missed something on the fix.

@nabenabe0928 nabenabe0928 self-assigned this May 6, 2021
# for the test set, if we have columns with only null values
# they will probably have a numeric type. If these columns were not
# with only null values in the train set, they should be converted
# to the category type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# to the category type.
# to the type it had while fitting.

I don't think we should wait for the tests to merge in case you choose to incorporate this suggestion

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, we can also merge it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also changed the comment accordingly.

@@ -414,7 +422,7 @@ def infer_objects(self, X: pd.DataFrame) -> pd.DataFrame:
# In the case train data was interpreted as int
# and test data was interpreted as float, because of 0.0
# for example, honor training data
X[key] = X[key].applymap(np.int64)
X[key] = X[key].astype(np.int64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we remove the if-else statement? as we are treating them equally

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, nice catch.

if transformed_X_test[column].isna().all():
null_columns.append(column)

assert null_columns == [1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is null_columns not empty here?

Copy link
Author

@ArlindKadra ArlindKadra May 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because for the test set, column 'A' will have only null values, however, it will have a numeric type. While column 'C' will have the 1 as a value through the entire column as it will detect correctly that it has 2 categories from the train set.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The imputer is not used here for numerical values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, sorry I got confused as I expected none to be Null, I don't know why

@ravinkohli ravinkohli merged commit b0b67ea into refactor_development_regularization_cocktails May 7, 2021
ravinkohli pushed a commit that referenced this pull request Dec 8, 2021
* adding missing method from base_feature_validator

* First try at a fix, removing redundant code

* Fix bug

* Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking

* Fixing flake 8 failing

* Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation

* flake8 fix

* Bug fix

* Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change

* flake8 fix

* Addressing Ravin's comments
ravinkohli pushed a commit that referenced this pull request Dec 8, 2021
* adding missing method from base_feature_validator

* First try at a fix, removing redundant code

* Fix bug

* Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking

* Fixing flake 8 failing

* Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation

* flake8 fix

* Bug fix

* Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change

* flake8 fix

* Addressing Ravin's comments
ravinkohli pushed a commit that referenced this pull request Dec 21, 2021
* adding missing method from base_feature_validator

* First try at a fix, removing redundant code

* Fix bug

* Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking

* Fixing flake 8 failing

* Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation

* flake8 fix

* Bug fix

* Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change

* flake8 fix

* Addressing Ravin's comments
ravinkohli pushed a commit that referenced this pull request Jan 24, 2022
* adding missing method from base_feature_validator

* First try at a fix, removing redundant code

* Fix bug

* Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking

* Fixing flake 8 failing

* Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation

* flake8 fix

* Bug fix

* Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change

* flake8 fix

* Addressing Ravin's comments
ravinkohli pushed a commit that referenced this pull request Jan 28, 2022
* adding missing method from base_feature_validator

* First try at a fix, removing redundant code

* Fix bug

* Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking

* Fixing flake 8 failing

* Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation

* flake8 fix

* Bug fix

* Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change

* flake8 fix

* Addressing Ravin's comments
ravinkohli pushed a commit that referenced this pull request Feb 28, 2022
* adding missing method from base_feature_validator

* First try at a fix, removing redundant code

* Fix bug

* Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking

* Fixing flake 8 failing

* Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation

* flake8 fix

* Bug fix

* Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change

* flake8 fix

* Addressing Ravin's comments
ravinkohli pushed a commit that referenced this pull request Feb 28, 2022
* adding missing method from base_feature_validator

* First try at a fix, removing redundant code

* Fix bug

* Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking

* Fixing flake 8 failing

* Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation

* flake8 fix

* Bug fix

* Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change

* flake8 fix

* Addressing Ravin's comments
ravinkohli pushed a commit that referenced this pull request Mar 9, 2022
* adding missing method from base_feature_validator

* First try at a fix, removing redundant code

* Fix bug

* Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking

* Fixing flake 8 failing

* Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation

* flake8 fix

* Bug fix

* Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change

* flake8 fix

* Addressing Ravin's comments
ravinkohli pushed a commit to ravinkohli/Auto-PyTorch that referenced this pull request Apr 12, 2022
* adding missing method from base_feature_validator

* First try at a fix, removing redundant code

* Fix bug

* Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking

* Fixing flake 8 failing

* Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation

* flake8 fix

* Bug fix

* Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change

* flake8 fix

* Addressing Ravin's comments
ravinkohli pushed a commit that referenced this pull request Jul 26, 2022
* adding missing method from base_feature_validator

* First try at a fix, removing redundant code

* Fix bug

* Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking

* Fixing flake 8 failing

* Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation

* flake8 fix

* Bug fix

* Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change

* flake8 fix

* Addressing Ravin's comments
@ravinkohli ravinkohli deleted the fix#195 branch August 9, 2022 13:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants