-
Notifications
You must be signed in to change notification settings - Fork 289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing issues with imbalanced datasets #197
Fixing issues with imbalanced datasets #197
Conversation
…ed because X was a numpy array at the time of checking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Arlind, thanks for the PR. Its essential in the corner case when we have all nans in train but not in test and vice versa. But, could you add tests for the validator to ensure we can handle the case properly. It would be great if we could test the two cases- 1. where we have all NaNs in train and not in test and 2. where we have all NaNs in test but not in train
Yep, I will do just that today. But overall, I wanted to get a quick idea on how the problem is handled now, in case I missed something on the fix. |
… to check the implementation
…what happens at the validator, so the types do not change
# for the test set, if we have columns with only null values | ||
# they will probably have a numeric type. If these columns were not | ||
# with only null values in the train set, they should be converted | ||
# to the category type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# to the category type. | |
# to the type it had while fitting. |
I don't think we should wait for the tests to merge in case you choose to incorporate this suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, we can also merge it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also changed the comment accordingly.
@@ -414,7 +422,7 @@ def infer_objects(self, X: pd.DataFrame) -> pd.DataFrame: | |||
# In the case train data was interpreted as int | |||
# and test data was interpreted as float, because of 0.0 | |||
# for example, honor training data | |||
X[key] = X[key].applymap(np.int64) | |||
X[key] = X[key].astype(np.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we remove the if-else statement? as we are treating them equally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, nice catch.
if transformed_X_test[column].isna().all(): | ||
null_columns.append(column) | ||
|
||
assert null_columns == [1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is null_columns not empty here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because for the test set, column 'A' will have only null values, however, it will have a numeric type. While column 'C' will have the 1 as a value through the entire column as it will detect correctly that it has 2 categories from the train set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The imputer is not used here for numerical values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, sorry I got confused as I expected none to be Null, I don't know why
* adding missing method from base_feature_validator * First try at a fix, removing redundant code * Fix bug * Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking * Fixing flake 8 failing * Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation * flake8 fix * Bug fix * Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change * flake8 fix * Addressing Ravin's comments
* adding missing method from base_feature_validator * First try at a fix, removing redundant code * Fix bug * Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking * Fixing flake 8 failing * Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation * flake8 fix * Bug fix * Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change * flake8 fix * Addressing Ravin's comments
* adding missing method from base_feature_validator * First try at a fix, removing redundant code * Fix bug * Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking * Fixing flake 8 failing * Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation * flake8 fix * Bug fix * Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change * flake8 fix * Addressing Ravin's comments
* adding missing method from base_feature_validator * First try at a fix, removing redundant code * Fix bug * Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking * Fixing flake 8 failing * Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation * flake8 fix * Bug fix * Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change * flake8 fix * Addressing Ravin's comments
* adding missing method from base_feature_validator * First try at a fix, removing redundant code * Fix bug * Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking * Fixing flake 8 failing * Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation * flake8 fix * Bug fix * Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change * flake8 fix * Addressing Ravin's comments
* adding missing method from base_feature_validator * First try at a fix, removing redundant code * Fix bug * Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking * Fixing flake 8 failing * Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation * flake8 fix * Bug fix * Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change * flake8 fix * Addressing Ravin's comments
* adding missing method from base_feature_validator * First try at a fix, removing redundant code * Fix bug * Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking * Fixing flake 8 failing * Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation * flake8 fix * Bug fix * Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change * flake8 fix * Addressing Ravin's comments
* adding missing method from base_feature_validator * First try at a fix, removing redundant code * Fix bug * Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking * Fixing flake 8 failing * Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation * flake8 fix * Bug fix * Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change * flake8 fix * Addressing Ravin's comments
* adding missing method from base_feature_validator * First try at a fix, removing redundant code * Fix bug * Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking * Fixing flake 8 failing * Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation * flake8 fix * Bug fix * Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change * flake8 fix * Addressing Ravin's comments
* adding missing method from base_feature_validator * First try at a fix, removing redundant code * Fix bug * Updating unit test typo, fixing bug where the data type was not checked because X was a numpy array at the time of checking * Fixing flake 8 failing * Bug fix, implementation update for imbalanced datasets and unit tests to check the implementation * flake8 fix * Bug fix * Making the conversion to dataframe in the unit tests consistent with what happens at the validator, so the types do not change * flake8 fix * Addressing Ravin's comments
This PR fixes issue #195