
Added DropNullColumn transformer to remove columns that contain only nulls #1115

Open · wants to merge 52 commits into base: main
Conversation

@rcap107 (Contributor) commented Oct 17, 2024

fixes #1110

DropNullColumn (provisional name) takes a column as input and drops it if all of its values are null or NaN. TableVectorizer was also updated with a drop_null_columns flag, set to False by default; when the flag is set to True, DropNullColumn is added as a processing step for all columns.

I've also added drop and is_all_null to _common.py, though I'm not sure they should go there. Maybe is_all_null can stay in the DropNullColumn file.

The test I wrote passes, but I'm not sure it's thorough enough.

The documentation is still missing.
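As a rough sketch of the behavior described above (hypothetical code, not the implementation in this PR; the class and method names are only assumptions based on this description), a single-column transformer that signals "drop me" for all-null columns could look like:

```python
import pandas as pd

class DropNullColumn:
    """Hypothetical sketch: a single-column transformer that returns an
    empty list (meaning "drop this column") when every value is null/NaN,
    and passes the column through unchanged otherwise."""

    def fit_transform(self, column, y=None):
        # pandas counts both None and NaN as missing in isna()
        self.drop_ = bool(column.isna().all())
        return self.transform(column)

    def transform(self, column):
        return [] if self.drop_ else column
```

Returning an empty list for a dropped column is a convenient convention, because callers can then compare the output directly against [].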

@TheooJ (Contributor) left a comment:

Hi @rcap107 ! I made a first pass and have a few comments:

  • Personally, I like the name DropNullColumn; I think it's clear what it does!
  • I would rename the file _drop_null.py
  • Make sure you run pre-commit run --all-files before pushing; it seems to be what's breaking the CI for you here
  • I think is_all_null could be placed in the DropNullColumn file if it's only used there for now, but I could also see it being in _common.py

(Resolved review threads on skrub/tests/test_dropnulls.py and skrub/_dataframe/_common.py.)
@@ -1187,3 +1208,15 @@ def with_columns(df, **new_cols):
cols = {col_name: col(df, col_name) for col_name in column_names(df)}
cols.update({n: make_column_like(df, c, n) for n, c in new_cols.items()})
return make_dataframe_like(df, cols)

@dispatch
def drop(obj, col):
A contributor commented:

I don’t know if drop is necessary, you could directly use skrub selectors:
df = s.select(df, ~s.cols(col))

@@ -191,6 +192,9 @@ class TableVectorizer(TransformerMixin, BaseEstimator):
similar functionality to what is offered by scikit-learn's
:class:`~sklearn.compose.ColumnTransformer`.

drop_null_columns : bool, default=False
A contributor commented:

Do we want it to be True by default?

@rcap107 (author) replied:

That should be discussed with others, I think.

A member commented:

I vote for True by default -- there's nothing we can learn from a completely empty column.

If it is False by default, I think it should be set to True in the tabular_learner.

(Resolved review thread on skrub/tests/test_dropnulls.py.)
main_table_dropped = ns.drop(main_table_dropped, "value_nan")

# Don't drop null columns
tv = TableVectorizer(drop_null_columns=False)
A contributor commented:

This test needs to go in the TV test file IMO.

@rcap107 (author) replied:

I can move it 👍

@rcap107 (Contributor, Author) commented Oct 21, 2024

Hi @TheooJ, thanks a lot for the comments! I'll address them and update the PR 👍


# assert_array_equal(
# sbd.to_numpy(sbd.col(drop_null_table, "value_almost_null")),
# np.array(["almost", None, None]),
@rcap107 (author) commented:

Not sure how to write this check so that it works with either pandas or polars.

A contributor replied:

You could use df_module as a fixture in the test by adding it to the arguments, then compare series instead of numpy arrays:

df_module.assert_column_equal(
    sbd.col(drop_null_table, "value_almost_null"),
    df_module.make_column("value_almost_null", ["almost", None, None]),
)

The contributor added:

The test would look like:

def test_single_column(drop_null_table, df_module):
    """Check that null columns are dropped and non-null columns are kept."""
    dn = DropNullColumn()
    assert dn.fit_transform(drop_null_table["value_nan"]) == []
    assert dn.fit_transform(drop_null_table["value_null"]) == []

    df_module.assert_column_equal(
        sbd.col(drop_null_table, "idx"), df_module.make_column("idx", [1, 2, 3])
    )

    df_module.assert_column_equal(
        sbd.col(drop_null_table, "value_almost_nan"),
        df_module.make_column("value_almost_nan", [2.5, np.nan, np.nan]),
    )

    df_module.assert_column_equal(
        sbd.col(drop_null_table, "value_almost_null"),
        df_module.make_column("value_almost_null", ["almost", None, None]),
    )

A contributor added:

This also works around the fact that, depending on the pandas version, null values are not treated the same way.

@jeromedockes (Member) commented:

The failure in the min-deps environment is not related to this PR; the fix is in #1122.

(Resolved review threads on skrub/_dataframe/_common.py and skrub/_drop_null.py.)
@@ -536,6 +542,9 @@ def add_step(steps, transformer, cols, allow_reject=False):
cols = s.all() - self._specific_columns

self._preprocessors = [CheckInputDataFrame()]
if self.drop_null_columns:
add_step(self._preprocessors, DropNullColumn(), cols, allow_reject=True)
A member commented:

  • We may want to insert it after CleanNullStrings, so that if the column becomes full of nulls after converting "N/A" to null it will be dropped. Also, it's not important, but your transformer never raises a RejectColumn exception, so allow_reject has no effect; you don't need it here and can leave the default.

@rcap107 (author) replied:

I added it after CleanNullStrings, but I think I did it in an ugly way; maybe it can be fixed.
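As an aside, one generic way to splice a step in right after another (a made-up sketch, not skrub's actual add_step machinery; every name here is illustrative) is to locate the predecessor by class name:

```python
def insert_after(steps, new_step, predecessor_name):
    """Insert new_step right after the first step whose class name matches
    predecessor_name; append at the end if no such step exists."""
    for i, step in enumerate(steps):
        if type(step).__name__ == predecessor_name:
            steps.insert(i + 1, new_step)
            return steps
    steps.append(new_step)
    return steps

# Illustrative stand-ins for the real preprocessing steps.
class CheckInputDataFrame: ...
class CleanNullStrings: ...
class DropNullColumn: ...

steps = [CheckInputDataFrame(), CleanNullStrings()]
insert_after(steps, DropNullColumn(), "CleanNullStrings")
```

This keeps the ordering logic in one place instead of special-casing the insertion point at each call site.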

(Resolved review threads on skrub/_dataframe/_common.py and skrub/_drop_null.py.)
def _is_all_null_polars(col):
    if col.dtype == pl.Null:
        return True
    # col is non numeric
A member commented:

Not sure I understand the comment.

@rcap107 (author) replied:

It was a leftover from the previous version; I rewrote the comments.

(Resolved review threads on skrub/_dataframe/tests/test_common.py and skrub/_drop_null_column.py.)
Comment on lines 39 to 41

check_is_fitted(
    self,
)
A member suggested:
Suggested change:

- check_is_fitted(
-     self,
- )
+ check_is_fitted(self)

(Resolved review threads on skrub/_drop_null_column.py, skrub/_table_vectorizer.py, and skrub/tests/test_drop_null_column.py.)
@@ -506,8 +528,11 @@ def test_changing_types(X_train, X_test, expected_X_out):
"""
table_vec = TableVectorizer(
# only extract the total seconds
datetime=DatetimeEncoder(resolution=None)
datetime=DatetimeEncoder(resolution=None),
# True by default
@rcap107 (author) commented:

I set this to False to keep the original behavior, with no DropNullColumn. Given that the default value is True, should I change the test so that the "default behavior" is what is tested here?

A member replied:

I think it's OK the way you did it.

@rcap107 marked this pull request as ready for review on October 24, 2024 at 09:56.
@@ -389,7 +394,7 @@ class TableVectorizer(TransformerMixin, BaseEstimator):
``ToDatetime()``:

>>> vectorizer.all_processing_steps_
{'A': [Drop()], 'B': [OrdinalEncoder()], 'C': [CleanNullStrings(), ToFloat32(), PassThrough(), {'C': ToFloat32()}]}
{'A': [Drop()], 'B': [OrdinalEncoder()], 'C': [CleanNullStrings(), DropNullColumn(), ToFloat32(), PassThrough(), {'C': ToFloat32()}]}
A member commented:

I wonder if we should have an "if" in there, e.g. call it DropColumnIfNull, to make it clearer that this may be a no-op when reading the list of transformations.

@rcap107 (author) replied:

I like the idea, it's definitely clearer. I renamed the object.

@jeromedockes (Member) left a comment:

thanks a lot @rcap107 !! LGTM once the last couple of nitpicks are addressed 🎉

(Resolved review thread on skrub/_dataframe/tests/test_common.py.)
def fit_transform(self, column, y=None):
    """Fit the encoder and transform a column.

    Args:
A member commented:
Can you check the formatting of the other estimators in the library? Here the section header would look like

Parameters
----------

It's not super important here, as this is a private class, but if we decide to make it public it will matter, because Sphinx relies on this formatting to produce the reference HTML documentation.

Development

Successfully merging this pull request may close these issues.

Add a DropEmpty flag to tabular_learner
3 participants