
Added DropNullColumn transformer to remove columns that contain only nulls #1115

Open · wants to merge 52 commits into base: main
Conversation

@rcap107 (Contributor) commented Oct 17, 2024

fixes #1110

DropNullColumn (provisional name) takes a column as input and drops it if all of its values are null or NaN. TableVectorizer was also updated with a drop_null_columns flag, set to False by default; when the flag is set to True, DropNullColumn is added as a processing step for all columns.

I've also added drop and is_all_null to _common.py, though I'm not sure they should go there. Maybe is_all_null can stay in the DropNullColumn file.

The test I wrote passes, but I'm not sure it's thorough enough.

The documentation is still missing.
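As a rough sketch of the behavior described above (hypothetical code, not the implementation in this PR; the class and method names are only assumptions based on this description), a single-column transformer that signals "drop me" for all-null columns could look like:

```python
import pandas as pd

class DropNullColumn:
    """Hypothetical sketch: a single-column transformer that returns an
    empty list (meaning "drop this column") when every value is null/NaN,
    and passes the column through unchanged otherwise."""

    def fit_transform(self, column, y=None):
        # pandas counts both None and NaN as missing in isna()
        self.drop_ = bool(column.isna().all())
        return self.transform(column)

    def transform(self, column):
        return [] if self.drop_ else column
```

Returning an empty list for a dropped column is a convenient convention, because callers can then compare the output directly against [].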

@TheooJ (Contributor) left a comment:

Hi @rcap107 ! I made a first pass and have a few comments:

  • Personally, I like the name DropNullColumn; I think it's clear what it does!
  • I would rename the file _drop_null.py
  • Make sure you run pre-commit run --all-files before pushing; it seems to be what's breaking the CI for you here
  • I think is_all_null could be placed in the DropNullColumn file if it's only used there for now, but I could also see it being in _common.py

(Resolved review threads on skrub/tests/test_dropnulls.py and skrub/_dataframe/_common.py.)
@@ -1187,3 +1208,15 @@ def with_columns(df, **new_cols):
cols = {col_name: col(df, col_name) for col_name in column_names(df)}
cols.update({n: make_column_like(df, c, n) for n, c in new_cols.items()})
return make_dataframe_like(df, cols)

@dispatch
def drop(obj, col):
A contributor commented:

I don’t know if drop is necessary, you could directly use skrub selectors:
df = s.select(df, ~s.cols(col))

@@ -191,6 +192,9 @@ class TableVectorizer(TransformerMixin, BaseEstimator):
similar functionality to what is offered by scikit-learn's
:class:`~sklearn.compose.ColumnTransformer`.

drop_null_columns : bool, default=False
A contributor commented:

Do we want it to be True by default?

@rcap107 (author) replied:

That should be discussed with others, I think.

A member commented:

I vote for True by default -- there's nothing we can learn from a completely empty column.

If it is False by default, I think it should be set to True in the tabular_learner.

(Resolved review thread on skrub/tests/test_dropnulls.py.)
main_table_dropped = ns.drop(main_table_dropped, "value_nan")

# Don't drop null columns
tv = TableVectorizer(drop_null_columns=False)
A contributor commented:

This test needs to go in the TV test file IMO.

@rcap107 (author) replied:

I can move it 👍

@rcap107 (Contributor, Author) commented Oct 21, 2024

Hi @TheooJ, thanks a lot for the comments! I'll address them and update the PR 👍


# assert_array_equal(
# sbd.to_numpy(sbd.col(drop_null_table, "value_almost_null")),
# np.array(["almost", None, None]),
@rcap107 (author) commented:

Not sure how to write this check so that it works with either pandas or polars.

A contributor replied:

You could use df_module as a fixture in the test by adding it to the arguments, then compare series instead of numpy arrays:

df_module.assert_column_equal(
    sbd.col(drop_null_table, "value_almost_null"),
    df_module.make_column("value_almost_null", ["almost", None, None]),
)

The contributor added:

The test would look like:

def test_single_column(drop_null_table, df_module):
    """Check that null columns are dropped and non-null columns are kept."""
    dn = DropNullColumn()
    assert dn.fit_transform(drop_null_table["value_nan"]) == []
    assert dn.fit_transform(drop_null_table["value_null"]) == []

    df_module.assert_column_equal(
        sbd.col(drop_null_table, "idx"), df_module.make_column("idx", [1, 2, 3])
    )

    df_module.assert_column_equal(
        sbd.col(drop_null_table, "value_almost_nan"),
        df_module.make_column("value_almost_nan", [2.5, np.nan, np.nan]),
    )

    df_module.assert_column_equal(
        sbd.col(drop_null_table, "value_almost_null"),
        df_module.make_column("value_almost_null", ["almost", None, None]),
    )

A contributor added:

This also works around the fact that, depending on the pandas version, null values are not treated the same way.

@jeromedockes (Member) commented:

The failure in the min-deps environment is not related to this PR; the fix is in #1122.

(Resolved review threads on skrub/_dataframe/_common.py and skrub/_drop_null.py.)
@@ -536,6 +542,9 @@ def add_step(steps, transformer, cols, allow_reject=False):
cols = s.all() - self._specific_columns

self._preprocessors = [CheckInputDataFrame()]
if self.drop_null_columns:
add_step(self._preprocessors, DropNullColumn(), cols, allow_reject=True)
A member commented:

  • We may want to insert it after CleanNullStrings, so that if the column becomes full of nulls after converting "N/A" to null it will be dropped. Also, it's not important, but your transformer never raises a RejectColumn exception, so allow_reject has no effect; you don't need it here and can leave the default.

@rcap107 (author) replied:

I added it after CleanNullStrings, but I think I did it in an ugly way; maybe it can be fixed.
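As an aside, one generic way to splice a step in right after another (a made-up sketch, not skrub's actual add_step machinery; every name here is illustrative) is to locate the predecessor by class name:

```python
def insert_after(steps, new_step, predecessor_name):
    """Insert new_step right after the first step whose class name matches
    predecessor_name; append at the end if no such step exists."""
    for i, step in enumerate(steps):
        if type(step).__name__ == predecessor_name:
            steps.insert(i + 1, new_step)
            return steps
    steps.append(new_step)
    return steps

# Illustrative stand-ins for the real preprocessing steps.
class CheckInputDataFrame: ...
class CleanNullStrings: ...
class DropNullColumn: ...

steps = [CheckInputDataFrame(), CleanNullStrings()]
insert_after(steps, DropNullColumn(), "CleanNullStrings")
```

This keeps the ordering logic in one place instead of special-casing the insertion point at each call site.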

(Resolved review threads on skrub/_dataframe/_common.py and skrub/_drop_null.py.)
def _is_all_null_polars(col):
    if col.dtype == pl.Null:
        return True
    # col is non numeric
A member commented:

Not sure I understand the comment.

@rcap107 (author) replied:

It was a leftover from the previous version; I rewrote the comments.

(Resolved review threads on skrub/_dataframe/tests/test_common.py and skrub/_drop_null_column.py.)
Comment on lines 39 to 41

check_is_fitted(
    self,
)
A member suggested:
Suggested change:

- check_is_fitted(
-     self,
- )
+ check_is_fitted(self)

(Resolved review threads on skrub/_drop_null_column.py, skrub/_table_vectorizer.py, and skrub/tests/test_drop_null_column.py.)
@@ -506,8 +528,11 @@ def test_changing_types(X_train, X_test, expected_X_out):
"""
table_vec = TableVectorizer(
# only extract the total seconds
datetime=DatetimeEncoder(resolution=None)
datetime=DatetimeEncoder(resolution=None),
# True by default
@rcap107 (author) commented:

I set this to False to keep the original behavior, with no DropNullColumn. Given that the default value is True, should I change the test so that the "default behavior" is what is tested here?

A member replied:

I think it's OK the way you did it.

@rcap107 marked this pull request as ready for review on October 24, 2024 at 09:56.
@@ -389,7 +394,7 @@ class TableVectorizer(TransformerMixin, BaseEstimator):
``ToDatetime()``:

>>> vectorizer.all_processing_steps_
{'A': [Drop()], 'B': [OrdinalEncoder()], 'C': [CleanNullStrings(), ToFloat32(), PassThrough(), {'C': ToFloat32()}]}
{'A': [Drop()], 'B': [OrdinalEncoder()], 'C': [CleanNullStrings(), DropNullColumn(), ToFloat32(), PassThrough(), {'C': ToFloat32()}]}
A member commented:

I wonder if we should have an "if" in there, e.g. call it DropColumnIfNull, to make it clearer that this may be a no-op when reading the list of transformations.

@rcap107 (author) replied:

I like the idea, it's definitely clearer. I renamed the object.

@jeromedockes (Member) left a comment:

thanks a lot @rcap107 !! LGTM once the last couple of nitpicks are addressed 🎉

(Resolved review thread on skrub/_dataframe/tests/test_common.py.)
def fit_transform(self, column, y=None):
    """Fit the encoder and transform a column.

    Args:
A member commented:
Can you check the formatting of the other estimators in the library? Here the section header would look like

Parameters
----------

It's not super important here, as this is a private class, but if we decide to make it public it will matter, because Sphinx relies on this formatting to produce the reference HTML documentation.

Development

Successfully merging this pull request may close these issues.

Add a DropEmpty flag to tabular_learner
3 participants