Added DropNullColumn transformer to remove columns that contain only nulls #1115

Open · wants to merge 52 commits into base: main
Changes from 23 commits
Commits (52)
bee630f  Adding code for DropNull — rcap107, Oct 17, 2024
ccc9a02  Fixed line — rcap107, Oct 17, 2024
d3f9c90  renamed script — rcap107, Oct 17, 2024
9d42b95  Added new common functions for drop and is_all_null — rcap107, Oct 17, 2024
f249982  Fixed code — rcap107, Oct 17, 2024
a1caf39  Added test for dropcol — rcap107, Oct 17, 2024
b0e3235  Removing dev script — rcap107, Oct 17, 2024
90be825  Update skrub/tests/test_dropnulls.py — rcap107, Oct 21, 2024
55764a8  Renamed file — rcap107, Oct 21, 2024
c8fdaaa  Renamed file — rcap107, Oct 21, 2024
0cdc0bd  Formatting — rcap107, Oct 21, 2024
34c0095  Merge branch 'drop_null_columns' of https://github.com/rcap107/skrub … — rcap107, Oct 21, 2024
430c8e3  Rename file — rcap107, Oct 21, 2024
80bd408  Added docstrings — rcap107, Oct 21, 2024
e2ca33f  Fixing imports and refactoring names — rcap107, Oct 21, 2024
4dbba09  Formatting — rcap107, Oct 21, 2024
7d6f8ce  Updated changelog. — rcap107, Oct 21, 2024
4771d18  Formatting — rcap107, Oct 21, 2024
f0b521a  Removing function because it was not needed — rcap107, Oct 21, 2024
ea9893b  Updated test — rcap107, Oct 21, 2024
c73db7e  Merge branch 'main' into drop_null_columns — rcap107, Oct 21, 2024
09cf9c7  Improving tests — rcap107, Oct 21, 2024
4e4f255  Merge branch 'drop_null_columns' of https://github.com/rcap107/skrub … — rcap107, Oct 21, 2024
754e2ef  Updated test — rcap107, Oct 22, 2024
acafac6  Merge remote-tracking branch 'main_repo/main' into drop_null_columns — rcap107, Oct 22, 2024
4b0aa1c  Fixed is_all_null based on comments — rcap107, Oct 22, 2024
35f8909  Renaming files for consistency — rcap107, Oct 22, 2024
b4e419f  Removing init — rcap107, Oct 22, 2024
75f1110  Moving DropNullColumn after CleanNullStrings — rcap107, Oct 22, 2024
e499dc1  Moved check on drop from transform to fit_transform — rcap107, Oct 22, 2024
c296829  Fixed changelog — rcap107, Oct 22, 2024
ee6b7b5  Moved tests and improved coverage — rcap107, Oct 22, 2024
92210b7  Moved tv test to the proper file — rcap107, Oct 24, 2024
4cad44a  Updated test to make it make sense — rcap107, Oct 24, 2024
836a636  Improving comment — rcap107, Oct 24, 2024
4ec95d6  Improving comment — rcap107, Oct 24, 2024
3c25b84  Removed unneeded code — rcap107, Oct 24, 2024
8638516  Changed default value to True — rcap107, Oct 24, 2024
e70f513  Formatting — rcap107, Oct 24, 2024
6083567  Added back code that should have been there in the first place — rcap107, Oct 24, 2024
a543044  Changed the default parameter — rcap107, Oct 24, 2024
92f5430  Changed to use df interface — rcap107, Oct 24, 2024
24b18ba  Merge remote-tracking branch 'main_repo/main' into drop_null_columns — rcap107, Oct 24, 2024
62ef9d6  Fixed docstring. — rcap107, Oct 24, 2024
53cb8bd  Update skrub/_drop_null_column.py — jeromedockes, Oct 24, 2024
7af96ca  Renaming transformer to DropColumnIfNull. — rcap107, Oct 25, 2024
11908b3  Merge branch 'drop_null_columns' of https://github.com/rcap107/skrub … — rcap107, Oct 25, 2024
2499a37  Update skrub/_dataframe/tests/test_common.py — rcap107, Oct 29, 2024
548b792  Removed a coverage file — rcap107, Oct 29, 2024
58feaed  Fix formatting of docstring — rcap107, Oct 29, 2024
5a6539c  Formatting — rcap107, Oct 29, 2024
36c46d4  Whoops — rcap107, Oct 29, 2024
4 changes: 4 additions & 0 deletions CHANGES.rst
@@ -60,6 +60,10 @@ Minor changes
contained datetimes with time zones and missing values; this has been fixed in
:pr:`1114` by :user:`Jérôme Dockès <jeromedockes>`.

* Added a `DropNullColumn` transformer that drops columns that contain only null
  values. :pr:`1115` by :user:`Riccardo Cappuzzo <riccardocappuzzo>`.


Release 0.3.1
=============

22 changes: 22 additions & 0 deletions skrub/_dataframe/_common.py
@@ -73,6 +73,7 @@
"to_datetime",
"is_categorical",
"to_categorical",
"is_all_null",
#
# Inspecting, selecting and modifying values
#
@@ -825,6 +826,27 @@
return _cast_polars(col, pl.Categorical())


@dispatch
def is_all_null(col):
    raise NotImplementedError()


@is_all_null.specialize("pandas", argument_type="Column")
def _is_all_null_pandas(col):
return bool(col.isna().all())
rcap107 marked this conversation as resolved.
Show resolved Hide resolved


@is_all_null.specialize("polars", argument_type="Column")
def _is_all_null_polars(col):
if col.dtype == pl.Null:
return True

Check warning on line 842 in skrub/_dataframe/_common.py

View check run for this annotation

Codecov / codecov/patch

skrub/_dataframe/_common.py#L842

Added line #L842 was not covered by tests
elif col.dtype.is_numeric() and col.is_nan().all():
rcap107 marked this conversation as resolved.
Show resolved Hide resolved
return True

Check warning on line 844 in skrub/_dataframe/_common.py

View check run for this annotation

Codecov / codecov/patch

skrub/_dataframe/_common.py#L844

Added line #L844 was not covered by tests
# col is non numeric
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I understand the comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was a leftover from the previous version, I rewrote the comments

elif col.null_count() == col.len():
return True

Check warning on line 847 in skrub/_dataframe/_common.py

View check run for this annotation

Codecov / codecov/patch

skrub/_dataframe/_common.py#L847

Added line #L847 was not covered by tests

rcap107 marked this conversation as resolved.
Show resolved Hide resolved

#
# Inspecting, selecting and modifying values
# ==========================================
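As a rough illustration (not part of the changeset), here is a minimal sketch of how the new `is_all_null` helper could be exercised on both backends, assuming pandas and polars are installed; the `sbd` alias mirrors the one used in the tests below:

import numpy as np
import pandas as pd
import polars as pl

from skrub import _dataframe as sbd

# pandas: a column containing only NaN is reported as all-null
assert sbd.is_all_null(pd.Series([np.nan, np.nan]))
assert not sbd.is_all_null(pd.Series([1.0, np.nan]))

# polars: a Null-typed column, an all-NaN numeric column, and a column whose
# null count equals its length are all reported as all-null
assert sbd.is_all_null(pl.Series([None, None]))
assert sbd.is_all_null(pl.Series([float("nan"), float("nan")]))
assert not sbd.is_all_null(pl.Series(["almost", None]))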
54 changes: 54 additions & 0 deletions skrub/_drop_null.py
@@ -0,0 +1,54 @@
# drop columns that contain all null values
from sklearn.utils.validation import check_is_fitted

from . import _dataframe as sbd
from ._on_each_column import SingleColumnTransformer

__all__ = ["DropNullColumn"]


class DropNullColumn(SingleColumnTransformer):
    """Drop a single column if it contains only null values."""

    def __init__(self):
        super().__init__()
        self._is_fitted = False

    def fit_transform(self, column, y=None):
        """Fit the encoder and transform a column.

        Args:
            column : Pandas or Polars series. The input column to check.
            y : None. Ignored.

        Returns:
            The input column, or an empty list if the column contains only null values.
        """
        del y

        self._is_fitted = True
        return self.transform(column)

    def transform(self, column):
        """Transform a column.

        Args:
            column : Pandas or Polars series. The input column to check.

        Returns:
            The input column, or an empty list if the column contains only null values.
        """
        check_is_fitted(self)

        if sbd.is_all_null(column):
            return []
        else:
            return column

    def __sklearn_is_fitted__(self):
        """Check fitted status and return a Boolean value."""
        return hasattr(self, "_is_fitted") and self._is_fitted
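For context (not part of the changeset), a minimal usage sketch of the transformer on a single column, assuming a pandas Series as input; the column names are only illustrative:

import numpy as np
import pandas as pd

from skrub._drop_null import DropNullColumn

dn = DropNullColumn()

# An all-null column is dropped: fit_transform returns an empty list
assert dn.fit_transform(pd.Series([np.nan, np.nan], name="all_null")) == []

# A column with at least one non-null value is passed through unchanged
out = dn.fit_transform(pd.Series([1.0, np.nan], name="kept"))
assert out.name == "kept"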
9 changes: 9 additions & 0 deletions skrub/_table_vectorizer.py
@@ -16,6 +16,7 @@
from ._clean_categories import CleanCategories
from ._clean_null_strings import CleanNullStrings
from ._datetime_encoder import DatetimeEncoder
from ._drop_null import DropNullColumn
from ._gap_encoder import GapEncoder
from ._on_each_column import SingleColumnTransformer
from ._select_cols import Drop
@@ -191,6 +192,9 @@ class TableVectorizer(TransformerMixin, BaseEstimator):
similar functionality to what is offered by scikit-learn's
:class:`~sklearn.compose.ColumnTransformer`.

drop_null_columns : bool, default=False
    If set to `True`, columns that contain only null values are dropped.

Review comment (Contributor): Do we want it to be True by default?
Reply (rcap107, author): That should be discussed with others, I think.
Review comment (Member): I vote for True by default -- there's nothing we can learn from a completely empty column. If it is False by default, I think it should be set to True in the tabular_learner.

n_jobs : int, default=None
Number of jobs to run in parallel.
``None`` means 1 unless in a joblib ``parallel_backend`` context.
@@ -412,6 +416,7 @@ def __init__(
numeric=NUMERIC_TRANSFORMER,
datetime=DATETIME_TRANSFORMER,
specific_transformers=(),
drop_null_columns=False,
n_jobs=None,
):
self.cardinality_threshold = cardinality_threshold
@@ -425,6 +430,7 @@
self.datetime = _utils.clone_if_default(datetime, DATETIME_TRANSFORMER)
self.specific_transformers = specific_transformers
self.n_jobs = n_jobs
self.drop_null_columns = drop_null_columns

def fit(self, X, y=None):
"""Fit transformer.
@@ -536,6 +542,9 @@ def add_step(steps, transformer, cols, allow_reject=False):
        cols = s.all() - self._specific_columns

        self._preprocessors = [CheckInputDataFrame()]
        if self.drop_null_columns:
            add_step(self._preprocessors, DropNullColumn(), cols, allow_reject=True)
Review comment (Member): We may want to insert it after CleanNullStrings, so that if the column becomes full of nulls after converting "N/A" to null it will be dropped. Also, although it's not important, your transformer never raises a RejectColumn exception, so allow_reject has no effect; you don't need it here and can leave the default.
Reply (rcap107, author): I added it after CleanNullStrings, but I think I did it in an ugly way; maybe it can be fixed.

        for transformer in [
            CleanNullStrings(),
            ToDatetime(),
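To make the effect of the new option concrete (not part of the changeset), here is a small sketch assuming a pandas DataFrame; the exact number of output columns in the kept case depends on the vectorizer's usual per-column preprocessing:

import numpy as np
import pandas as pd

from skrub import TableVectorizer

df = pd.DataFrame({"idx": [1, 2, 3], "all_null": [np.nan, np.nan, np.nan]})

# Default in this revision of the PR: the all-null column is kept and vectorized
kept = TableVectorizer(drop_null_columns=False).fit_transform(df)

# With the option enabled, the all-null column is removed from the output
dropped = TableVectorizer(drop_null_columns=True).fit_transform(df)
assert dropped.shape[1] < kept.shape[1]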
87 changes: 87 additions & 0 deletions skrub/tests/test_drop_nulls.py
@@ -0,0 +1,87 @@
import numpy as np
import pytest
from numpy.testing import assert_array_equal

from skrub import TableVectorizer
from skrub import _dataframe as sbd
from skrub._drop_null import DropNullColumn


@pytest.fixture
def drop_null_table(df_module):
    return df_module.make_dataframe(
        {
            "idx": [1, 2, 3],
            "value_nan": [np.nan, np.nan, np.nan],
            "value_null": [None, None, None],
            "value_almost_nan": [2.5, np.nan, np.nan],
            "value_almost_null": ["almost", None, None],
        }
    )

def test_single_column(drop_null_table):
    """Check that null columns are dropped and non-null columns are kept."""
    dn = DropNullColumn()
    assert dn.fit_transform(drop_null_table["value_nan"]) == []
    assert dn.fit_transform(drop_null_table["value_null"]) == []

    assert_array_equal(
        sbd.to_numpy(sbd.col(drop_null_table, "idx")), np.array([1, 2, 3])
    )

    assert_array_equal(
        sbd.to_numpy(sbd.col(drop_null_table, "value_almost_nan")),
        np.array([2.5, np.nan, np.nan]),
    )

    # assert_array_equal(
    #     sbd.to_numpy(sbd.col(drop_null_table, "value_almost_null")),
    #     np.array(["almost", None, None]),
    # )

Review comment (rcap107, author): Not sure how to write this check so that it works with either pandas or polars.
Reply (Contributor): You could use df_module as a fixture in the test by adding it to the arguments, then compare series instead of numpy arrays:

df_module.assert_column_equal(
    sbd.col(drop_null_table, "value_almost_null"),
    df_module.make_column("value_almost_null", ["almost", None, None]),
)

The test would look like:

def test_single_column(drop_null_table, df_module):
    """Check that null columns are dropped and non-null columns are kept."""
    dn = DropNullColumn()
    assert dn.fit_transform(drop_null_table["value_nan"]) == []
    assert dn.fit_transform(drop_null_table["value_null"]) == []

    df_module.assert_column_equal(
        sbd.col(drop_null_table, "idx"), df_module.make_column("idx", [1, 2, 3])
    )

    df_module.assert_column_equal(
        sbd.col(drop_null_table, "value_almost_nan"),
        df_module.make_column("value_almost_nan", [2.5, np.nan, np.nan]),
    )

    df_module.assert_column_equal(
        sbd.col(drop_null_table, "value_almost_null"),
        df_module.make_column("value_almost_null", ["almost", None, None]),
    )

This also circumvents the fact that, depending on the version of pandas, null values are not treated the same.


def test_drop_null_column(drop_null_table):
    """Check that all null columns are dropped, and no more."""
    # Don't drop null columns
    tv = TableVectorizer(drop_null_columns=False)
    transformed = tv.fit_transform(drop_null_table)

    assert sbd.shape(transformed) == sbd.shape(drop_null_table)

    # Drop null columns
    tv = TableVectorizer(drop_null_columns=True)
    transformed = tv.fit_transform(drop_null_table)
    assert sbd.shape(transformed) == (sbd.shape(drop_null_table)[0], 3)


def test_is_all_null(drop_null_table):
    """Check that is_all_null is evaluating null counts correctly."""
    # Check that all null columns are marked as "all null"
    assert sbd.is_all_null(drop_null_table["value_nan"])
    assert sbd.is_all_null(drop_null_table["value_null"])

    # Check that the other columns are *not* marked as "all null"
    assert not sbd.is_all_null(drop_null_table["value_almost_null"])
    assert not sbd.is_all_null(drop_null_table["value_almost_nan"])
    assert not sbd.is_all_null(drop_null_table["idx"])