Categorical encoders #431

Ama16 · 2021-12-30T07:28:31Z

IMPORTANT: Please do not create a Pull Request without creating an issue first.

Before submitting (must do checklist)

Did you read the contribution guide?
Did you update the docs? We use Numpy format for all the methods and classes.
Did you write any new necessary tests?
Did you update the CHANGELOG?

Type of Change

Examples / docs / tutorials / contributors update
Bug fix (non-breaking change which fixes an issue)
Improvement (non-breaking change which improves an existing feature)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Proposed Changes

Related Issue

Closing issues

Closes #355

codecov-commenter · 2021-12-30T07:34:48Z

Codecov Report

Merging #431 (689a7b6) into master (3415a16) will increase coverage by 0.18%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #431      +/-   ##
==========================================
+ Coverage   87.80%   87.98%   +0.18%     
==========================================
  Files         114      115       +1     
  Lines        5354     5435      +81     
==========================================
+ Hits         4701     4782      +81     
  Misses        653      653

Impacted Files	Coverage Δ
etna/transforms/__init__.py	`100.00% <100.00%> (ø)`
etna/transforms/encoders/__init__.py	`100.00% <100.00%> (ø)`
etna/transforms/encoders/categorical.py	`100.00% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3415a16...689a7b6. Read the comment docs.

Mr-Geekman · 2021-12-30T08:13:39Z

etna/transforms/encoders/categorical.py

+            name of added column. If not given, use `self.__repr__()` or `regressor_{self.__repr__()}` if it is a regressor
+        strategy:
+            filling encoding in not fitted values:
+            - If "new_value", then replace missing dates with '-1'


It seems like "dates" isn't valid here.

Better to use "values" here.

Mr-Geekman · 2021-12-30T08:16:02Z

etna/transforms/encoders/categorical.py

+
+
+class LE(preprocessing.LabelEncoder):
+    def transform(self, y: pd.Series, strategy):


It would be better to create an Enum class, as in other similar cases.
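A minimal sketch of what such an Enum could look like, assuming the three strategy values named in the docstring under review ("new_value", "mean", "none"). The class name `UnknownValueStrategy` is hypothetical and not taken from the PR, and the `_missing_` hook is only one possible way to produce a readable error for unsupported values.

```python
from enum import Enum


class UnknownValueStrategy(str, Enum):
    """Hypothetical enum of strategies for values not seen during fit."""

    new_value = "new_value"
    mean = "mean"
    none = "none"

    @classmethod
    def _missing_(cls, value):
        # Called by Enum when `value` matches no member.
        supported = ", ".join(repr(member.value) for member in cls)
        raise ValueError(
            f"The strategy '{value}' doesn't exist. Supported strategies: {supported}"
        )


# Lookup by value returns the member; an unknown value raises ValueError.
strategy = UnknownValueStrategy("mean")
```

With this shape a transform can compare against `UnknownValueStrategy.mean` instead of a raw string, so a misspelled value fails at construction time rather than deep inside `transform`.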

iKintosh

Mostly okay, but some changes are needed.

Kind reminder: if you think the PR is not ready to merge, you should make it a draft.

iKintosh · 2021-12-30T09:33:49Z

etna/transforms/encoders/categorical.py

+from etna.transforms.base import Transform
+
+
+class LE(preprocessing.LabelEncoder):


Make it _LabelEncoder: this class is not intended for users, and a longer name makes its purpose easier to understand.

iKintosh · 2021-12-30T09:35:39Z

etna/transforms/encoders/categorical.py

+class _OneSegmentLabelEncoderTransform(Transform):
+    """Replace the values in the column with the Label encoding"""


Is the absence of an inverse transform method intentional?

iKintosh · 2021-12-30T09:36:19Z

etna/transforms/encoders/categorical.py

+            name of added column. If not given, use `self.__repr__()` or `regressor_{self.__repr__()}` if it is a regressor
+        strategy:
+            filling encoding in not fitted values:
+            - If "new_value", then replace missing dates with '-1'


Better to use "values" here.

iKintosh · 2021-12-30T09:38:19Z

etna/transforms/encoders/categorical.py

+        self.strategy = strategy
+        super().__init__(transform=_OneSegmentLabelEncoderTransform(self.in_column, self.out_column, self.strategy))
+
+    def _get_out_column(self, out_column: Optional[str]) -> str:


It should be named get_column_name, as in other transforms.

iKintosh · 2021-12-30T09:38:50Z

etna/transforms/encoders/categorical.py

+        return self.__repr__()
+
+
+###############################


Do we need to review what's below this line?

Mr-Geekman · 2022-01-11T14:29:01Z

etna/transforms/encoders/categorical.py

+        elif strategy == "mean":
+            filling_value = np.mean(encoded[~np.isin(y, diff)])
+        else:
+            raise ValueError(f"There are no '{strategy}' strategy exists")


This message doesn't read well.
Maybe better: The strategy {strategy} doesn't exist.
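To make the three strategies from the docstring concrete, here is a plain-Python sketch. It is not the PR's implementation (which subclasses sklearn's LabelEncoder); `encode_with_strategy` is a hypothetical helper, and NaN stands in for None in the "none" case. The error branch uses the wording suggested above.

```python
import math
from statistics import mean


def encode_with_strategy(fit_values, new_values, strategy="new_value"):
    """Label-encode new_values using categories learned from fit_values."""
    # Categories are numbered in sorted order, as a label encoder would do.
    classes = sorted(set(fit_values))
    codes = {value: float(index) for index, value in enumerate(classes)}
    # Codes of the values that were seen during fit.
    known = [codes[value] for value in new_values if value in codes]

    if strategy == "new_value":
        filling_value = -1.0
    elif strategy == "mean":
        filling_value = mean(known) if known else math.nan
    elif strategy == "none":
        filling_value = math.nan
    else:
        raise ValueError(f"The strategy '{strategy}' doesn't exist.")

    # Unseen values receive the filling value chosen above.
    return [codes.get(value, filling_value) for value in new_values]


encoded_new_value = encode_with_strategy(["a", "b"], ["a", "b", "c"], "new_value")
encoded_mean = encode_with_strategy(["a", "b"], ["a", "b", "c"], "mean")
```

Here the unseen value "c" is encoded as -1.0 under "new_value" and as 0.5 (the mean of the known codes 0.0 and 1.0) under "mean".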

Mr-Geekman · 2022-01-11T14:32:14Z

etna/transforms/encoders/categorical.py

+        if self.out_column:
+            return self.out_column
+        if self.in_column.startswith("regressor"):
+            temp_transform = LabelBinarizerTransform(in_column=self.in_column, out_column=self.out_column)


Can't we just use self.__repr__ here?

Mr-Geekman · 2022-01-11T14:32:50Z

etna/transforms/encoders/categorical.py

+        if self.out_column:
+            return self.out_column
+        if self.in_column.startswith("regressor"):
+            temp_transform = LabelEncoderTransform(


Can't we just use self.__repr__ here?

What about this?
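For reference, the naming rule from the docstring (use out_column if given, otherwise `self.__repr__()`, prefixed with `regressor_` for regressor columns) can be sketched as below. `ColumnNamingSketch` and its `__repr__` format are hypothetical stand-ins; the real transforms in the library build their repr differently.

```python
from typing import Optional


class ColumnNamingSketch:
    """Hypothetical stand-in showing the out_column fallback rule."""

    def __init__(self, in_column: str, out_column: Optional[str] = None):
        self.in_column = in_column
        self.out_column = out_column

    def __repr__(self) -> str:
        return f"{type(self).__name__}(in_column='{self.in_column}')"

    def get_column_name(self) -> str:
        # An explicit name always wins.
        if self.out_column:
            return self.out_column
        # Otherwise fall back to the repr, marking regressors by prefix.
        if self.in_column.startswith("regressor"):
            return f"regressor_{self.__repr__()}"
        return self.__repr__()


default_name = ColumnNamingSketch(in_column="regressor_0").get_column_name()
explicit_name = ColumnNamingSketch(in_column="regressor_0", out_column="enc").get_column_name()
```

In this sketch no temporary object is needed, because the repr can be taken from `self` directly before `out_column` is overwritten.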

Mr-Geekman · 2022-01-11T14:33:12Z

etna/transforms/encoders/categorical.py

+        return result_df
+
+
+class LabelBinarizerTransform(PerSegmentWrapper):


Add a description to the class: what is its purpose?

Mr-Geekman · 2022-01-11T14:41:57Z

tests/test_transforms/test_encoders/test_categorical_transform.py

+
+@pytest.fixture
+def two_df_with_new_values():
+    df1 = TSDataset.to_dataset(generate_periodic_df(3, "2020-01-01", 10, 2, n_segments=2))


Use named parameters here, because the meaning of all these parameters is not obvious.

Mr-Geekman · 2022-01-11T15:14:25Z

etna/transforms/encoders/categorical.py

+            - If "mean", then replace missing dates using the mean in encoded column
+            - If "none", then replace missing dates with None
+        inplace:
+            if True, apply resampling inplace to in_column, if False, add transformed column to dataset


Resampling is irrelevant here

Mr-Geekman · 2022-01-11T15:19:49Z

etna/transforms/encoders/categorical.py

+        self.strategy = strategy
+        self.out_column = out_column
+        super().__init__(
+            transform=_OneSegmentLabelEncoderTransform(self.in_column, self.out_column, self.strategy, self.inplace)


Use keyword parameters here, like in_column=self.in_column.
Positional arguments can cause a warning when the object is created with out_column=None.
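A small illustration of why keyword arguments are safer when a constructor takes several parameters of similar type. `InnerTransformSketch` is a hypothetical stand-in, not the `_OneSegmentLabelEncoderTransform` from the PR.

```python
from typing import Optional


class InnerTransformSketch:
    """Hypothetical constructor with several string-like parameters."""

    def __init__(self, in_column: str, out_column: Optional[str], strategy: str):
        self.in_column = in_column
        self.out_column = out_column
        self.strategy = strategy


# Positional call with two arguments swapped: Python accepts it silently,
# so "new_value" ends up as the output column name.
positional = InnerTransformSketch("regressor_0", "new_value", None)

# Keyword call: the order no longer matters and each value is bound by name.
keyword = InnerTransformSketch(strategy="new_value", in_column="regressor_0", out_column=None)
```

The keyword form also keeps working if a parameter is later inserted or reordered in the signature.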

Mr-Geekman · 2022-01-12T12:33:45Z

etna/transforms/encoders/categorical.py

+
+        encoded = _encode(y, uniques=self.classes_, check_unknown=False).astype(float)
+
+        if strategy == "None":


There is no such value as "None", only "none".

Mr-Geekman · 2022-01-12T13:05:27Z

tests/test_transforms/test_encoders/test_categorical_transform.py

+        ("mean", np.array([[0, 0], [1, 0], [0.5, 0]])),
+    ],
+)
+def test_new_value_label(two_df_with_new_values, strategy, expected_values):


label_encoder is omitted from the test name.

The same issue with word order appears in the description.

Mr-Geekman · 2022-01-12T13:06:12Z

tests/test_transforms/test_encoders/test_categorical_transform.py

+    )
+
+
+def test_label_encoder(df_for_categorical_encoding):


Maybe better to name it "test_label_encoder_simple".

Mr-Geekman · 2022-01-12T13:07:00Z

tests/test_transforms/test_encoders/test_categorical_transform.py

+
+
+def test_label_encoder(df_for_categorical_encoding):
+    """Test LabelEncoderTransform correct works."""


It would be more correct to use
"Test that LabelEncoderTransform works correctly in a simple case."

Mr-Geekman · 2022-01-12T13:17:42Z

etna/transforms/encoders/categorical.py

+        return result_df
+
+
+class OneHotEncoderTransform(PerSegmentWrapper):


Describe how it works with new values.
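The behaviour that a later revision of the docstring in this PR settles on (an unknown category yields all-zero one-hot columns) can be sketched in plain Python as follows. `one_hot_encode` is a hypothetical helper, not the PR's transform.

```python
def one_hot_encode(fit_values, new_values):
    """One-hot encode new_values against categories seen in fit_values."""
    # One output column per category seen during fit, in sorted order.
    categories = sorted(set(fit_values))
    return [
        [1 if value == category else 0 for category in categories]
        for value in new_values
    ]


# "c" was not seen during fit, so its row contains only zeros.
rows = one_hot_encode(fit_values=["a", "b"], new_values=["b", "c"])
```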

Mr-Geekman · 2022-01-14T07:20:04Z

etna/transforms/encoders/categorical.py

+class OneHotEncoderTransform(PerSegmentWrapper):
+    """
+    Encode categorical feature as a one-hot numeric features.
+    If unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros.


It seems we have to add an empty line after the first line with the short description (see numpydoc).
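A sketch of the layout being asked for: a one-line summary, a blank line, then the extended description. The class name `OneHotEncoderSketch` is hypothetical, and the wording is adapted from the docstring under review.

```python
import inspect


class OneHotEncoderSketch:
    """Encode a categorical feature as one-hot numeric features.

    If an unknown category is encountered during transform, the resulting
    one-hot encoded columns for this feature will be all zeros.
    """


# inspect.getdoc strips the common indentation, which makes the
# summary / blank line / description structure easy to check.
doc_lines = inspect.getdoc(OneHotEncoderSketch).splitlines()
```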

Mr-Geekman · 2022-01-14T07:21:51Z

tests/test_transforms/test_encoders/test_categorical_transform.py

+    future_ts = train_ts.make_future(10)
+    forecast_ts = model.forecast(future_ts)
+    r2 = R2()
+    assert 1 - r2(forecast_ts, test_ts)["segment_0"] < 1e-5


In metrics we expect y_true to be the first parameter.
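The order matters because R2 is not symmetric in its arguments: the total sum of squares is computed around the mean of y_true. The function below is a plain-Python stand-in using the standard R2 definition, not the library's `R2` class, and the sample numbers are made up for illustration.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0]
forecast = [1.0, 2.0, 4.0]

score = r2_score(y_true=actual, y_pred=forecast)    # 0.5
swapped = r2_score(y_true=forecast, y_pred=actual)  # a different value
```

Following the remark above, the assertion in the test would presumably pass test_ts first and forecast_ts second.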

Mr-Geekman · 2022-01-14T07:23:20Z

tests/test_transforms/test_encoders/test_categorical_transform.py

+    return train_ts, test_ts
+
+
+def test_ohe_sanity(ts_for_ohe_sanity):


I think we should:

Add a description to the test.

Move it to the very end of the file (the fixture should be moved to the end of the fixtures for this file).

Mr-Geekman · 2022-01-14T07:27:50Z

tests/test_transforms/test_encoders/test_categorical_transform.py

+
+    df_to_forecast["segment_0", "target"] = df_regressors["segment_0"]["regressor_0"][:100].apply(f)
+    ts = TSDataset(df=df_to_forecast, freq="D", df_exog=df_regressors)
+    train_ts, test_ts = ts.train_test_split(test_size=10)


I think we should move train_test_split into the test, because they have the same parameter horizon=10.

Mr-Geekman · 2022-01-14T07:33:57Z

tests/test_transforms/test_encoders/test_categorical_transform.py

+
+@pytest.fixture
+def ts_for_ohe_sanity():
+    df_to_forecast = generate_ar_df(100, start_time="2021-01-01", n_segments=1)


Add periods parameter in cases like this.

Mr-Geekman · 2022-01-14T08:14:08Z

etna/transforms/encoders/categorical.py

+        self.strategy = strategy
+        self.out_column = out_column
+        self.out_column = self._get_column_name()
+        super().__init__(transform=_OneSegmentLabelEncoderTransform(self.in_column, self.out_column, self.strategy))


Use keyword parameters to create object.

iKintosh

Seems like everything is okay.

Categorical encoders

a9382ea

iKintosh requested review from iKintosh and Mr-Geekman December 30, 2021 07:43

Mr-Geekman reviewed Dec 30, 2021

iKintosh suggested changes Dec 30, 2021

add all transforms

298c45c

Mr-Geekman reviewed Jan 11, 2022

Mr-Geekman reviewed Jan 11, 2022

Mr-Geekman reviewed Jan 11, 2022

Mr-Geekman linked an issue Jan 12, 2022 that may be closed by this pull request

Add CategoricalEncoderTransforms #355

Closed

1 task

minor fix and changelog

b4757aa

Mr-Geekman reviewed Jan 12, 2022

Ama16 added 2 commits January 13, 2022 13:30

new tests

6827f2b

add sanity test

55f4a00

Mr-Geekman reviewed Jan 14, 2022

Ama16 added 5 commits January 17, 2022 11:50

minor fixes

3c0d2ea

remove inplace

55d7a0e

minor

dc884c3

resolve conflicts

dc43c79

lint

e2ade6f

iKintosh previously approved these changes Jan 17, 2022

Mr-Geekman and others added 2 commits January 18, 2022 11:33

Merge branch 'master' into issue-355

f0741fe

fix docstring

689a7b6

Ama16 dismissed iKintosh’s stale review via 689a7b6 January 18, 2022 08:38

Mr-Geekman approved these changes Jan 18, 2022

Mr-Geekman merged commit 8b93c44 into master Jan 18, 2022

Mr-Geekman deleted the issue-355 branch January 18, 2022 08:56
