Skip to content

Categorical encoders #431

Merged
merged 12 commits into from
Jan 18, 2022
Merged

Categorical encoders #431

merged 12 commits into from
Jan 18, 2022

Conversation

Ama16
Copy link
Contributor

@Ama16 Ama16 commented Dec 30, 2021

IMPORTANT: Please do not create a Pull Request without creating an issue first.

Before submitting (must do checklist)

  • Did you read the contribution guide?
  • Did you update the docs? We use Numpy format for all the methods and classes.
  • Did you write any new necessary tests?
  • Did you update the CHANGELOG?

Type of Change

  • Examples / docs / tutorials / contributors update
  • Bug fix (non-breaking change which fixes an issue)
  • Improvement (non-breaking change which improves an existing feature)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Proposed Changes

Related Issue

Closing issues

Closes #355

@codecov-commenter
Copy link

codecov-commenter commented Dec 30, 2021

Codecov Report

Merging #431 (689a7b6) into master (3415a16) will increase coverage by 0.18%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #431      +/-   ##
==========================================
+ Coverage   87.80%   87.98%   +0.18%     
==========================================
  Files         114      115       +1     
  Lines        5354     5435      +81     
==========================================
+ Hits         4701     4782      +81     
  Misses        653      653              
Impacted Files Coverage Δ
etna/transforms/__init__.py 100.00% <100.00%> (ø)
etna/transforms/encoders/__init__.py 100.00% <100.00%> (ø)
etna/transforms/encoders/categorical.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3415a16...689a7b6. Read the comment docs.

name of added column. If not given, use `self.__repr__()` or `regressor_{self.__repr__()}` if it is a regressor
strategy:
filling encoding in not fitted values:
- If "new_value", then replace missing dates with '-1'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like "dates" isn't valid here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to use values here



class LE(preprocessing.LabelEncoder):
def transform(self, y: pd.Series, strategy):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to create Enum class like in other cases like this

Copy link
Contributor

@iKintosh iKintosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly okay. But need some changes.

Kind reminder: if you think that PR is not ready for merge you should make it a draft)

from etna.transforms.base import Transform


class LE(preprocessing.LabelEncoder):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make it _LabelEncoder, because this class is not intended for users and longer name makes it easier to understand class meaning.

Comment on lines 36 to 37
class _OneSegmentLabelEncoderTransform(Transform):
"""Replace the values in the column with the Label encoding"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no inverse transform method on purpose?

name of added column. If not given, use `self.__repr__()` or `regressor_{self.__repr__()}` if it is a regressor
strategy:
filling encoding in not fitted values:
- If "new_value", then replace missing dates with '-1'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to use values here

self.strategy = strategy
super().__init__(transform=_OneSegmentLabelEncoderTransform(self.in_column, self.out_column, self.strategy))

def _get_out_column(self, out_column: Optional[str]) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be named get_column_name as in other transforms

return self.__repr__()


###############################
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to review what's below this line?

elif strategy == "mean":
filling_value = np.mean(encoded[~np.isin(y, diff)])
else:
raise ValueError(f"There are no '{strategy}' strategy exists")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message doesn't look very good.
May be better: The strategy {strategy} doesn't exist.

if self.out_column:
return self.out_column
if self.in_column.startswith("regressor"):
temp_transform = LabelBinarizerTransform(in_column=self.in_column, out_column=self.out_column)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we use here just self.__repr__?

if self.out_column:
return self.out_column
if self.in_column.startswith("regressor"):
temp_transform = LabelEncoderTransform(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we use here just self.__repr__?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is about this?

return result_df


class LabelBinarizerTransform(PerSegmentWrapper):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add description to the class. What is its purpose.


@pytest.fixture
def two_df_with_new_values():
df1 = TSDataset.to_dataset(generate_periodic_df(3, "2020-01-01", 10, 2, n_segments=2))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use named parameters here because it is not obvious what is the meaning of all these parameters.

- If "mean", then replace missing dates using the mean in encoded column
- If "none", then replace missing dates with None
inplace:
if True, apply resampling inplace to in_column, if False, add transformed column to dataset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resampling is irrelevant here

self.strategy = strategy
self.out_column = out_column
super().__init__(
transform=_OneSegmentLabelEncoderTransform(self.in_column, self.out_column, self.strategy, self.inplace)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use keyword parameters here like in_column=self.in_column.
This can be a reason for warning during creation with out_column=None.

@Mr-Geekman Mr-Geekman linked an issue Jan 12, 2022 that may be closed by this pull request
1 task

encoded = _encode(y, uniques=self.classes_, check_unknown=False).astype(float)

if strategy == "None":
Copy link
Contributor

@Mr-Geekman Mr-Geekman Jan 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Threre is no such value as "None", only "none".

("mean", np.array([[0, 0], [1, 0], [0.5, 0]])),
],
)
def test_new_value_label(two_df_with_new_values, strategy, expected_values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

label_encoder in the name of the test is omitted.

same issue with word order in description

)


def test_label_encoder(df_for_categorical_encoding):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be better to name "test_label_encoder_simple"



def test_label_encoder(df_for_categorical_encoding):
"""Test LabelEncoderTransform correct works."""
Copy link
Contributor

@Mr-Geekman Mr-Geekman Jan 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be more correct to use
"Test that LabelEncoderTransform works correct in a simple case."

return result_df


class OneHotEncoderTransform(PerSegmentWrapper):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Describe how it works with new values.

class OneHotEncoderTransform(PerSegmentWrapper):
"""
Encode categorical feature as a one-hot numeric features.
If unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we have to add empty line after the first line with short description: numpydoc.

future_ts = train_ts.make_future(10)
forecast_ts = model.forecast(future_ts)
r2 = R2()
assert 1 - r2(forecast_ts, test_ts)["segment_0"] < 1e-5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In metric we expect y_true to be the first parameter.

return train_ts, test_ts


def test_ohe_sanity(ts_for_ohe_sanity):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should

  1. Add description to the test.
  2. Move it to the very of the file (fixture should be moved to the end of fixtures for this file).


df_to_forecast["segment_0", "target"] = df_regressors["segment_0"]["regressor_0"][:100].apply(f)
ts = TSDataset(df=df_to_forecast, freq="D", df_exog=df_regressors)
train_ts, test_ts = ts.train_test_split(test_size=10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should move train_test_split into the test, because they have the same parameter horizon=10.


@pytest.fixture
def ts_for_ohe_sanity():
df_to_forecast = generate_ar_df(100, start_time="2021-01-01", n_segments=1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add periods parameter in cases like this.

self.strategy = strategy
self.out_column = out_column
self.out_column = self._get_column_name()
super().__init__(transform=_OneSegmentLabelEncoderTransform(self.in_column, self.out_column, self.strategy))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use keyword parameters to create object.

iKintosh
iKintosh previously approved these changes Jan 17, 2022
Copy link
Contributor

@iKintosh iKintosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like everything is okay.

@Mr-Geekman Mr-Geekman merged commit 8b93c44 into master Jan 18, 2022
@Mr-Geekman Mr-Geekman deleted the issue-355 branch January 18, 2022 08:56
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add CategoricalEncoderTransforms
4 participants