Commit

Merge branch 'feature-engine:main' into profiling_functionality
Okroshiashvili authored Apr 24, 2023
2 parents e98fe3c + feddb06 commit ae009d8
Showing 24 changed files with 550 additions and 18 deletions.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -151,4 +151,4 @@ workflows:
filters:
branches:
only:
- - 1.5.X
+ - 1.6.X
1 change: 1 addition & 0 deletions docs/whats_new/index.rst
@@ -8,6 +8,7 @@ Find out what's new in each new version release.
.. toctree::
:maxdepth: 2

v_160
v_150
v_140
v_130
94 changes: 94 additions & 0 deletions docs/whats_new/v_160.rst
@@ -0,0 +1,94 @@
Version 1.6.X
=============

Version 1.6.0
-------------

Deployed: 16th March 2023

Contributors
~~~~~~~~~~~~

- `Gleb Levitski <https://github.com/GLevv>`_
- `Morgan Sell <https://github.com/Morgan-Sell>`_
- `Alfonso Tobar <https://github.com/datacubeR>`_
- `Nodar Okroshiashvili <https://github.com/Okroshiashvili>`_
- `Luís Seabra <https://github.com/luismavs>`_
- `Kyle Gilde <https://github.com/kylegilde>`_
- `Soledad Galli <https://github.com/solegalli>`_

In this release, we make Feature-engine transformers compatible with the `set_output`
API that Scikit-learn introduced in version 1.2.0. We also align Feature-engine with
the new direction of pandas by removing the `inplace` functionality that our
transformers used under the hood.

We introduce a major change: most of the **categorical encoders can now encode variables
even if they have missing data**.

We are also releasing **3 brand new transformers**: one for discretization, one for feature
selection and one for operations between datetime variables.

We also greatly improved the performance of `DropDuplicateFeatures` and fixed some
smaller bugs.

We'd like to thank all contributors for fixing bugs and expanding the functionality
and documentation of Feature-engine, and those of you who created issues flagging bugs
or requesting new functionality.

New transformers
~~~~~~~~~~~~~~~~

- **ProbeFeatureSelection**: introduces random features and selects variables whose importance is greater than the random ones (`Morgan Sell <https://github.com/Morgan-Sell>`_ and `Soledad Galli <https://github.com/solegalli>`_)
- **DatetimeSubtraction**: creates new features by subtracting datetime variables (`Kyle Gilde <https://github.com/kylegilde>`_ and `Soledad Galli <https://github.com/solegalli>`_)
- **GeometricWidthDiscretiser**: sorts continuous variables into intervals determined by geometric progression (`Gleb Levitski <https://github.com/GLevv>`_)
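
The idea behind probe-based selection can be sketched without Feature-engine. The snippet below is only an illustration: it uses absolute correlation with the target as a stand-in importance measure, and the variable names (`signal`, `noise`, `probe`) are made up for the example. It is not the `ProbeFeatureSelection` implementation, which may derive importance differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Toy data: one informative variable and one uninformative variable.
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = 3.0 * X["signal"] + rng.normal(scale=0.1, size=n)

# Add a random "probe" feature that is unrelated to the target.
X_probe = X.assign(probe=rng.normal(size=n))

# Stand-in importance (an assumption): absolute Pearson correlation with y.
importance = X_probe.corrwith(y).abs()

# Keep only the original variables whose importance beats the probe's.
selected = [col for col in X.columns if importance[col] > importance["probe"]]
print(selected)
```

Any variable that cannot outperform a purely random column is treated as uninformative, which is why the probe acts as a data-driven threshold rather than a fixed cut-off.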

New functionality
~~~~~~~~~~~~~~~~~

- Allow categorical encoders to encode variables with NaN (`Soledad Galli <https://github.com/solegalli>`_)
- Make transformers compatible with new `set_output` functionality from sklearn (`Soledad Galli <https://github.com/solegalli>`_)
- The `ArbitraryDiscretiser()` now includes the lowest limits in the intervals (`Soledad Galli <https://github.com/solegalli>`_)
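
What "including the lowest limit" means can be shown with plain pandas. This is a sketch under the assumption that the behaviour resembles the `include_lowest` option of `pandas.cut`; the series and edges below are invented for the example.

```python
import pandas as pd

values = pd.Series([0, 10, 20, 30])
edges = [0, 10, 20, 30]

# Default: intervals are open on the left, so a value equal to the
# lowest edge falls outside every interval and becomes NaN.
open_left = pd.cut(values, bins=edges)

# include_lowest=True closes the first interval on the left,
# so the lowest edge is assigned to the first bin.
closed_left = pd.cut(values, bins=edges, include_lowest=True)

print(open_left.isna().tolist())
print(closed_left.isna().tolist())
```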

New modules
~~~~~~~~~~~

- New **Datasets** module with functions to load specific datasets (`Alfonso Tobar <https://github.com/datacubeR>`_)
- New **variable_handling** module with functions to automatically select numerical, categorical, or datetime variables (`Soledad Galli <https://github.com/solegalli>`_)
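
Selecting variables by type can be approximated in plain pandas with `select_dtypes`. The sketch below is not the `variable_handling` API, whose function names are not shown here; the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32],
    "city": pd.Series(["Lima", "Oslo"], dtype="object"),
    "signup": pd.to_datetime(["2023-01-01", "2023-02-01"]),
})

# Split the columns into the three groups named in the release note.
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(include=["object", "category"]).columns.tolist()
datetimes = df.select_dtypes(include="datetime").columns.tolist()

print(numerical, categorical, datetimes)
```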

Bug fixes
~~~~~~~~~

- Fixed bug in `DropFeatures()` (`Luís Seabra <https://github.com/luismavs>`_)
- Fixed bug in `RecursiveFeatureElimination()` caused when only 1 feature remained in data (`Soledad Galli <https://github.com/solegalli>`_)

Documentation
~~~~~~~~~~~~~

- Add example code snippets to the selection module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
- Add example code snippets to the outlier module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
- Add example code snippets to the transformation module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
- Add example code snippets to the time series module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
- Add example code snippets to the preprocessing module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
- Add example code snippets to the wrapper module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
- Updated documentation using new Dataset module (`Alfonso Tobar <https://github.com/datacubeR>`_ and `Soledad Galli <https://github.com/solegalli>`_)
- Reorganized Readme badges (`Gleb Levitski <https://github.com/GLevv>`_)
- New Jupyter notebooks for `GeometricWidthDiscretiser` (`Gleb Levitski <https://github.com/GLevv>`_)
- Fixed typos (`Gleb Levitski <https://github.com/GLevv>`_)
- Remove examples using the Boston house dataset (`Soledad Galli <https://github.com/solegalli>`_)
- Update sponsor page and contribute page (`Soledad Galli <https://github.com/solegalli>`_)


Deprecations
~~~~~~~~~~~~

- The class `PRatioEncoder` is no longer supported and was removed from the API (`Soledad Galli <https://github.com/solegalli>`_)

Code improvements
~~~~~~~~~~~~~~~~~

- Massive improvement in the performance (speed) of `DropDuplicateFeatures()` (`Nodar Okroshiashvili <https://github.com/Okroshiashvili>`_)
- Remove `inplace` and other issues related to pandas' new direction (`Luís Seabra <https://github.com/luismavs>`_)
- Move most docstrings to dedicated docstrings module (`Soledad Galli <https://github.com/solegalli>`_)
- Unnest tests for encoders (`Soledad Galli <https://github.com/solegalli>`_)
2 changes: 1 addition & 1 deletion feature_engine/VERSION
@@ -1 +1 @@
- 1.5.2
+ 1.6.0
2 changes: 1 addition & 1 deletion feature_engine/datetime/datetime_subtraction.py
@@ -318,7 +318,7 @@ def _sub(self, dt_df: pd.DataFrame):
new_df[new_varnames] = (
dt_df[self.variables_]
.sub(dt_df[reference], axis=0)
- .apply(lambda s: s / np.timedelta64(1, self.output_unit))
+ .div(np.timedelta64(1, self.output_unit).astype("timedelta64[ns]"))
)

if self.new_variables_names is not None:
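
For readers following the change from `.apply(...)` to `.div(...)`, here is a minimal standalone sketch of the new expression. The frame `dt_df`, its columns, and `output_unit = "D"` are invented for the example. The reason for casting to nanoseconds is presumably that pandas timedeltas do not support a day-resolution `timedelta64`, but that motivation is an assumption.

```python
import numpy as np
import pandas as pd

dt_df = pd.DataFrame({
    "start": pd.to_datetime(["2023-01-01 00:00", "2023-01-10 00:00"]),
    "end": pd.to_datetime(["2023-01-03 12:00", "2023-01-11 00:00"]),
})
output_unit = "D"

# One unit of the requested output, expressed at nanosecond resolution.
one_unit = np.timedelta64(1, output_unit).astype("timedelta64[ns]")

# Subtract the reference column row-wise, then express the result in days.
elapsed = dt_df[["end"]].sub(dt_df["start"], axis=0).div(one_unit)
print(elapsed)
```

The division is vectorised over the whole frame, so no per-column `lambda` is needed.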
2 changes: 1 addition & 1 deletion feature_engine/imputation/drop_missing_data.py
@@ -205,7 +205,7 @@ def return_na_data(self, X: pd.DataFrame) -> pd.DataFrame:
idx = pd.isnull(X[self.variables_]).mean(axis=1) >= self.threshold
idx = idx[idx]
else:
- idx = pd.isnull(X[self.variables_]).any(1)
+ idx = pd.isnull(X[self.variables_]).any(axis=1)
idx = idx[idx]

return X.loc[idx.index, :]
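
The row-selection logic in this hunk can be tried on a toy frame. The data below is made up; the keyword form `any(axis=1)` is used because, as far as I know, recent pandas versions no longer accept the axis as a positional argument.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
})

# True for every row that has at least one missing value.
idx = pd.isnull(X).any(axis=1)

# Keep only the True entries, then use their index to select rows.
idx = idx[idx]
rows_with_na = X.loc[idx.index, :]
print(rows_with_na)
```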
21 changes: 21 additions & 0 deletions feature_engine/outliers/artbitrary.py
@@ -91,6 +91,27 @@ class ArbitraryOutlierCapper(BaseOutlier):
transform:
Cap the variables.

Examples
--------
>>> import pandas as pd
>>> from feature_engine.outliers import ArbitraryOutlierCapper
>>> X = pd.DataFrame(dict(x1 = [1,2,3,4,5,6,7,8,9,10]))
>>> aoc = ArbitraryOutlierCapper(max_capping_dict=dict(x1 = 8),
...                              min_capping_dict=dict(x1 = 2))
>>> aoc.fit(X)
>>> aoc.transform(X)
x1
0 2
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 8
9 8
"""

def __init__(
55 changes: 55 additions & 0 deletions feature_engine/outliers/trimmer.py
@@ -89,6 +89,61 @@ class OutlierTrimmer(WinsorizerBase):
transform:
Remove outliers.

Examples
--------
>>> import pandas as pd
>>> from feature_engine.outliers import OutlierTrimmer
>>> X = pd.DataFrame(dict(x = [0.49671,
...                            -0.1382,
...                            0.64768,
...                            1.52302,
...                            -0.2341,
...                            -17.2341,
...                            1.57921,
...                            0.76743,
...                            -0.4694,
...                            0.54256]))
>>> ot = OutlierTrimmer(capping_method='gaussian', tail='left', fold=3)
>>> ot.fit(X)
>>> ot.transform(X)
x
0 0.49671
1 -0.13820
2 0.64768
3 1.52302
4 -0.23410
5 -17.23410
6 1.57921
7 0.76743
8 -0.46940
9 0.54256

>>> import pandas as pd
>>> from feature_engine.outliers import OutlierTrimmer
>>> X = pd.DataFrame(dict(x = [0.49671,
...                            -0.1382,
...                            0.64768,
...                            1.52302,
...                            -0.2341,
...                            -17.2341,
...                            1.57921,
...                            0.76743,
...                            -0.4694,
...                            0.54256]))
>>> ot = OutlierTrimmer(capping_method='mad', tail='left', fold=3)
>>> ot.fit(X)
>>> ot.transform(X)
x
0 0.49671
1 -0.13820
2 0.64768
3 1.52302
4 -0.23410
6 1.57921
7 0.76743
8 -0.46940
9 0.54256
"""

def transform(self, X: pd.DataFrame) -> pd.DataFrame:
42 changes: 42 additions & 0 deletions feature_engine/outliers/winsorizer.py
@@ -97,6 +97,48 @@ class Winsorizer(WinsorizerBase):
transform:
Cap the variables.

Examples
--------
>>> import numpy as np
>>> import pandas as pd
>>> from feature_engine.outliers import Winsorizer
>>> np.random.seed(42)
>>> X = pd.DataFrame(dict(x = np.random.normal(size = 10)))
>>> wz = Winsorizer(capping_method='mad', tail='both', fold=3)
>>> wz.fit(X)
>>> wz.transform(X)
x
0 0.496714
1 -0.138264
2 0.647689
3 1.523030
4 -0.234153
5 -0.234137
6 1.579213
7 0.767435
8 -0.469474
9 0.542560

>>> import numpy as np
>>> import pandas as pd
>>> from feature_engine.outliers import Winsorizer
>>> np.random.seed(42)
>>> X = pd.DataFrame(dict(x = np.random.normal(size = 10)))
>>> wz = Winsorizer(capping_method='mad', tail='both', fold=3)
>>> wz.fit(X)
>>> wz.transform(X)
x
0 0.496714
1 -0.138264
2 0.647689
3 1.523030
4 -0.234153
5 -0.234137
6 1.579213
7 0.767435
8 -0.469474
9 0.542560
"""

def __init__(
21 changes: 21 additions & 0 deletions feature_engine/preprocessing/match_categories.py
@@ -88,6 +88,27 @@ class MatchCategories(
transform:
Enforce the type of categorical variables as dtype `categorical`.

Examples
--------
>>> import pandas as pd
>>> from feature_engine.preprocessing import MatchCategories
>>> X_train = pd.DataFrame(dict(x1 = ["a","b","c"], x2 = [4,5,6]))
>>> X_test = pd.DataFrame(dict(x1 = ["c","b","a","d"], x2 = [5,6,4,7]))
>>> mc = MatchCategories(missing_values="ignore")
>>> mc.fit(X_train)
>>> mc.transform(X_train)
x1 x2
0 a 4
1 b 5
2 c 6
>>> mc.transform(X_test)
x1 x2
0 c 5
1 b 6
2 a 4
3 NaN 7
"""

def __init__(
44 changes: 44 additions & 0 deletions feature_engine/preprocessing/match_columns.py
@@ -100,6 +100,50 @@ class MatchVariables(BaseEstimator, TransformerMixin, GetFeatureNamesOutMixin):
transform:
Add or delete variables to match those observed in the train set.

Examples
--------
>>> import pandas as pd
>>> from feature_engine.preprocessing import MatchVariables
>>> X_train = pd.DataFrame(dict(x1 = ["a","b","c"], x2 = [4,5,6]))
>>> X_test = pd.DataFrame(dict(x1 = ["c","b","a","d"],
...                            x2 = [5,6,4,7],
...                            x3 = [1,1,1,1]))
>>> mv = MatchVariables(missing_values="ignore")
>>> mv.fit(X_train)
>>> mv.transform(X_train)
x1 x2
0 a 4
1 b 5
2 c 6
>>> mv.transform(X_test)
The following variables are dropped from the DataFrame: ['x3']
x1 x2
0 c 5
1 b 6
2 a 4
3 d 7

>>> import pandas as pd
>>> from feature_engine.preprocessing import MatchVariables
>>> X_train = pd.DataFrame(dict(x1 = ["a","b","c"],
...                             x2 = [4,5,6], x3 = [1,1,1]))
>>> X_test = pd.DataFrame(dict(x1 = ["c","b","a","d"], x2 = [5,6,4,7]))
>>> mv = MatchVariables(missing_values="ignore")
>>> mv.fit(X_train)
>>> mv.transform(X_train)
x1 x2 x3
0 a 4 1
1 b 5 1
2 c 6 1
>>> mv.transform(X_test)
The following variables are added to the DataFrame: ['x3']
x1 x2 x3
0 c 5 NaN
1 b 6 NaN
2 a 4 NaN
3 d 7 NaN
"""

def __init__(
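
The add-or-drop behaviour shown in the docstring examples can be reproduced in plain pandas with `reindex`. This is only a sketch of the idea on made-up frames; whether `MatchVariables` is implemented this way is an assumption.

```python
import pandas as pd

X_train = pd.DataFrame({"x1": ["a", "b", "c"], "x2": [4, 5, 6]})

# The test set lacks x2 and carries an extra column x3.
X_test = pd.DataFrame({"x1": ["c", "b"], "x3": [1, 1]})

# Align the test columns to the train columns:
# x3 is dropped, and the missing x2 is added and filled with NaN.
matched = X_test.reindex(columns=X_train.columns)
print(matched)
```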
22 changes: 22 additions & 0 deletions feature_engine/timeseries/forecasting/expanding_window_features.py
@@ -117,6 +117,28 @@ class ExpandingWindowFeatures(BaseForecastTransformer):
pandas.expanding
pandas.aggregate
pandas.shift

Examples
--------
>>> import pandas as pd
>>> from feature_engine.timeseries.forecasting import ExpandingWindowFeatures
>>> X = pd.DataFrame(dict(date = ["2022-09-18",
...                               "2022-09-19",
...                               "2022-09-20",
...                               "2022-09-21",
...                               "2022-09-22"],
...                       x1 = [1,2,3,4,5],
...                       x2 = [6,7,8,9,10]
...                       ))
>>> ewf = ExpandingWindowFeatures()
>>> ewf.fit_transform(X)
date x1 x2 x1_expanding_mean x2_expanding_mean
0 2022-09-18 1 6 NaN NaN
1 2022-09-19 2 7 1.0 6.0
2 2022-09-20 3 8 1.5 6.5
3 2022-09-21 4 9 2.0 7.0
4 2022-09-22 5 10 2.5 7.5
"""

def __init__(
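
The numbers in the docstring example above can be reproduced with the pandas calls listed under "See also" (`expanding`, aggregation, `shift`). This is a sketch of the computation on the same toy values, not the transformer's code. The suffix `_expanding_mean` mirrors the column names in the example output.

```python
import pandas as pd

X = pd.DataFrame({"x1": [1, 2, 3, 4, 5], "x2": [6, 7, 8, 9, 10]})

# Expanding mean over all rows seen so far, shifted one step forward
# so that each row only uses information from earlier rows.
features = (
    X.expanding(min_periods=1)
    .mean()
    .shift(periods=1)
    .add_suffix("_expanding_mean")
)

result = pd.concat([X, features], axis=1)
print(result)
```

The one-step shift is what leaves the first row as NaN: there is no earlier observation to average.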