Merge branch 'feature-engine:main' into profiling_functionality

feature-engine · Apr 24, 2023 · ae009d8
2 parents e98fe3c + feddb06

Showing 24 changed files with 550 additions and 18 deletions.
diff --git a/.circleci/config.yml b/.circleci/config.yml
@@ -151,4 +151,4 @@ workflows:
           filters:
             branches:
               only:
-                - 1.5.X
+                - 1.6.X
diff --git a/docs/whats_new/index.rst b/docs/whats_new/index.rst
@@ -8,6 +8,7 @@ Find out what's new in each new version release.
 .. toctree::
    :maxdepth: 2
 
+   v_160
    v_150
    v_140
    v_130

diff --git a/docs/whats_new/v_160.rst b/docs/whats_new/v_160.rst
@@ -0,0 +1,94 @@
+Version 1.6.X
+=============
+
+Version 1.6.0
+-------------
+
+Deployed: 16th March 2023
+
+Contributors
+~~~~~~~~~~~~
+
+- `Gleb Levitski <https://github.com/GLevv>`_
+- `Morgan Sell <https://github.com/Morgan-Sell>`_
+- `Alfonso Tobar <https://github.com/datacubeR>`_
+- `Nodar Okroshiashvili <https://github.com/Okroshiashvili>`_
+- `Luís Seabra <https://github.com/luismavs>`_
+- `Kyle Gilde <https://github.com/kylegilde>`_
+- `Soledad Galli <https://github.com/solegalli>`_
+
+In this release, we make Feature-engine transformers compatible with the `set_output`
+API from Scikit-learn, which was released in version 1.2.0. We also make Feature-engine
+compatible with the newest direction of pandas by removing the `inplace` functionality
+that our transformers used under the hood.
+
+We introduce a major change: most of the **categorical encoders can now encode variables
+even if they have missing data**.
+
+We are also releasing **3 brand new transformers**: one for discretization, one for feature
+selection and one for operations between datetime variables.
+
+We also made a major improvement in the speed of `DropDuplicateFeatures` and fixed some
+smaller bugs.
+
+We'd like to thank all contributors for fixing bugs and expanding the functionality
+and documentation of Feature-engine.
+
+Thank you as well to those of you who created issues flagging bugs or
+requesting new functionality.
+
+New transformers
+~~~~~~~~~~~~~~~~
+
+- **ProbeFeatureSelection**: introduces random features and selects variables whose importance is greater than the random ones (`Morgan Sell <https://github.com/Morgan-Sell>`_ and `Soledad Galli <https://github.com/solegalli>`_)
+- **DatetimeSubtraction**: creates new features by subtracting datetime variables (`Kyle Gilde <https://github.com/kylegilde>`_ and `Soledad Galli <https://github.com/solegalli>`_)
+- **GeometricWidthDiscretiser**: sorts continuous variables into intervals determined by geometric progression (`Gleb Levitski <https://github.com/GLevv>`_)
+
+New functionality
+~~~~~~~~~~~~~~~~~
+
+- Allow categorical encoders to encode variables with NaN (`Soledad Galli <https://github.com/solegalli>`_)
+- Make transformers compatible with new `set_output` functionality from sklearn (`Soledad Galli <https://github.com/solegalli>`_)
+- The `ArbitraryDiscretiser()` now includes the lowest limits in the intervals (`Soledad Galli <https://github.com/solegalli>`_)
+
+New modules
+~~~~~~~~~~~
+
+- New **Datasets** module with functions to load specific datasets (`Alfonso Tobar <https://github.com/datacubeR>`_)
+- New **variable_handling** module with functions to automatically select numerical, categorical, or datetime variables (`Soledad Galli <https://github.com/solegalli>`_)
+
+Bug fixes
+~~~~~~~~~
+
+- Fixed bug in `DropFeatures()` (`Luís Seabra <https://github.com/luismavs>`_)
+- Fixed bug in `RecursiveFeatureElimination()` caused when only 1 feature remained in data (`Soledad Galli <https://github.com/solegalli>`_)
+
+Documentation
+~~~~~~~~~~~~~
+
+- Add example code snippets to the selection module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
+- Add example code snippets to the outlier module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
+- Add example code snippets to the transformation module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
+- Add example code snippets to the time series module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
+- Add example code snippets to the preprocessing module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
+- Add example code snippets to the wrapper module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
+- Updated documentation using new Dataset module (`Alfonso Tobar <https://github.com/datacubeR>`_ and `Soledad Galli <https://github.com/solegalli>`_)
+- Reorganized Readme badges (`Gleb Levitski <https://github.com/GLevv>`_)
+- New Jupyter notebooks for `GeometricWidthDiscretiser` (`Gleb Levitski <https://github.com/GLevv>`_)
+- Fixed typos (`Gleb Levitski <https://github.com/GLevv>`_)
+- Remove examples using the Boston house dataset (`Soledad Galli <https://github.com/solegalli>`_)
+- Update sponsor page and contribute page (`Soledad Galli <https://github.com/solegalli>`_)
+
+
+Deprecations
+~~~~~~~~~~~~
+
+- The class `PRatioEncoder` is no longer supported and was removed from the API (`Soledad Galli <https://github.com/solegalli>`_)
+
+Code improvements
+~~~~~~~~~~~~~~~~~
+
+- Massive improvement in the performance (speed) of `DropDuplicateFeatures()` (`Nodar Okroshiashvili <https://github.com/Okroshiashvili>`_)
+- Remove `inplace` and fix other issues related to pandas' new direction (`Luís Seabra <https://github.com/luismavs>`_)
+- Move most docstrings to a dedicated docstrings module (`Soledad Galli <https://github.com/solegalli>`_)
+- Unnest tests for encoders (`Soledad Galli <https://github.com/solegalli>`_)
diff --git a/feature_engine/VERSION b/feature_engine/VERSION
@@ -1 +1 @@
-1.5.2
+1.6.0
diff --git a/feature_engine/datetime/datetime_subtraction.py b/feature_engine/datetime/datetime_subtraction.py
@@ -318,7 +318,7 @@ def _sub(self, dt_df: pd.DataFrame):
             new_df[new_varnames] = (
                 dt_df[self.variables_]
                 .sub(dt_df[reference], axis=0)
-                .apply(lambda s: s / np.timedelta64(1, self.output_unit))
+                .div(np.timedelta64(1, self.output_unit).astype("timedelta64[ns]"))
             )
 
         if self.new_variables_names is not None:
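
Note on this hunk: the per-column `apply` with a lambda is replaced by one vectorised `.div` call on the whole frame. A minimal sketch of the same pattern outside the class follows. The frame, the column names and the `"D"` unit are assumptions for illustration; in the transformer the unit comes from `self.output_unit`.

```python
import numpy as np
import pandas as pd

# Hypothetical datetime columns standing in for self.variables_ and the reference.
df = pd.DataFrame({
    "start": [pd.Timestamp("2023-01-01"), pd.Timestamp("2023-01-10")],
    "end": [pd.Timestamp("2023-01-03"), pd.Timestamp("2023-01-10 12:00")],
})

# Subtract the reference column row-wise; the result holds timedeltas.
delta = df[["end"]].sub(df["start"], axis=0)

# One unit of the output scale, expressed in nanoseconds as in the new line.
one_day = np.timedelta64(1, "D").astype("timedelta64[ns]")

# Dividing timedeltas by a timedelta yields plain floats in the chosen unit.
days = delta.div(one_day)
print(days)
```
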

diff --git a/feature_engine/imputation/drop_missing_data.py b/feature_engine/imputation/drop_missing_data.py
@@ -205,7 +205,7 @@ def return_na_data(self, X: pd.DataFrame) -> pd.DataFrame:
             idx = pd.isnull(X[self.variables_]).mean(axis=1) >= self.threshold
             idx = idx[idx]
         else:
-            idx = pd.isnull(X[self.variables_]).any(1)
+            idx = pd.isnull(X[self.variables_]).any(axis=1)
             idx = idx[idx]
 
         return X.loc[idx.index, :]
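
Note on this hunk: `axis` is now passed by keyword, which recent pandas versions expect for `DataFrame.any`. A small sketch of the row-selection logic shown in the context lines, on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in rows 1 and 2.
X = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# True for every row that has at least one missing value.
mask = pd.isnull(X[["a", "b"]]).any(axis=1)

# Keep only the True entries, then use their index to select the rows.
mask = mask[mask]
rows_with_na = X.loc[mask.index, :]
print(rows_with_na)
```
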

diff --git a/feature_engine/outliers/artbitrary.py b/feature_engine/outliers/artbitrary.py
@@ -91,6 +91,27 @@ class ArbitraryOutlierCapper(BaseOutlier):
     transform:
         Cap the variables.
 
+    Examples
+    --------
+
+    >>> import pandas as pd
+    >>> from feature_engine.outliers import ArbitraryOutlierCapper
+    >>> X = pd.DataFrame(dict(x1 = [1,2,3,4,5,6,7,8,9,10]))
+    >>> aoc = ArbitraryOutlierCapper(max_capping_dict=dict(x1 = 8),
+    ...                              min_capping_dict=dict(x1 = 2))
+    >>> aoc.fit(X)
+    >>> aoc.transform(X)
+       x1
+    0   2
+    1   2
+    2   3
+    3   4
+    4   5
+    5   6
+    6   7
+    7   8
+    8   8
+    9   8
     """
 
     def __init__(
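
Note on the docstring example above: the capped output can be reproduced with plain pandas `clip`. This is only a sketch of the effect, not the transformer's implementation, which this hunk does not show.

```python
import pandas as pd

X = pd.DataFrame(dict(x1=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))

# Values below 2 are raised to 2 and values above 8 are lowered to 8.
capped = X["x1"].clip(lower=2, upper=8)
print(capped.tolist())
```
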

diff --git a/feature_engine/outliers/trimmer.py b/feature_engine/outliers/trimmer.py
@@ -89,6 +89,61 @@ class OutlierTrimmer(WinsorizerBase):
     transform:
         Remove outliers.
 
+    Examples
+    --------
+
+    >>> import pandas as pd
+    >>> from feature_engine.outliers import OutlierTrimmer
+    >>> X = pd.DataFrame(dict(x = [0.49671,
+    ...                         -0.1382,
+    ...                          0.64768,
+    ...                          1.52302,
+    ...                         -0.2341,
+    ...                         -17.2341,
+    ...                          1.57921,
+    ...                          0.76743,
+    ...                         -0.4694,
+    ...                          0.54256]))
+    >>> ot = OutlierTrimmer(capping_method='gaussian', tail='left', fold=3)
+    >>> ot.fit(X)
+    >>> ot.transform(X)
+              x
+    0   0.49671
+    1  -0.13820
+    2   0.64768
+    3   1.52302
+    4  -0.23410
+    5 -17.23410
+    6   1.57921
+    7   0.76743
+    8  -0.46940
+    9   0.54256
+
+    >>> import pandas as pd
+    >>> from feature_engine.outliers import OutlierTrimmer
+    >>> X = pd.DataFrame(dict(x = [0.49671,
+    ...                         -0.1382,
+    ...                          0.64768,
+    ...                          1.52302,
+    ...                         -0.2341,
+    ...                         -17.2341,
+    ...                          1.57921,
+    ...                          0.76743,
+    ...                         -0.4694,
+    ...                          0.54256]))
+    >>> ot = OutlierTrimmer(capping_method='mad', tail='left', fold=3)
+    >>> ot.fit(X)
+    >>> ot.transform(X)
+             x
+    0  0.49671
+    1 -0.13820
+    2  0.64768
+    3  1.52302
+    4 -0.23410
+    6  1.57921
+    7  0.76743
+    8 -0.46940
+    9  0.54256
     """
 
     def transform(self, X: pd.DataFrame) -> pd.DataFrame:
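
Note on the two docstring examples above: the same data keeps -17.2341 under `'gaussian'` but drops it under `'mad'`. A rough sketch of why, using plain pandas. It assumes the left limits are mean - fold * std and median - fold * MAD with an unscaled MAD; whether Feature-engine rescales the MAD is not visible in this diff.

```python
import pandas as pd

x = pd.Series([0.49671, -0.1382, 0.64768, 1.52302, -0.2341,
               -17.2341, 1.57921, 0.76743, -0.4694, 0.54256])
fold = 3

# Gaussian rule: the outlier inflates the standard deviation,
# so the left limit falls below the outlier itself.
gaussian_limit = x.mean() - fold * x.std()

# MAD rule: median-based statistics barely move with one extreme value.
mad = (x - x.median()).abs().median()
mad_limit = x.median() - fold * mad

kept_gaussian = x[x >= gaussian_limit]
kept_mad = x[x >= mad_limit]
print(len(kept_gaussian), len(kept_mad))
```
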

diff --git a/feature_engine/outliers/winsorizer.py b/feature_engine/outliers/winsorizer.py
@@ -97,6 +97,48 @@ class Winsorizer(WinsorizerBase):
     transform:
         Cap the variables.
 
+    Examples
+    --------
+
+    >>> import numpy as np
+    >>> import pandas as pd
+    >>> from feature_engine.outliers import Winsorizer
+    >>> np.random.seed(42)
+    >>> X = pd.DataFrame(dict(x = np.random.normal(size = 10)))
+    >>> wz = Winsorizer(capping_method='mad', tail='both', fold=3)
+    >>> wz.fit(X)
+    >>> wz.transform(X)
+              x
+    0  0.496714
+    1 -0.138264
+    2  0.647689
+    3  1.523030
+    4 -0.234153
+    5 -0.234137
+    6  1.579213
+    7  0.767435
+    8 -0.469474
+    9  0.542560
+
+    >>> import numpy as np
+    >>> import pandas as pd
+    >>> from feature_engine.outliers import Winsorizer
+    >>> np.random.seed(42)
+    >>> X = pd.DataFrame(dict(x = np.random.normal(size = 10)))
+    >>> wz = Winsorizer(capping_method='mad', tail='both', fold=3)
+    >>> wz.fit(X)
+    >>> wz.transform(X)
+              x
+    0  0.496714
+    1 -0.138264
+    2  0.647689
+    3  1.523030
+    4 -0.234153
+    5 -0.234137
+    6  1.579213
+    7  0.767435
+    8 -0.469474
+    9  0.542560
     """
 
     def __init__(

diff --git a/feature_engine/preprocessing/match_categories.py b/feature_engine/preprocessing/match_categories.py
@@ -88,6 +88,27 @@ class MatchCategories(
 
     transform:
         Enforce the type of categorical variables as dtype `categorical`.
+
+    Examples
+    --------
+
+    >>> import pandas as pd
+    >>> from feature_engine.preprocessing import MatchCategories
+    >>> X_train = pd.DataFrame(dict(x1 = ["a","b","c"], x2 = [4,5,6]))
+    >>> X_test = pd.DataFrame(dict(x1 = ["c","b","a","d"], x2 = [5,6,4,7]))
+    >>> mc = MatchCategories(missing_values="ignore")
+    >>> mc.fit(X_train)
+    >>> mc.transform(X_train)
+      x1  x2
+    0  a   4
+    1  b   5
+    2  c   6
+    >>> mc.transform(X_test)
+        x1  x2
+    0    c   5
+    1    b   6
+    2    a   4
+    3  NaN   7
     """
 
     def __init__(
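
Note on the docstring example above: the `NaN` for the unseen category `"d"` is what pandas produces when a column is cast to a categorical dtype that lacks that category. A sketch with plain pandas, as an illustration of the behaviour rather than the transformer's code:

```python
import pandas as pd

X_train = pd.DataFrame({"x1": ["a", "b", "c"]})
X_test = pd.DataFrame({"x1": ["c", "b", "a", "d"]})

# Build a categorical dtype from the categories seen in the train set.
train_dtype = pd.CategoricalDtype(categories=X_train["x1"].unique())

# Casting the test column maps unseen values ("d") to NaN.
encoded = X_test["x1"].astype(train_dtype)
print(encoded)
```
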

diff --git a/feature_engine/preprocessing/match_columns.py b/feature_engine/preprocessing/match_columns.py
@@ -100,6 +100,50 @@ class MatchVariables(BaseEstimator, TransformerMixin, GetFeatureNamesOutMixin):
 
     transform:
         Add or delete variables to match those observed in the train set.
+
+    Examples
+    --------
+
+    >>> import pandas as pd
+    >>> from feature_engine.preprocessing import MatchVariables
+    >>> X_train = pd.DataFrame(dict(x1 = ["a","b","c"], x2 = [4,5,6]))
+    >>> X_test = pd.DataFrame(dict(x1 = ["c","b","a","d"],
+    ...                             x2 = [5,6,4,7],
+    ...                             x3 = [1,1,1,1]))
+    >>> mv = MatchVariables(missing_values="ignore")
+    >>> mv.fit(X_train)
+    >>> mv.transform(X_train)
+      x1  x2
+    0  a   4
+    1  b   5
+    2  c   6
+    >>> mv.transform(X_test)
+    The following variables are dropped from the DataFrame: ['x3']
+      x1  x2
+    0  c   5
+    1  b   6
+    2  a   4
+    3  d   7
+
+    >>> import pandas as pd
+    >>> from feature_engine.preprocessing import MatchVariables
+    >>> X_train = pd.DataFrame(dict(x1 = ["a","b","c"],
+    ...                             x2 = [4,5,6], x3 = [1,1,1]))
+    >>> X_test = pd.DataFrame(dict(x1 = ["c","b","a","d"], x2 = [5,6,4,7]))
+    >>> mv = MatchVariables(missing_values="ignore")
+    >>> mv.fit(X_train)
+    >>> mv.transform(X_train)
+      x1  x2  x3
+    0  a   4   1
+    1  b   5   1
+    2  c   6   1
+    >>> mv.transform(X_test)
+    The following variables are added to the DataFrame: ['x3']
+      x1  x2  x3
+    0  c   5 NaN
+    1  b   6 NaN
+    2  a   4 NaN
+    3  d   7 NaN
     """
 
     def __init__(
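
Note on the docstring examples above: dropping extra columns and adding missing ones as `NaN` matches what pandas `reindex` does along the column axis. The sketch below shows both effects at once; the `x4` column is made up, and this is not claimed to be how `MatchVariables` is implemented.

```python
import pandas as pd

X_train = pd.DataFrame({"x1": ["a", "b", "c"], "x2": [4, 5, 6], "x3": [1, 1, 1]})
# The test set lacks x3 and carries an extra, hypothetical x4.
X_test = pd.DataFrame({"x1": ["c", "b", "a", "d"],
                       "x2": [5, 6, 4, 7],
                       "x4": [0, 0, 0, 0]})

# Align test columns to the train columns: x4 is dropped, x3 is added as NaN.
aligned = X_test.reindex(columns=X_train.columns)
print(aligned)
```
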

diff --git a/feature_engine/timeseries/forecasting/expanding_window_features.py b/feature_engine/timeseries/forecasting/expanding_window_features.py
@@ -117,6 +117,28 @@ class ExpandingWindowFeatures(BaseForecastTransformer):
     pandas.expanding
     pandas.aggregate
     pandas.shift
+
+    Examples
+    --------
+
+    >>> import pandas as pd
+    >>> from feature_engine.timeseries.forecasting import ExpandingWindowFeatures
+    >>> X = pd.DataFrame(dict(date = ["2022-09-18",
+    ...                               "2022-09-19",
+    ...                               "2022-09-20",
+    ...                               "2022-09-21",
+    ...                               "2022-09-22"],
+    ...                       x1 = [1,2,3,4,5],
+    ...                       x2 = [6,7,8,9,10]
+    ...                     ))
+    >>> ewf = ExpandingWindowFeatures()
+    >>> ewf.fit_transform(X)
+             date  x1  x2  x1_expanding_mean  x2_expanding_mean
+    0  2022-09-18   1   6                NaN                NaN
+    1  2022-09-19   2   7                1.0                6.0
+    2  2022-09-20   3   8                1.5                6.5
+    3  2022-09-21   4   9                2.0                7.0
+    4  2022-09-22   5  10                2.5                7.5
     """
 
     def __init__(
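
Note on the docstring example above: the `*_expanding_mean` columns can be reproduced with the pandas calls listed under "See also" (`expanding`, an aggregation, then `shift`). A sketch under that assumption; the one-row shift keeps each feature from seeing the current row's value.

```python
import pandas as pd

X = pd.DataFrame({"x1": [1, 2, 3, 4, 5], "x2": [6, 7, 8, 9, 10]})

# Expanding mean over all rows so far, shifted by one row so that
# the feature at row i only uses rows 0..i-1.
feats = X.expanding().mean().shift(1).add_suffix("_expanding_mean")

out = pd.concat([X, feats], axis=1)
print(out)
```
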