Math features migrate to numpy #774

Open

olikra wants to merge 20 commits into feature-engine:main from olikra:math_features_migrate_to_numpy

Contributor

olikra commented Jun 24, 2024

I migrated the transform method in math_features from pandas.agg to native NumPy functions while keeping backward compatibility. At least, all existing tests for math_features pass.

The following function keys are implemented:

"sum", "np.sum"
"mean", "np.mean"
"min", "np.min"
"max", "np.max"
"prod", "np.prod"
"median", "np.median"
"std", "np.std"
"var", "np.var"

If a given function is not in the collection above, the transformer falls back to pandas.agg. In the Examples section, I described how to call a custom function with numpy.apply_over_axes().
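As a rough illustration of the idea (not the code in this PR), a row-wise pandas aggregation and the equivalent NumPy call on the underlying array should give the same values. The column names and data below are made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({"Age": [20, 21, 19, 18], "Marks": [0.9, 0.8, 0.7, 0.6]})
variables = ["Age", "Marks"]

# pandas path: aggregate across the selected columns, row by row.
pandas_sum = df[variables].agg("sum", axis=1)

# NumPy path: operate directly on the underlying 2-D array.
numpy_sum = np.sum(df[variables].to_numpy(), axis=1)

# Both paths are expected to agree element-wise.
same = np.allclose(pandas_sum.to_numpy(), numpy_sum)
```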

I additionally changed the deprecated

dob_datrange = pd.date_range("2020-02-24", periods=4, freq="T")

to

dob_datrange = pd.date_range("2020-02-24", periods=4, freq="min")

to avoid FutureWarnings from pandas.
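A quick check of the replacement alias, assuming a pandas version that accepts "min" as the minute frequency: the range should contain four timestamps spaced one minute apart.

```python
import pandas as pd

# "min" is the minute alias used in place of "T".
dob_datrange = pd.date_range("2020-02-24", periods=4, freq="min")

# Spacing between consecutive timestamps.
step = dob_datrange[1] - dob_datrange[0]
```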

Please have a look. @solegalli

olikra added 10 commits

June 23, 2024 13:10


          add two tests for pandas.agg custom functions

22b03cf


          add two tests for pandas.agg custom functions

43ac1e1


          add custom function class to creation-folder

bce30b8


          add custom function class to creation-folder

3e1893c


          add test for median

f60867a


          add test for median and var

4fde225


          add doc for customfunction


          add customfunction

df9012f


          finalize test_math_features.py

be0e8bd


          finalize test_math_features.py

2152ab8

codecov bot commented Jun 24, 2024

Codecov Report

Attention: Patch coverage is 98.50746% with 1 line in your changes missing coverage. Please review.

Project coverage is 97.53%. Comparing base (8cfab26) to head (63f26c9).

Files	Patch %	Lines
feature_engine/creation/math_features.py	98.41%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #774      +/-   ##
==========================================
+ Coverage   97.52%   97.53%   +0.01%     
==========================================
  Files         107      108       +1     
  Lines        4283     4341      +58     
  Branches      854      866      +12     
==========================================
+ Hits         4177     4234      +57     
  Misses         62       62              
- Partials       44       45       +1


olikra added 5 commits

June 24, 2024 12:04


          remove section "if len(new_variables_names) != 1:" cause func is alwa…

ebba2c4

…ys an instance of list and the raise can never happen.


          remove section "feature_names = [f"{self.func}_{'_'.join(varlist)}"]"…

47fa4cb

… in _get_new_features_name cause func is always an instance of list and the else gets never reached. Code Section was never covered by tests.


          improve test coverage

2bfbc59


          check if func is a list not neccessary anymore

2b40910


          check if func is a list not neccessary anymore

d77d5ff

This was referenced Jun 25, 2024

MathFeatures seems much slower than pandas.sum() #576

Open

future warnings #693

Closed

olikra added 5 commits

June 26, 2024 17:58


          Change the way to analyze the functions in trasform to avoid the 4 wa…

2f4f6ab

…rning-messages


          added continue in transform for end current np_functions iteration

923c2d4


          add ddof to math_features init to allow complete controll about ddof …

2c92d43

…(pandas default of ddof is 1, numpy default of ddof is 0) default is 1 for backward compatability


          add ddof to math_features init to allow complete controll about ddof …

9c7cf82

…(pandas default of ddof is 1, numpy default of ddof is 0) default is 1 for backward compatability


          add testmethods for ddof

63f26c9
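The ddof commits above refer to the different defaults in pandas and NumPy. A minimal sketch of that difference, using made-up values:

```python
import numpy as np
import pandas as pd

values = [1.0, 2.0, 3.0, 4.0]

pandas_std = pd.Series(values).std()      # pandas default: ddof=1 (sample std)
numpy_std = np.std(values)                # NumPy default: ddof=0 (population std)
numpy_std_ddof1 = np.std(values, ddof=1)  # passing ddof=1 matches pandas
```

This is why a NumPy-based implementation needs to pass ddof explicitly if it wants to reproduce the previous pandas results for "std" and "var".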

solegalli reviewed


Collaborator

solegalli left a comment

Thank you so much for moving forward with these changes. And apologies for the delayed response. I was first sick and then on a short holiday break.

I added some comments. I am not convinced we need the CustomFunctions class, and I wonder whether there is a better way of implementing the functionality than so many elifs.

Could we not do something like what we have in relative features:

feature_engine/feature_engine/creation/relative_features.py

Line 199 in 8cfab26

methods_dict = {

Would be great if you could look at the comments and let me know your thoughts. Thank you!
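One possible shape for the dictionary-based approach, as a sketch only: the names NUMPY_FUNCS and aggregate_rows are hypothetical and do not exist in feature-engine, and the real methods_dict in relative_features.py may be organised differently.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table; the name and layout are illustrative only.
NUMPY_FUNCS = {
    "sum": np.sum, "mean": np.mean, "min": np.min, "max": np.max,
    "prod": np.prod, "median": np.median, "std": np.std, "var": np.var,
}


def aggregate_rows(df, variables, func, ddof=1):
    """Apply func across columns for each row, using NumPy when possible."""
    data = df[variables]
    np_func = NUMPY_FUNCS.get(func) if isinstance(func, str) else None
    if np_func is None:
        # Unknown strings and callables fall back to pandas.
        return data.agg(func, axis=1)
    # Only std and var take ddof; default 1 mirrors the pandas behaviour.
    kwargs = {"ddof": ddof} if func in ("std", "var") else {}
    return pd.Series(np_func(data.to_numpy(), axis=1, **kwargs), index=df.index)


# Usage with made-up data.
example = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 6.0]})
row_sums = aggregate_rows(example, ["a", "b"], "sum")
row_stds = aggregate_rows(example, ["a", "b"], "std")
```

A single lookup replaces the chain of elifs, and anything not in the table goes through the existing pandas path unchanged.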

feature_engine/creation/custom_functions.py

		"""


		class CustomFunctions:

Collaborator

solegalli Jul 3, 2024

Do we need this class? It seems to implement very simple functionality. Can this not be a one-liner in the main MathFeatures class?

Besides that, the class name does not immediately tell what the class is doing; we'd need docstrings and type hints.

feature_engine/creation/math_features.py

@@ @@ -68,6 +69,11 @@ class MathFeatures(BaseCreation): @@
                       one name per function. If None, the transformer will assign arbitrary names,
                       starting with the function and followed by the variables separated by _.
+                  ddof: int, float, default = 1

Collaborator

solegalli Jul 3, 2024

Can ddof ever be a float? ddof is usually an integer, no?

feature_engine/creation/math_features.py

@@ @@ -130,6 +136,54 @@ class MathFeatures(BaseCreation): @@
 1   4         2.5
 2   5         3.5
 3   6         4.5

Collaborator

solegalli Jul 3, 2024

All of this should go in the user guide, not here. It is too much information for the class docstring. But let's hold off on that change until we finalize the class update.

feature_engine/creation/math_features.py

@@ @@ -139,8 +193,19 @@ def __init__( @@
                       new_variables_names: Optional[List[str]] = None,
                       missing_values: str = "raise",
                       drop_original: bool = False,
+                      ddof: Union[int, float] = 1,

Collaborator

solegalli Jul 3, 2024

I think ddof should be just int

feature_engine/creation/math_features.py

		) -> None:

		# casting input parameter func to a list

Collaborator

solegalli Jul 3, 2024

For compatibility with sklearn, we can't modify the parameters that the user enters at init. They need to remain the same. Anything that needs changing must happen in fit and be stored as a new attribute with a trailing underscore, if it is necessary at all.
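A minimal sketch of that convention, using a hypothetical ExampleTransformer rather than the real MathFeatures class: the init stores func untouched, and the list version is created in fit as func_.

```python
class ExampleTransformer:
    """Hypothetical transformer showing the init/fit split; not feature-engine code."""

    def __init__(self, func):
        # Store the parameter exactly as the user passed it.
        self.func = func

    def fit(self, X=None, y=None):
        # Derived state goes into a new attribute with a trailing underscore.
        self.func_ = self.func if isinstance(self.func, list) else [self.func]
        return self


transformer = ExampleTransformer("sum").fit()
```

After fitting, transformer.func is still the original string, while transformer.func_ holds the list used internally.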

feature_engine/creation/math_features.py

+                      def np_transform(np_df, new_variable_names, np_variables, np_functions):
+                          np_result_df = pd.DataFrame()
+                          for np_function_idx, np_function in enumerate(np_functions):
+                              if callable(np_function):

Collaborator

solegalli Jul 3, 2024

If it is a callable, can we not apply it straight away? Isn't getting the name and then going through the if chain adding complexity?

feature_engine/creation/math_features.py

+                              else:
+                                  np_function_name = np_function
+                              if np_function_name in ("sum"):

Collaborator

solegalli Jul 3, 2024

I think asserting equality is faster than scanning for membership in a set. Is that the case? Did you check?
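Apart from speed, note that in the diff line above, ("sum") without a trailing comma is just the string "sum", so the `in` check is a substring test rather than a membership test. A small demonstration with a deliberately wrong name:

```python
name = "su"  # not a valid function name

in_string = name in ("sum")   # True: ("sum") is the string "sum", so this is a substring test
in_tuple = name in ("sum",)   # False: the trailing comma makes a one-element tuple
is_equal = name == "sum"      # False: plain equality
```

Plain equality, or a dictionary lookup keyed by name, avoids the accidental substring match.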

feature_engine/creation/math_features.py

+                                  np_result_df[new_variable_names[np_function_idx]] = pd.Series(
+                                      result
+                                  )
+                                  continue

Collaborator

solegalli Jul 3, 2024

Why do we need continue here? So that it reads the following elif? Should we not just use if, then?

tests/test_creation/test_math_features.py

@@ @@ -4,6 +4,7 @@ @@
               from sklearn.pipeline import Pipeline
               from feature_engine.creation import MathFeatures
+              from feature_engine.creation.custom_functions import CustomFunctions

Collaborator

solegalli Jul 3, 2024

If we do keep this function, we need to test it in a separate test file. We have one test file per class/script.

tests/test_creation/test_math_features.py

@@ @@ -39,6 +40,11 @@ def test_error_if_func_is_dictionary(): @@
                       MathFeatures(variables=["Age", "Name"], func={"A": "sum", "B": "mean"})
+              def test_error_if_ddof_is_not_int_or_float():
+                  with pytest.raises(ValueError):
+                      MathFeatures(variables=["Age", "Name"], func={"std", "var"}, ddof="A")

Collaborator

solegalli Jul 3, 2024

We also want to assert that the error message is the expected one.
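One way to check the message is the `match` argument of pytest.raises. The sketch below uses a hypothetical check_ddof validator and a made-up error message; the actual check and wording in MathFeatures may differ.

```python
import pytest


def check_ddof(ddof):
    """Hypothetical validator; the real MathFeatures check and message may differ."""
    if not isinstance(ddof, int):
        raise ValueError(f"ddof must be an integer. Got {ddof} instead.")
    return ddof


def test_error_if_ddof_is_not_int():
    # `match` is a regular expression searched in the string of the raised exception.
    with pytest.raises(ValueError, match="ddof must be an integer"):
        check_ddof("A")
```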
