Make PandasStandardScaler for HUGS #36

sgoldenCS · 2024-04-26T19:41:53Z

To keep track of each step in preparing for HUGS, I am making an issue/branch for the next step (the data prep module). I think standard scaling is probably fine for our example data. This issue is intended to complete one step of #34.

sgoldenCS · 2024-04-26T19:43:46Z

A quick note: I will develop the scaler here and then merge it into the branch for #34.

sgoldenCS · 2024-04-29T19:09:30Z

I have nearly finished the implementation of a standard scaler. It should be quite robust and includes options for inplace scaling, axis arguments, and a tunable epsilon that controls whether a feature has enough variance to be scaled by its standard deviation.
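For readers following along, here is a minimal sketch of what an epsilon-guarded scaling step could look like. It is not the code from this branch: the function name `scale_with_epsilon`, its default values, and the choice to only center (not divide) low-variance features are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def scale_with_epsilon(df, axis=0, epsilon=1e-8):
    """Hypothetical helper: standard-scale a DataFrame along `axis`.

    Features whose standard deviation is below `epsilon` are only
    centered, never divided by a near-zero value (an assumed policy).
    """
    values = df.to_numpy(dtype=float)
    # keepdims=True keeps the statistics broadcastable for axis 0, 1, or None.
    mean = values.mean(axis=axis, keepdims=True)
    std = values.std(axis=axis, keepdims=True)  # numpy default: ddof=0
    # Replace tiny standard deviations with 1.0 so the division is a no-op.
    safe_std = np.where(std < epsilon, 1.0, std)
    return pd.DataFrame(
        (values - mean) / safe_std, index=df.index, columns=df.columns
    )


df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]})
scaled = scale_with_epsilon(df)
```

In this sketch the constant column `b` ends up as all zeros instead of NaN, which is the failure mode the epsilon is meant to prevent.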

I have written unit tests to check the majority of options. I still need to check the reverse scaling.

I have added several extra methods that follow scikit-learn's naming (fit, transform, inverse_transform); the interface defined in the core simply calls them. I'm open to having the core interface renamed, but it's not super important.
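A rough sketch of that layering, assuming the core interface exposes two entry points. The class name `SketchStandardScaler` and the method names `run` and `reverse` are hypothetical placeholders, not the actual core API or the code in this branch.

```python
import numpy as np
import pandas as pd


class SketchStandardScaler:
    """Illustrative only; not the PandasStandardScaler from this issue."""

    def __init__(self, axis=0, epsilon=1e-8):
        self.axis = axis
        self.epsilon = epsilon
        self.mean_ = None
        self.scale_ = None

    # scikit-learn-style methods
    def fit(self, df):
        values = df.to_numpy(dtype=float)
        self.mean_ = values.mean(axis=self.axis, keepdims=True)
        std = values.std(axis=self.axis, keepdims=True)
        # Low-variance features get a scale of 1.0 (assumed policy).
        self.scale_ = np.where(std < self.epsilon, 1.0, std)
        return self

    def transform(self, df):
        values = (df.to_numpy(dtype=float) - self.mean_) / self.scale_
        return pd.DataFrame(values, index=df.index, columns=df.columns)

    def inverse_transform(self, df):
        values = df.to_numpy(dtype=float) * self.scale_ + self.mean_
        return pd.DataFrame(values, index=df.index, columns=df.columns)

    # Hypothetical core-style entry points that only delegate
    def run(self, df):
        return self.fit(df).transform(df)

    def reverse(self, df):
        return self.inverse_transform(df)


df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 10.0, 10.0, 10.0]})
scaler = SketchStandardScaler()
scaled = scaler.run(df)
restored = scaler.reverse(scaled)
```

With this split, renaming the core interface would only touch the two thin wrappers, and a round trip through `run` and `reverse` should recover the input up to floating-point error.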

Some important things to note:

- The implementation is done mostly by hand to avoid issues with scikit-learn and Pandas:
  - Pandas does not allow axis==None with variance (it produces a deprecation warning and treats it like axis==0).
  - Pandas' implementation of variance defaults to the sample variance (ddof == 1, so we divide by N-1). This differs from numpy, which defaults to ddof == 0 and divides by N.
  - scikit-learn's StandardScaler doesn't accept any axis parameter.
- scikit-learn does some interesting things with output types. It implements a registered set of converters for Pandas and Polars (see https://github.com/scikit-learn/scikit-learn/blob/8721245511de2f225ff5f9aa5f5fadce663cd4a3/sklearn/utils/_set_output.py#L183).
- It's possible that this code works for a few different input datatypes, but I haven't tested anything other than Pandas. I believe the main issue is that the current implementation for inplace == False assumes the input is Pandas. Anything that operates like a numpy array should work fine until the output conversion.
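The difference in variance defaults noted above can be checked with a small example. This is only a demonstration of the library defaults, not part of the scaler itself; the toy DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.0]})

# Pandas defaults to ddof=1 (sample variance, divide by N-1).
pandas_var = df.var()

# numpy defaults to ddof=0 (population variance, divide by N).
numpy_var = np.var(df.to_numpy(), axis=0)

# Passing ddof explicitly makes the two agree.
pandas_pop_var = df.var(ddof=0)

# numpy reduces over all elements with axis=None, so converting to an
# array first sidesteps the Pandas axis=None behavior described above.
global_var = np.var(df.to_numpy(), axis=None)
```

Converting to a numpy array before computing statistics is one way to get consistent behavior for every axis option, which is presumably part of why a by-hand implementation is attractive here.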

sgoldenCS · 2024-05-01T19:16:11Z

I've written the code for reverse scaling and the associated unit test. I think this can be made into a pull request now.

sgoldenCS added the dev label Apr 26, 2024

sgoldenCS assigned dlersch and sgoldenCS Apr 26, 2024

sgoldenCS assigned schr476 Apr 29, 2024

Kishanrajput added this to the HUGS milestone May 1, 2024

sgoldenCS linked a pull request May 9, 2024 that will close this issue

36 make pandasstandardscaler for hugs #38

Open
