Make PandasStandardScaler for HUGS #36

Open
sgoldenCS opened this issue Apr 26, 2024 · 3 comments · May be fixed by #38

Comments

@sgoldenCS
Contributor

To keep track of each step in preparing for HUGS, I am making an issue/branch for the next step (the data prep module). I think standard scaling is probably fine for our example data. This issue is meant to complete one step of #34.

@sgoldenCS
Contributor Author

A quick note: I will develop the scaler here and then merge it into the branch for #34.

@sgoldenCS
Contributor Author

I have nearly finished the implementation of a standard scaler. It should be quite robust and includes options for inplace scaling, axis arguments, and a tunable epsilon that controls whether a feature has enough variance to be scaled by its standard deviation.

I have written unit tests to check the majority of options. I still need to check the reverse scaling.

I have a bunch of extra functions based on the syntax from scikit-learn (fit, transform, inverse_transform) and simply call them using the interface defined in the core. I'm open to having the core interface renamed, but it's not super important.
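
For illustration, here is a minimal sketch of the fit / transform / inverse_transform pattern described above. The class name `SketchStandardScaler`, the default `epsilon` value, and the use of the population standard deviation (`ddof=0`) are assumptions made for this example and are not taken from the HUGS code; the inplace option is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd


class SketchStandardScaler:
    """Hypothetical sketch of a Pandas-aware standard scaler (not the HUGS class)."""

    def __init__(self, epsilon=1e-8, axis=0):
        self.epsilon = epsilon  # assumed default; minimum std needed to rescale
        self.axis = axis        # 0, 1, or None (statistics over every element)
        self.mean_ = None
        self.scale_ = None

    def fit(self, df):
        values = df.to_numpy(dtype=float)
        # keepdims=True lets the statistics broadcast for axis 0, 1, or None.
        self.mean_ = values.mean(axis=self.axis, keepdims=True)
        std = values.std(axis=self.axis, ddof=0, keepdims=True)
        # Features whose std is below epsilon are only centered (scale of 1).
        self.scale_ = np.where(std > self.epsilon, std, 1.0)
        return self

    def transform(self, df):
        scaled = (df.to_numpy(dtype=float) - self.mean_) / self.scale_
        return pd.DataFrame(scaled, index=df.index, columns=df.columns)

    def inverse_transform(self, df):
        restored = df.to_numpy(dtype=float) * self.scale_ + self.mean_
        return pd.DataFrame(restored, index=df.index, columns=df.columns)


# Column "b" is constant, so it falls below epsilon and is only centered.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]})
scaler = SketchStandardScaler().fit(df)
scaled = scaler.transform(df)
restored = scaler.inverse_transform(scaled)
```

Doing the arithmetic on the underlying NumPy array and rebuilding the DataFrame at the end is one way to sidestep the Pandas axis and variance quirks; a round trip through `inverse_transform` should recover the input up to floating-point error.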

Some important things to note:

  • The implementation is done mostly by hand to avoid issues with scikit-learn and Pandas.
    • Pandas does not allow axis==None when computing variance (it produces a deprecation warning and treats it like axis==0).
    • Pandas' implementation of variance defaults to the sample variance (ddof == 1, so we divide by N-1). This differs from numpy, which defaults to ddof == 0 and divides by N.
    • scikit-learn's StandardScaler doesn't accept any axis parameter.
  • scikit-learn does some interesting things with output types. They implement a registered set of converters for Pandas and Polars (see https://github.com/scikit-learn/scikit-learn/blob/8721245511de2f225ff5f9aa5f5fadce663cd4a3/sklearn/utils/_set_output.py#L183)
  • It's possible that this code works for a few different input datatypes, but I haven't tested anything other than Pandas. I believe the main issue is that the current implementation for inplace == False assumes the input is Pandas. Anything that operates like a numpy array should work fine until the output conversion.
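
The difference in default degrees of freedom noted in the list above can be seen in a short example. The data here are made up for illustration; only the documented `ddof` defaults of pandas and NumPy are relied on.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})

# pandas defaults to ddof=1 (sample variance, divisor N-1).
pandas_var = df["x"].var()

# NumPy defaults to ddof=0 (population variance, divisor N).
numpy_var = np.var(df["x"].to_numpy())

# Passing ddof explicitly makes the two libraries agree.
matched_var = df["x"].var(ddof=0)

# Variance over every element (the axis=None case), computed on the NumPy array.
global_var = df.to_numpy().var()
```

For these four values the sum of squared deviations is 5, so `pandas_var` is 5/3 while `numpy_var` and `matched_var` are 5/4.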

@sgoldenCS
Contributor Author

I've written the code for reverse scaling and the associated unit test. I think this can be made into a pull request now.

@Kishanrajput Kishanrajput added this to the HUGS milestone May 1, 2024
@sgoldenCS sgoldenCS linked a pull request May 9, 2024 that will close this issue
4 participants