[FEA] OneHotEncoder support in sklearn Pipeline #3053

Closed
yasmina-altair opened this issue Oct 23, 2020 · 3 comments
Labels: Algorithm API Change (for tracking changes to algorithms that might affect the API), Cython / Python (Cython or Python issue), feature request (New feature or request)

Comments

@yasmina-altair

Is your feature request related to a problem? Please describe.
I would like to use cuml.preprocessing.OneHotEncoder as a step in a scikit-learn Pipeline so that I can chain additional steps and combine multiple preprocessors. However, a Pipeline that contains only OneHotEncoder already fails. For example:

import cudf
from cuml.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

X = cudf.DataFrame({'a': ["y", "y", "n", "n"]})
cat_transformer = Pipeline(steps=[("onehot", OneHotEncoder())])
cat_transformer.fit_transform(X[["a"]])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-862f3f5c6c73> in <module>
      7     ("onehot", OneHotEncoder())
      8 ])
----> 9 cat_transformer.fit_transform(X)

/opt/conda/envs/rapids/lib/python3.7/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    374             fit_params_last_step = fit_params_steps[self.steps[-1][0]]
    375             if hasattr(last_step, 'fit_transform'):
--> 376                 return last_step.fit_transform(Xt, y, **fit_params_last_step)
    377             else:
    378                 return last_step.fit(Xt, y,

TypeError: fit_transform() takes 2 positional arguments but 3 were given
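
The traceback shows Pipeline calling last_step.fit_transform(Xt, y, **fit_params_last_step), so y is forwarded positionally even when the caller never supplies it. The following is a minimal sketch of that failure mode in plain Python. It uses a hypothetical stand-in class rather than cuML itself; EncoderWithoutY and call_like_pipeline are illustrative names and are not part of either library.

```python
class EncoderWithoutY:
    """Hypothetical stand-in for a transformer whose fit_transform has no `y`."""

    def fit_transform(self, X):
        # Return the distinct categories, just to have some output.
        return sorted(set(X))


def call_like_pipeline(last_step, Xt, y=None, **fit_params):
    # Mirrors the call in the traceback: `y` is forwarded positionally,
    # so the method receives (self, Xt, y), i.e. three positional arguments.
    return last_step.fit_transform(Xt, y, **fit_params)


try:
    call_like_pipeline(EncoderWithoutY(), ["y", "y", "n", "n"])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The stand-in raises the same kind of TypeError as above because its fit_transform only declares self and X.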

Describe the solution you'd like
cuML's OneHotEncoder should work in a Pipeline the same way sklearn's OneHotEncoder does:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

X = pd.DataFrame({'a': ["y", "y", "n", "n"]})
cat_transformer = Pipeline(steps=[("onehot", OneHotEncoder())])
cat_transformer.fit_transform(X)
---------------------------------------------------------------------------
<4x2 sparse matrix of type '<class 'numpy.float64'>'
	with 4 stored elements in Compressed Sparse Row format>

Describe alternatives you've considered
Use cuml's OneHotEncoder on its own, without a Pipeline.
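
Another possible stopgap, not mentioned in the original report, is a thin adapter that accepts y and drops it before delegating. The sketch below is plain Python with hypothetical names (TinyEncoder stands in for an encoder without y support, IgnoreY is the adapter). It has not been checked against any particular scikit-learn version, and a wrapper meant for real use inside a Pipeline would probably also need get_params/set_params, for example by subclassing sklearn.base.BaseEstimator.

```python
class TinyEncoder:
    """Hypothetical stand-in for an encoder whose methods do not accept `y`."""

    def fit(self, X):
        self.categories_ = sorted(set(X))
        return self

    def transform(self, X):
        # One row per sample, one 0/1 column per known category.
        return [[int(v == c) for c in self.categories_] for v in X]

    def fit_transform(self, X):
        return self.fit(X).transform(X)


class IgnoreY:
    """Hypothetical adapter: accepts `y` and drops it before delegating."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        self.transformer.fit(X)
        return self

    def transform(self, X):
        return self.transformer.transform(X)

    def fit_transform(self, X, y=None):
        return self.transformer.fit_transform(X)


wrapped = IgnoreY(TinyEncoder())
# Pass `y` positionally, the way the traceback shows Pipeline doing it.
encoded = wrapped.fit_transform(["y", "y", "n", "n"], None)
print(encoded)
```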

Additional context
For reference, I am using cuml version 0.16.0a+882.g5851f4140 and sklearn version 0.23.1.

@tfeher here's an example with a categorical transformer and sklearn's Pipeline. It would be great to have OneHotEncoder working in a Pipeline. I don't see this error with some of the cuml experimental preprocessors.

yasmina-altair added the "? - Needs Triage" (Need team to review and classify) and "feature request" (New feature or request) labels on Oct 23, 2020
viclafargue added the "Algorithm API Change" and "Cython / Python" labels and removed the "? - Needs Triage" label on Oct 23, 2020
viclafargue self-assigned this on Oct 23, 2020
@viclafargue
Contributor

Thank you for opening the issue. It looks like the fit_transform method of the OneHotEncoder class needs to accept a y parameter to be compatible with scikit-learn's Pipeline. This will be corrected.
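
A rough sketch of the kind of signature change described here, assuming the usual scikit-learn convention of an optional y=None that unsupervised transformers ignore. EncoderWithY is a hypothetical pure-Python class for illustration only. It is not cuML's implementation and does not reflect the contents of the eventual fix.

```python
class EncoderWithY:
    """Hypothetical encoder with a Pipeline-friendly signature."""

    def fit(self, X, y=None):
        # `y` is accepted for API compatibility and otherwise ignored.
        self.categories_ = sorted(set(X))
        return self

    def transform(self, X):
        # One row per sample, one 0/1 column per known category.
        return [[int(v == c) for c in self.categories_] for v in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


enc = EncoderWithY()
data = ["y", "y", "n", "n"]
direct = enc.fit_transform(data)                # called without y
pipeline_style = enc.fit_transform(data, None)  # y passed positionally
print(direct == pipeline_style)
```

Because y defaults to None, existing callers that pass only X keep working, while a caller that forwards y positionally no longer triggers the TypeError.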

@wphicks
Contributor

wphicks commented Oct 29, 2020

Looks like this also affects LabelEncoder.

@viclafargue
Contributor

Solved with #3192
