FIX Fixes encoders for string dtypes #15763

Merged
17 commits merged into scikit-learn:master on Oct 27, 2020

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Dec 3, 2019

Reference Issues/PRs

Fixes #15616
Fixes #15726

What does this implement/fix? Explain your changes.

Adds support for string dtypes in the Encoders.

Any other comments?

This PR opens up 'U' dtypes to be encoded the same way as 'O' dtypes.
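For illustration, a minimal sketch of what this enables, assuming the fix as described (numpy unicode input accepted directly, no astype(object) needed):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["cat2"], ["cat1"]])      # dtype='<U4'
enc = OneHotEncoder(sparse=False).fit(X)
print(enc.transform(X))                 # [[0. 1.], [1. 0.]] with categories ['cat1', 'cat2']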

@jnothman
Member

jnothman commented Dec 3, 2019 via email

@thomasjpfan
Member Author

Not supporting unicode and forcing users to use object dtype for their string processing seems a bit user-unfriendly. As seen in #15616 (comment), suppose we improve the error message instead:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(categories=[["cat2", "cat1"]])
ohe.fit([["cat2"], ["cat1"]])  # Works as expected

ohe.fit(np.array([["cat2"], ["cat1"]]))
# Raises "Either pass a python list or cast numpy array to object"?

In the other use case: #15726 (comment)

import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(categories=[['mse', 'mae']], sparse=False)
categories = np.array([['mse'], ['mae']], dtype='object')
encoder.fit(categories)

values = np.array([['mae'], ['mae'], ['mse'], ['mae'], ['mse'], ['mse']])
encoder.transform(values)
# Raises "Either pass a python list or cast numpy array to object"?

In both cases, this does not seem user friendly.

From a maintainability point of view, it is most likely best to raise an error and only support object dtype. Currently, I do not see any negative consequences of supporting both in the encoders, but some may come up.

@rth
Member

rth commented Dec 3, 2019

Not supporting unicode and forcing users to use object dtype for their string processing seems a bit user-unfriendly.

I have not followed the maintainability-related discussions on supporting those so far, but for strings of a few characters there is a ~50-byte overhead per PyObject containing a string. So a column of strings takes 3-5x more memory as object dtype than as a numpy array of dtype 'U'. That can be the difference between fitting in memory or not (and has a significant run-time impact).

However, I imagine most users would use a pandas dataframe as input, which for now doesn't support memory-efficient (e.g. numpy) string arrays (pandas-dev/pandas#8640). Also, using pd.Categorical solves some of the same issues, with the added bonus of extra functionality. So maybe focusing on supporting pd.Categorical (and documenting it as the preferred way) would be enough.
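As a rough illustration of the pd.Categorical option (the numbers below are approximate and assume highly repeated labels):

import numpy as np
import pandas as pd

labels = np.tile(np.array(["mse", "mae"], dtype=object), 500_000)  # 1M repeated labels
s_obj = pd.Series(labels)
s_cat = s_obj.astype("category")

print(s_obj.memory_usage(deep=True))  # on the order of tens of MB (per-element PyObject accounting)
print(s_cat.memory_usage(deep=True))  # on the order of 1 MB (int8 codes + the two category strings)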

Though the fix here is straightforward enough and shouldn't be too problematic to maintain, I think.

@jnothman
Member

jnothman commented Dec 3, 2019

Maybe we just need some tools to help us generate and test each of these cases.

@thomasjpfan
Member Author

thomasjpfan commented Dec 4, 2019

After speaking to @amueller, I think this is not the right fix because of how surprising numpy's fixed-width unicode arrays can be:

import numpy as np

x = np.array(['a', 'b'])
x[0] = 'hello'
x
# array(['h', 'b'], dtype='<U1')

In these cases we should convert the unicode dtypes to objects.
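For contrast, a small sketch of what converting to object avoids:

import numpy as np

x = np.array(['a', 'b']).astype(object)  # drop the fixed-width '<U1' dtype
x[0] = 'hello'
print(x)                                 # ['hello' 'b'] -- no truncation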

@thomasjpfan thomasjpfan closed this Dec 4, 2019
@rth
Member

rth commented Dec 4, 2019

In these cases we should convert the unicode dtypes to objects.

Except when it doesn't fit in memory while an array of strings does. But yes, string-array behavior can potentially be dangerous for unaware users.

@thomasjpfan
Member Author

thomasjpfan commented Dec 4, 2019

For imputing, because of the string-array behavior, it would be very unexpected to impute a string value and find that some of it is cut off.

Thinking about this more: the encoders never modify the original array, so this PR should be okay.
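A small sketch of the truncation risk described above (a direct in-place fill, not the actual SimpleImputer code):

import numpy as np

x = np.array(['cat', 'dog', ''], dtype='<U3')
x[2] = 'missing'     # fill a missing entry in place
print(x)             # ['cat' 'dog' 'mis'] -- the fill value is silently cut to 3 characters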

@thomasjpfan thomasjpfan reopened this Dec 4, 2019
@glemaitre
Member

So if we start supporting these dtypes, we should make it explicit in the documentation, all the more so if we cannot support arrays of unicode strings in all estimators (data in imputers, column names in ColumnTransformer).

@jnothman
Member

Please add a what's new entry, @thomasjpfan

@cmarmo
Contributor

cmarmo commented Oct 8, 2020

@thomasjpfan, one approval already: do you think this is worth including in 0.24?

@thomasjpfan thomasjpfan changed the title [MRG] BUG Fixes encoders for string dtypes [MRG] ENH Fixes encoders for string dtypes Oct 8, 2020
@thomasjpfan thomasjpfan changed the title [MRG] ENH Fixes encoders for string dtypes [MRG] FIX Fixes encoders for string dtypes Oct 8, 2020
@thomasjpfan thomasjpfan changed the title [MRG] FIX Fixes encoders for string dtypes FIX Fixes encoders for string dtypes Oct 8, 2020
@thomasjpfan
Member Author

do you think this is worth including in 0.24?

I think it would be nice to have because it resolves two issues.

@cmarmo cmarmo added this to the 0.24 milestone Oct 8, 2020
Member

@glemaitre glemaitre left a comment


LGTM.

3 review comments on sklearn/preprocessing/tests/test_encoders.py (outdated, resolved)
@glemaitre glemaitre self-requested a review October 21, 2020 10:18
@glemaitre
Member

@thomasjpfan Could you resolve the merge conflicts as well?

@glemaitre
Member

glemaitre commented Oct 21, 2020

I just want to illustrate that we would not be able to use this feature in a ColumnTransformer if we have an estimator that does not support it (e.g. SimpleImputer).

import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import FunctionTransformer

np.random.seed(0)

# Load data from https://www.openml.org/d/40945
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

categorical_features = ['embarked', 'sex', 'pclass']
X = X[categorical_features].values.astype(str)
categorical_features = [0, 1, 2]

# Note: `func` is not defined in the original snippet; a pass-through placeholder
# is assumed here so that the example runs up to the reported error below.
def func(X):
    return X

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('converter', FunctionTransformer(func=func)),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-03641e4360e7> in <module>
     33 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
     34 
---> 35 clf.fit(X_train, y_train)

~/Documents/packages/scikit-learn/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    335         """
    336         fit_params_steps = self._check_fit_params(**fit_params)
--> 337         Xt = self._fit(X, y, **fit_params_steps)
    338         with _print_elapsed_time('Pipeline',
    339                                  self._log_message(len(self.steps) - 1)):

~/Documents/packages/scikit-learn/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    297                 cloned_transformer = clone(transformer)
    298             # Fit or load from cache the current transformer
--> 299             X, fitted_transformer = fit_transform_one_cached(
    300                 cloned_transformer, X, y, None,
    301                 message_clsname='Pipeline',

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350 
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353 
    354     def call_and_shelve(self, *args, **kwargs):

~/Documents/packages/scikit-learn/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    749     with _print_elapsed_time(message_clsname, message):
    750         if hasattr(transformer, 'fit_transform'):
--> 751             res = transformer.fit_transform(X, y, **fit_params)
    752         else:
    753             res = transformer.fit(X, y, **fit_params).transform(X)

~/Documents/packages/scikit-learn/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
    536         self._validate_remainder(X)
    537 
--> 538         result = self._fit_transform(X, y, _fit_transform_one)
    539 
    540         if not result:

~/Documents/packages/scikit-learn/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
    463             self._iter(fitted=fitted, replace_strings=True))
    464         try:
--> 465             return Parallel(n_jobs=self.n_jobs)(
    466                 delayed(func)(
    467                     transformer=clone(trans) if not fitted else trans,

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
   1027             # remaining jobs.
   1028             self._iterating = False
-> 1029             if self.dispatch_one_batch(iterator):
   1030                 self._iterating = self._original_iterator is not None
   1031 

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    845                 return False
    846             else:
--> 847                 self._dispatch(tasks)
    848                 return True
    849 

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in _dispatch(self, batch)
    763         with self._lock:
    764             job_idx = len(self._jobs)
--> 765             job = self._backend.apply_async(batch, callback=cb)
    766             # A job can complete so quickly than its callback is
    767             # called before we get here, causing self._jobs to

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    206     def apply_async(self, func, callback=None):
    207         """Schedule a func to be run"""
--> 208         result = ImmediateResult(func)
    209         if callback:
    210             callback(result)

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    570         # Don't delay the application, to avoid keeping the input
    571         # arguments in memory
--> 572         self.results = batch()
    573 
    574     def get(self):

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in __call__(self)
    250         # change the default number of processes to -1
    251         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 252             return [func(*args, **kwargs)
    253                     for func, args, kwargs in self.items]
    254 

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0)
    250         # change the default number of processes to -1
    251         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 252             return [func(*args, **kwargs)
    253                     for func, args, kwargs in self.items]
    254 

~/Documents/packages/scikit-learn/sklearn/utils/fixes.py in __call__(self, *args, **kwargs)
    220     def __call__(self, *args, **kwargs):
    221         with config_context(**self.config):
--> 222             return self.function(*args, **kwargs)

~/Documents/packages/scikit-learn/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    749     with _print_elapsed_time(message_clsname, message):
    750         if hasattr(transformer, 'fit_transform'):
--> 751             res = transformer.fit_transform(X, y, **fit_params)
    752         else:
    753             res = transformer.fit(X, y, **fit_params).transform(X)

~/Documents/packages/scikit-learn/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    372         """
    373         fit_params_steps = self._check_fit_params(**fit_params)
--> 374         Xt = self._fit(X, y, **fit_params_steps)
    375 
    376         last_step = self._final_estimator

~/Documents/packages/scikit-learn/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    297                 cloned_transformer = clone(transformer)
    298             # Fit or load from cache the current transformer
--> 299             X, fitted_transformer = fit_transform_one_cached(
    300                 cloned_transformer, X, y, None,
    301                 message_clsname='Pipeline',

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350 
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353 
    354     def call_and_shelve(self, *args, **kwargs):

~/Documents/packages/scikit-learn/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    749     with _print_elapsed_time(message_clsname, message):
    750         if hasattr(transformer, 'fit_transform'):
--> 751             res = transformer.fit_transform(X, y, **fit_params)
    752         else:
    753             res = transformer.fit(X, y, **fit_params).transform(X)

~/Documents/packages/scikit-learn/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    718         else:
    719             # fit method of arity 2 (supervised transformation)
--> 720             return self.fit(X, y, **fit_params).transform(X)
    721 
    722 

~/Documents/packages/scikit-learn/sklearn/impute/_base.py in fit(self, X, y)
    282         self : SimpleImputer
    283         """
--> 284         X = self._validate_input(X, in_fit=True)
    285 
    286         # default fill_value is 0 for numerical input and "missing_value"

~/Documents/packages/scikit-learn/sklearn/impute/_base.py in _validate_input(self, X, in_fit)
    260         _check_inputs_dtype(X, self.missing_values)
    261         if X.dtype.kind not in ("i", "u", "f", "O"):
--> 262             raise ValueError("SimpleImputer does not support data with dtype "
    263                              "{0}. Please provide either a numeric array (with"
    264                              " a floating point or integer dtype) or "

ValueError: SimpleImputer does not support data with dtype <U6. Please provide either a numeric array (with a floating point or integer dtype) or categorical data represented either as an array with integer dtype or an array of string values with an object dtype.

@glemaitre
Member

So it might be worth merging this, but then we need to make all estimators support this dtype.

@ogrisel
Member

ogrisel commented Oct 27, 2020

@rth said above:

I have not followed the maintainability-related discussions on supporting those so far, but for strings of a few characters there is a ~50-byte overhead per PyObject containing a string. So a column of strings takes 3-5x more memory as object dtype than as a numpy array of dtype 'U'. That can be the difference between fitting in memory or not (and has a significant run-time impact).

Actually, the numpy array does not store a Python object but a reference to a Python object. For categorical variables where there is a lot of repetition this will typically be only 8 bytes per entry (+ one PyObject per unique category, outside of the array itself). This is larger than dtype('<U1'), equal to dtype('<U2') but smaller than dtype('<U3'):

>>> import numpy as np
>>> X_obj = np.tile(np.array(["a", "b", "c"], dtype=object), int(1e6))
>>> X_obj.nbytes / 1e6
24.0
>>> X_obj.itemsize
8
>>> X_str = X_obj.astype(str)
>>> X_str.dtype
dtype('<U1')
>>> X_str.nbytes / 1e6
12.0
>>> X_str.itemsize
4

So as long as the cardinality is limited, using dtype=object for categorical variables is almost always more memory efficient than unicode (or even bytes) string dtypes.

However, for free-text data (e.g. first names / last names, street names, titles, descriptions...) where the entries are rarely or never repeated, the unicode string dtype might make sense as long as the lengths are homogeneous. If the length of one entry is 10x larger than the median length, then the padding incurred by the unicode string dtype will also blow up the memory usage and the object dtype would again be more memory efficient.
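A quick sketch of that padding effect:

import numpy as np

names = ["ann", "bob", "a" * 100]        # one entry much longer than the median
X_str = np.array(names)                  # dtype='<U100': every entry is padded to 100 characters
X_obj = np.array(names, dtype=object)

print(X_str.nbytes)                      # 1200 bytes (3 * 100 chars * 4 bytes each)
print(X_obj.nbytes)                      # 24 bytes of references (the three PyObjects live outside the array)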

The only practical case where the numpy unicode / bytes string dtype could be valid would be to store fixed length string identifiers or hash digests in hexadecimal notation (as the hash size is typically larger than 64 bits). But those are typically useless variables in a machine learning context (because of the lack of repetition).

pandas 1.0 has an experimental variable-length StringDtype extension (https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html), which looks optimal for (non-repeated) text-valued columns, in addition to CategoricalDtype, which is efficient for repeated string label values; numpy has neither.
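For reference, a minimal sketch of that experimental pandas dtype (requires pandas >= 1.0):

import pandas as pd

s = pd.Series(["first", "second", None], dtype="string")  # pd.StringDtype()
print(s.dtype)                                            # string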

That being said, we could still change OneHotEncoder to accept unicode (or bytes) string dtypes for convenience, but the memory-efficiency argument for the numpy unicode dtype is almost never valid in data-science applications.

Member

@ogrisel ogrisel left a comment


I think we should definitely include a fix for the very confusing #15726. As unicode strings were not explicitly rejected in 0.23, let us do this quick fix, which is a net improvement, even if the full scikit-learn code base (e.g. ColumnTransformer) does not accept unicode strings consistently.

We can always decide later whether we want to add unicode string support to ColumnTransformer for consistency, or deprecate unicode string support in the categorical encoders in light of my comment above on dtypes. If we want better memory efficiency, we might consider a deeper integration with pandas Categorical, but that sounds like a potentially significant maintenance effort.

@ogrisel
Member

ogrisel commented Oct 27, 2020

I will fix the linting failure I caused by resolving the conflicts using the online GitHub UI...

@ogrisel
Member

ogrisel commented Oct 27, 2020

There is a more fundamental problem in my merge commit. I am investigating.

@ogrisel ogrisel merged commit 53add71 into scikit-learn:master Oct 27, 2020
@ogrisel
Member

ogrisel commented Oct 27, 2020

Merged. Thanks all!

thomasjpfan added a commit to thomasjpfan/scikit-learn that referenced this pull request Oct 28, 2020
Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>