FIX Fixes encoders for string dtypes #15763

Merged
17 commits merged into scikit-learn:master on Oct 27, 2020

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Dec 3, 2019

Reference Issues/PRs

Fixes #15616
Fixes #15726

What does this implement/fix? Explain your changes.

Adds support for string dtypes in the Encoders.

Any other comments?

This PR opens up 'U' dtypes to be encoded the same way as 'O' dtypes.
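For illustration, a minimal sketch of what this enables, assuming the fix as described (numpy unicode input accepted directly, no astype(object) needed):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["cat2"], ["cat1"]])      # dtype='<U4'
enc = OneHotEncoder(sparse=False).fit(X)
print(enc.transform(X))                 # [[0. 1.], [1. 0.]] with categories ['cat1', 'cat2']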

@jnothman
Member

jnothman commented Dec 3, 2019 via email

@thomasjpfan
Member Author

Not supporting unicode and forcing users to use object dtype for their string processing seems a bit user-unfriendly. As seen in #15616 (comment), suppose we improve the error message instead:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(categories=[["cat2", "cat1"]])
ohe.fit([["cat2"], ["cat1"]])  # Works as expected

ohe.fit(np.array([["cat2"], ["cat1"]]))
# Raises "Either pass a python list or cast numpy array to object"?

In the other use case: #15726 (comment)

import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(categories=[['mse', 'mae']], sparse=False)
categories = np.array([['mse'], ['mae']], dtype='object')
encoder.fit(categories)

values = np.array([['mae'], ['mae'], ['mse'], ['mae'], ['mse'], ['mse']])
encoder.transform(values)
# Raises "Either pass a python list or cast numpy array to object"?

In both cases, this does not seem user friendly.

From a maintainability point of view, it is most likely best to raise an error and only support object dtype. Currently, I do not see any negative consequences of supporting both in the encoders, but some may come up.

@rth
Member

rth commented Dec 3, 2019

Not supporting unicode and forcing users to use object dtype for their string processing seems a bit user-unfriendly.

I have not followed the maintainability-related discussions on supporting those so far, but for strings of a few characters there is a ~50-byte overhead per PyObject containing a string. So a column of strings takes 3-5x more memory as object dtype than as a numpy array of dtype 'U'. That can be the difference between fitting in memory or not (and has a significant run-time impact).

However, I imagine most users would use a pandas dataframe as input, which for now doesn't support memory-efficient (e.g. numpy) string arrays (pandas-dev/pandas#8640). Also, using pd.Categorical solves some of the same issues, with the added bonus of extra functionality. So maybe focusing on supporting pd.Categorical (and documenting it as the preferred way) would be enough.
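As a rough illustration of the pd.Categorical option (the numbers below are approximate and assume highly repeated labels):

import numpy as np
import pandas as pd

labels = np.tile(np.array(["mse", "mae"], dtype=object), 500_000)  # 1M repeated labels
s_obj = pd.Series(labels)
s_cat = s_obj.astype("category")

print(s_obj.memory_usage(deep=True))  # on the order of tens of MB (per-element PyObject accounting)
print(s_cat.memory_usage(deep=True))  # on the order of 1 MB (int8 codes + the two category strings)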

Though the fix here is straightforward enough and shouldn't be too problematic to maintain, I think.

@jnothman
Member

jnothman commented Dec 3, 2019

Maybe we just need some tools to help us generate and test each of these cases.

@thomasjpfan
Member Author

thomasjpfan commented Dec 4, 2019

After speaking to @amueller, I think this is not the right fix because of how surprising numpy's fixed-width unicode arrays can be:

import numpy as np

x = np.array(['a', 'b'])
x[0] = 'hello'
x
# array(['h', 'b'], dtype='<U1')

In these cases we should convert the unicode dtypes to objects.
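For contrast, a small sketch of what converting to object avoids:

import numpy as np

x = np.array(['a', 'b']).astype(object)  # drop the fixed-width '<U1' dtype
x[0] = 'hello'
print(x)                                 # ['hello' 'b'] -- no truncation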

@thomasjpfan thomasjpfan closed this Dec 4, 2019
@rth
Member

rth commented Dec 4, 2019

In these cases we should convert the unicode dtypes to objects.

Except when it doesn't fit in memory while an array of strings does. But yes, string-array behavior can potentially be dangerous for unaware users.

@thomasjpfan
Member Author

thomasjpfan commented Dec 4, 2019

For imputing, because of the string-array behavior, it would be very unexpected to impute a string value and find that some of it is cut off.

Thinking about this more: the encoders never modify the original array, so this PR should be okay.
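A small sketch of the truncation risk described above (a direct in-place fill, not the actual SimpleImputer code):

import numpy as np

x = np.array(['cat', 'dog', ''], dtype='<U3')
x[2] = 'missing'     # fill a missing entry in place
print(x)             # ['cat' 'dog' 'mis'] -- the fill value is silently cut to 3 characters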

@thomasjpfan thomasjpfan reopened this Dec 4, 2019
@glemaitre
Member

So if we start supporting these dtypes, we should make it explicit in the documentation, all the more so if we cannot support arrays of unicode strings in all estimators (data in imputers, column names in ColumnTransformer).

@jnothman
Member

Please add a what's new entry, @thomasjpfan

@cmarmo
Contributor

cmarmo commented Oct 8, 2020

@thomasjpfan, one approval already: do you think this is worth including in 0.24?

@thomasjpfan thomasjpfan changed the title [MRG] BUG Fixes encoders for string dtypes [MRG] ENH Fixes encoders for string dtypes Oct 8, 2020
@thomasjpfan thomasjpfan changed the title [MRG] ENH Fixes encoders for string dtypes [MRG] FIX Fixes encoders for string dtypes Oct 8, 2020
@thomasjpfan thomasjpfan changed the title [MRG] FIX Fixes encoders for string dtypes FIX Fixes encoders for string dtypes Oct 8, 2020
@thomasjpfan
Member Author

do you think this is worth including in 0.24?

I think it would be nice to have because it resolves two issues.

@cmarmo cmarmo added this to the 0.24 milestone Oct 8, 2020
Member

@glemaitre glemaitre left a comment


LGTM.

3 review comments on sklearn/preprocessing/tests/test_encoders.py (outdated, resolved)
@glemaitre glemaitre self-requested a review October 21, 2020 10:18
@glemaitre
Member

@thomasjpfan Could you resolve the merge conflicts as well?

@glemaitre
Member

glemaitre commented Oct 21, 2020

I just want to illustrate that we would not be able to use this feature in a ColumnTransformer if we have an estimator that does not support it (e.g. SimpleImputer).

import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import FunctionTransformer

np.random.seed(0)

# Load data from https://www.openml.org/d/40945
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

categorical_features = ['embarked', 'sex', 'pclass']
X = X[categorical_features].values.astype(str)
categorical_features = [0, 1, 2]

# Note: `func` is not defined in the original snippet; a pass-through placeholder
# is assumed here so that the example runs up to the reported error below.
def func(X):
    return X

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('converter', FunctionTransformer(func=func)),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-03641e4360e7> in <module>
     33 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
     34 
---> 35 clf.fit(X_train, y_train)

~/Documents/packages/scikit-learn/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    335         """
    336         fit_params_steps = self._check_fit_params(**fit_params)
--> 337         Xt = self._fit(X, y, **fit_params_steps)
    338         with _print_elapsed_time('Pipeline',
    339                                  self._log_message(len(self.steps) - 1)):

~/Documents/packages/scikit-learn/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    297                 cloned_transformer = clone(transformer)
    298             # Fit or load from cache the current transformer
--> 299             X, fitted_transformer = fit_transform_one_cached(
    300                 cloned_transformer, X, y, None,
    301                 message_clsname='Pipeline',

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350 
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353 
    354     def call_and_shelve(self, *args, **kwargs):

~/Documents/packages/scikit-learn/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    749     with _print_elapsed_time(message_clsname, message):
    750         if hasattr(transformer, 'fit_transform'):
--> 751             res = transformer.fit_transform(X, y, **fit_params)
    752         else:
    753             res = transformer.fit(X, y, **fit_params).transform(X)

~/Documents/packages/scikit-learn/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
    536         self._validate_remainder(X)
    537 
--> 538         result = self._fit_transform(X, y, _fit_transform_one)
    539 
    540         if not result:

~/Documents/packages/scikit-learn/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
    463             self._iter(fitted=fitted, replace_strings=True))
    464         try:
--> 465             return Parallel(n_jobs=self.n_jobs)(
    466                 delayed(func)(
    467                     transformer=clone(trans) if not fitted else trans,

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
   1027             # remaining jobs.
   1028             self._iterating = False
-> 1029             if self.dispatch_one_batch(iterator):
   1030                 self._iterating = self._original_iterator is not None
   1031 

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    845                 return False
    846             else:
--> 847                 self._dispatch(tasks)
    848                 return True
    849 

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in _dispatch(self, batch)
    763         with self._lock:
    764             job_idx = len(self._jobs)
--> 765             job = self._backend.apply_async(batch, callback=cb)
    766             # A job can complete so quickly than its callback is
    767             # called before we get here, causing self._jobs to

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    206     def apply_async(self, func, callback=None):
    207         """Schedule a func to be run"""
--> 208         result = ImmediateResult(func)
    209         if callback:
    210             callback(result)

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    570         # Don't delay the application, to avoid keeping the input
    571         # arguments in memory
--> 572         self.results = batch()
    573 
    574     def get(self):

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in __call__(self)
    250         # change the default number of processes to -1
    251         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 252             return [func(*args, **kwargs)
    253                     for func, args, kwargs in self.items]
    254 

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0)
    250         # change the default number of processes to -1
    251         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 252             return [func(*args, **kwargs)
    253                     for func, args, kwargs in self.items]
    254 

~/Documents/packages/scikit-learn/sklearn/utils/fixes.py in __call__(self, *args, **kwargs)
    220     def __call__(self, *args, **kwargs):
    221         with config_context(**self.config):
--> 222             return self.function(*args, **kwargs)

~/Documents/packages/scikit-learn/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    749     with _print_elapsed_time(message_clsname, message):
    750         if hasattr(transformer, 'fit_transform'):
--> 751             res = transformer.fit_transform(X, y, **fit_params)
    752         else:
    753             res = transformer.fit(X, y, **fit_params).transform(X)

~/Documents/packages/scikit-learn/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    372         """
    373         fit_params_steps = self._check_fit_params(**fit_params)
--> 374         Xt = self._fit(X, y, **fit_params_steps)
    375 
    376         last_step = self._final_estimator

~/Documents/packages/scikit-learn/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    297                 cloned_transformer = clone(transformer)
    298             # Fit or load from cache the current transformer
--> 299             X, fitted_transformer = fit_transform_one_cached(
    300                 cloned_transformer, X, y, None,
    301                 message_clsname='Pipeline',

~/miniconda3/envs/dev/lib/python3.8/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350 
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353 
    354     def call_and_shelve(self, *args, **kwargs):

~/Documents/packages/scikit-learn/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    749     with _print_elapsed_time(message_clsname, message):
    750         if hasattr(transformer, 'fit_transform'):
--> 751             res = transformer.fit_transform(X, y, **fit_params)
    752         else:
    753             res = transformer.fit(X, y, **fit_params).transform(X)

~/Documents/packages/scikit-learn/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    718         else:
    719             # fit method of arity 2 (supervised transformation)
--> 720             return self.fit(X, y, **fit_params).transform(X)
    721 
    722 

~/Documents/packages/scikit-learn/sklearn/impute/_base.py in fit(self, X, y)
    282         self : SimpleImputer
    283         """
--> 284         X = self._validate_input(X, in_fit=True)
    285 
    286         # default fill_value is 0 for numerical input and "missing_value"

~/Documents/packages/scikit-learn/sklearn/impute/_base.py in _validate_input(self, X, in_fit)
    260         _check_inputs_dtype(X, self.missing_values)
    261         if X.dtype.kind not in ("i", "u", "f", "O"):
--> 262             raise ValueError("SimpleImputer does not support data with dtype "
    263                              "{0}. Please provide either a numeric array (with"
    264                              " a floating point or integer dtype) or "

ValueError: SimpleImputer does not support data with dtype <U6. Please provide either a numeric array (with a floating point or integer dtype) or categorical data represented either as an array with integer dtype or an array of string values with an object dtype.

@glemaitre
Member

So it might be worth merging this, but then we need to make all estimators support this dtype.

@ogrisel
Member

ogrisel commented Oct 27, 2020

@rth said above:

I have not followed the maintainability-related discussions on supporting those so far, but for strings of a few characters there is a ~50-byte overhead per PyObject containing a string. So a column of strings takes 3-5x more memory as object dtype than as a numpy array of dtype 'U'. That can be the difference between fitting in memory or not (and has a significant run-time impact).

Actually, the numpy array does not store a Python object but a reference to a Python object. For categorical variables where there is a lot of repetition this will typically be only 8 bytes per entry (+ one PyObject per unique category, outside of the array itself). This is larger than dtype('<U1'), equal to dtype('<U2') but smaller than dtype('<U3'):

>>> import numpy as np
>>> X_obj = np.tile(np.array(["a", "b", "c"], dtype=object), int(1e6))
>>> X_obj.nbytes / 1e6
24.0
>>> X_obj.itemsize
8
>>> X_str = X_obj.astype(str)
>>> X_str.dtype
dtype('<U1')
>>> X_str.nbytes / 1e6
12.0
>>> X_str.itemsize
4

So as long as the cardinality is limited, using dtype=object for categorical variables is almost always more memory efficient than unicode (or even bytes) string dtypes.

However, for free-text data (e.g. first names / last names, street names, titles, descriptions...) where the entries are rarely or never repeated, the unicode string dtype might make sense as long as the lengths are homogeneous. If the length of one entry is 10x larger than the median length, then the padding incurred by the unicode string dtype will also blow up the memory usage and the object dtype would again be more memory efficient.
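A quick sketch of that padding effect:

import numpy as np

names = ["ann", "bob", "a" * 100]        # one entry much longer than the median
X_str = np.array(names)                  # dtype='<U100': every entry is padded to 100 characters
X_obj = np.array(names, dtype=object)

print(X_str.nbytes)                      # 1200 bytes (3 * 100 chars * 4 bytes each)
print(X_obj.nbytes)                      # 24 bytes of references (the three PyObjects live outside the array)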

The only practical case where the numpy unicode / bytes string dtype could be valid would be to store fixed length string identifiers or hash digests in hexadecimal notation (as the hash size is typically larger than 64 bits). But those are typically useless variables in a machine learning context (because of the lack of repetition).

pandas 1.0 has an experimental variable-length StringDtype extension (https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html), which looks optimal for (non-repeated) text-valued columns, in addition to CategoricalDtype, which is efficient for repeated string label values; numpy has neither.
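For reference, a minimal sketch of that experimental pandas dtype (requires pandas >= 1.0):

import pandas as pd

s = pd.Series(["first", "second", None], dtype="string")  # pd.StringDtype()
print(s.dtype)                                            # string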

That being said, we could still change OneHotEncoder to accept unicode (or bytes) string dtypes for convenience, but the memory-efficiency argument for the numpy unicode dtype is almost never valid in data-science applications.

Member

@ogrisel ogrisel left a comment


I think we should definitely include a fix for the very confusing #15726. As unicode strings were not explicitly rejected in 0.23, let us do this quick fix, which is a net improvement, even if the full scikit-learn code base (e.g. ColumnTransformer) does not accept unicode strings consistently.

We can always decide later whether we want to add unicode string support to ColumnTransformer for consistency, or deprecate unicode string support in the categorical encoders in light of my comment above on dtypes. If we want better memory efficiency, we might consider a deeper integration with pandas Categorical, but that sounds like a potentially significant maintenance effort.

@ogrisel
Member

ogrisel commented Oct 27, 2020

I will fix the linting failure I caused by resolving the conflicts using the online GitHub UI...

@ogrisel
Member

ogrisel commented Oct 27, 2020

There is a more fundamental problem in my merge commit. I am investigating.

@ogrisel ogrisel merged commit 53add71 into scikit-learn:master Oct 27, 2020
@ogrisel
Member

ogrisel commented Oct 27, 2020

Merged. Thanks all!

thomasjpfan added a commit to thomasjpfan/scikit-learn that referenced this pull request Oct 28, 2020
Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>