[python] Faster categorical column names selection #4787

Neronuser · 2021-11-09T16:55:07Z

Faster categorical column names selection

Replace the slow and redundant select_dtypes DataFrame query with a list comprehension over dataframe.dtypes.
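For illustration, here is a minimal sketch of the two approaches on a toy DataFrame. The frame and its column names are made up for this example and do not come from LightGBM; the point is only that both expressions yield the same list of column names, while the second never constructs the intermediate DataFrame that select_dtypes returns.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy data (hypothetical): two categorical columns and one integer column.
df = pd.DataFrame({
    "a": pd.Series(["x", "y", "x"], dtype="category"),
    "b": [1, 2, 3],
    "c": pd.Series([1, 2, 1], dtype="category"),
})

# Original approach: select_dtypes returns a new DataFrame restricted to the
# categorical columns, and only its column labels are used afterwards.
cat_cols_old = list(df.select_dtypes(include=["category"]).columns)

# Proposed approach: look at the per-column dtypes only.
cat_cols_new = [
    col for col, dtype in zip(df.columns, df.dtypes)
    if isinstance(dtype, CategoricalDtype)
]

print(cat_cols_old, cat_cols_new)
```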

jmoralez

Thank you for your contribution @Neronuser! Since pandas isn't a hard dependency we use a compat module to import things from it. I've suggested the changes to the basic.py file and you'd have to add the required import in

LightGBM/python-package/lightgbm/compat.py

Line 9 in b1facf5

from pandas.api.types import is_sparse as is_dtype_sparse

to add is_categorical_dtype and define it as None after

LightGBM/python-package/lightgbm/compat.py

Line 27 in b1facf5

is_dtype_sparse = None

jmoralez · 2021-11-10T01:12:22Z

python-package/lightgbm/basic.py

@@ -566,7 +567,7 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorica
            raise ValueError('Input data must be 2 dimensional and non empty.')
        if feature_name == 'auto' or feature_name is None:
            data = data.rename(columns=str)
-        cat_cols = list(data.select_dtypes(include=['category']).columns)
+        cat_cols = [col for col, dtype in zip(data.columns, data.dtypes) if isinstance(dtype, CategoricalDtype)]


Suggested change

-        cat_cols = [col for col, dtype in zip(data.columns, data.dtypes) if isinstance(dtype, CategoricalDtype)]
+        cat_cols = [col for col, dtype in zip(data.columns, data.dtypes) if is_categorical_dtype(dtype)]

Thanks @jmoralez! I updated the PR with the required changes, but instead of importing is_categorical_dtype from pandas, I created a dummy CategoricalDtype object. I did this because is_categorical_dtype also checks whether arrays have a categorical dtype, which introduces minor overhead in this case. Is that OK, or do you insist on using is_categorical_dtype?

That's ok. Thank you for the explanation!
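The dummy-object idea discussed above could look roughly like the following. This is a sketch under assumptions, not the merged LightGBM code: the flag name PANDAS_INSTALLED and the helper is_categorical are illustrative, and the stand-in class exists only so that the isinstance check stays valid when pandas is absent.

```python
try:
    from pandas.api.types import CategoricalDtype as pd_CategoricalDtype
    PANDAS_INSTALLED = True  # illustrative flag name
except ImportError:
    PANDAS_INSTALLED = False

    class pd_CategoricalDtype:
        """Dummy stand-in used only when pandas is not installed."""

        def __init__(self, *args, **kwargs):
            pass


def is_categorical(dtype):
    # Hypothetical helper: a plain isinstance check on the dtype object,
    # with no extra handling for arrays or Series.
    return isinstance(dtype, pd_CategoricalDtype)
```

Because the check operates on the dtype object alone, it avoids the additional array handling that the comment above attributes to is_categorical_dtype.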

ghost · 2021-11-10T07:22:40Z

All CLA requirements met.

jmoralez · 2021-11-10T15:25:11Z

Thank you @Neronuser, looks good to me! Gently ping @StrikerRUS for a review as well.

StrikerRUS

LGTM! Thanks for this improvement!
I also checked that CategoricalDtype has been importable in this way at least since pandas version 0.20, which was released in 2017. So, I believe we are good with backward compatibility.
https://github.com/pandas-dev/pandas/blob/0.20.x/pandas/api/types/__init__.py#L4

StrikerRUS

Sorry, I was wrong in my previous comment. CategoricalDtype can be imported directly from the pandas root only since version 0.24. For earlier versions, you have to specify the full path:

from pandas.api.types import CategoricalDtype

StrikerRUS · 2021-11-11T00:54:20Z

python-package/lightgbm/compat.py

@@ -6,6 +6,7 @@
    from pandas import DataFrame as pd_DataFrame
    from pandas import Series as pd_Series
    from pandas import concat
+    from pandas.api.types import CategoricalDtype as pd_CategoricalDtype


It's fine that this import works with the latest version. But would it make sense to also try an import like from pandas import CategoricalDtype as pd_CategoricalDtype, in case they change the internal structure of the modules in the future?
Refer to

LightGBM/python-package/lightgbm/compat.py

Lines 68 to 73 in 99e0a4b

try:
    from sklearn.exceptions import NotFittedError
    from sklearn.model_selection import GroupKFold, StratifiedKFold
except ImportError:
    from sklearn.cross_validation import GroupKFold, StratifiedKFold
    from sklearn.utils.validation import NotFittedError

WDYT?

Yes, this makes sense, thank you. I'm not sure, but it feels more likely that they will change pandas.api.types than the top-level import from pandas import CategoricalDtype, especially given that their current top-level init goes into pandas.core.api for CategoricalDtype.
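A fallback import along the lines discussed here might be sketched as follows. The ordering (top-level import first, pandas.api.types second) is one reading of the exchange above rather than a quote of the merged code, and the outer try/except that sets the name to None when pandas is missing is an assumption added so the snippet runs on its own.

```python
try:
    try:
        # Top-level location, available since pandas 0.24 per the comment above.
        from pandas import CategoricalDtype as pd_CategoricalDtype
    except ImportError:
        # Full path, needed for earlier pandas versions.
        from pandas.api.types import CategoricalDtype as pd_CategoricalDtype
except ImportError:
    # Assumption for this sketch: pandas is not installed at all.
    pd_CategoricalDtype = None
```

Either import binds the same alias, so the calling code does not need to know which location succeeded.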

StrikerRUS

LGTM, thank you very much!

boris-saivahc · 2023-06-01T20:21:21Z

I'd like to ask why this change was merged without a subsequent release. I am currently working with a large dataset that contains multiple categorical features, and the implementation in version 3.3.5 results in an unnecessary increase in memory usage. It would greatly benefit me to have this change included in a released version.

jameslamb · 2023-06-01T20:35:50Z

Subscribe to #5153 to be notified of the next release.

There's nothing specific to this change keeping it out of releases. In general, we have some challenges with maintainer availability in this project that have led to such a long delay between releases. We're trying to get a release out in the next few months, sorry for the inconvenience.

github-actions · 2023-09-06T00:18:39Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

Faster categorical column names selection (#1)

20ee631

* Faster categorical column names selection Change slow and redundant dataframe query by select_dtypes into a dataframe.dtypes list comprehension

Neronuser requested review from chivee, henry0312, hzy46, jameslamb, shiyu1994, StrikerRUS and tongwu-sh as code owners November 9, 2021 16:55

jmoralez requested changes Nov 10, 2021

Update compat with CategoricalDtype

93d750b

sort imports

5a3611f

Neronuser requested a review from jmoralez November 10, 2021 15:14

jmoralez approved these changes Nov 10, 2021

StrikerRUS approved these changes Nov 10, 2021

StrikerRUS requested changes Nov 10, 2021

import CategoricalDtype from pandas.api.types

847f71d

Neronuser requested a review from StrikerRUS November 10, 2021 19:42

StrikerRUS reviewed Nov 11, 2021

add categorical import try/except

dcefc38

Neronuser requested a review from StrikerRUS November 11, 2021 07:32

StrikerRUS approved these changes Nov 11, 2021

StrikerRUS changed the title ~~Faster categorical column names selection (#1)~~ [python] Faster categorical column names selection Nov 11, 2021

StrikerRUS added the efficiency label Nov 11, 2021

shiyu1994 merged commit 6cbb358 into microsoft:master Nov 12, 2021

StrikerRUS mentioned this pull request Jan 6, 2022

[DO NOT MERGE] Release 3.3.2 #4930

Closed

13 tasks

jameslamb mentioned this pull request Oct 7, 2022

[DO NOT MERGE] Release v3.3.3 #5525

Closed

40 tasks

github-actions bot locked as resolved and limited conversation to collaborators Sep 6, 2023
