Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] Faster categorical column names selection #4787

Merged
merged 5 commits into from
Nov 12, 2021

Conversation

Neronuser
Copy link
Contributor

  • Faster categorical column names selection

Change slow and redundant dataframe query by select_dtypes into a dataframe.dtypes list comprehension

* Faster categorical column names selection

Change slow and redundant dataframe query by select_dtypes into a dataframe.dtypes list comprehension
Copy link
Collaborator

@jmoralez jmoralez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution @Neronuser! Since pandas isn't a hard dependency we use a compat module to import things from it. I've suggested the changes to the basic.py file and you'd have to add the required import in

from pandas.api.types import is_sparse as is_dtype_sparse

to add is_categorical_dtype and define it as None after
is_dtype_sparse = None

python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
@@ -566,7 +567,7 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorica
raise ValueError('Input data must be 2 dimensional and non empty.')
if feature_name == 'auto' or feature_name is None:
data = data.rename(columns=str)
cat_cols = list(data.select_dtypes(include=['category']).columns)
cat_cols = [col for col, dtype in zip(data.columns, data.dtypes) if isinstance(dtype, CategoricalDtype)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cat_cols = [col for col, dtype in zip(data.columns, data.dtypes) if isinstance(dtype, CategoricalDtype)]
cat_cols = [col for col, dtype in zip(data.columns, data.dtypes) if is_categorical_dtype(dtype)]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jmoralez! I updated the PR with required changes, but instead of importing is_categorical_dtype from pandas I create a Dummy CategoricalDtype object. This is done because is_categorical_dtype also checks if arrays have categorical dtype which introduces minor overhead for this case. Is that OK, or do you insist on using is_categorical_dtype?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's ok. Thank you for the explanation!

@ghost
Copy link

ghost commented Nov 10, 2021

CLA assistant check
All CLA requirements met.

@jmoralez
Copy link
Collaborator

Thank you @Neronuser, looks good to me! Gently ping @StrikerRUS for a review as well.

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for this improvement!
I also checked that CategoricalDtype has been available to be imported in a such way at least since pandas version 0.20 which was released in 2017. So, I believe we are good with backward compatibility.
https://github.com/pandas-dev/pandas/blob/0.20.x/pandas/api/types/__init__.py#L4

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was wrong in my previous comment. You can import CategoricalDtype directly from the root only since version 0.24. For previous versions you should specify the full path:

from pandas.api.types import CategoricalDtype

@@ -6,6 +6,7 @@
from pandas import DataFrame as pd_DataFrame
from pandas import Series as pd_Series
from pandas import concat
from pandas.api.types import CategoricalDtype as pd_CategoricalDtype
Copy link
Collaborator

@StrikerRUS StrikerRUS Nov 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine that this import works with the latest version. But will it make sense to try import like from pandas import CategoricalDtype as pd_CategoricalDtype in case they'll change internal structure of modules in the future?
Refer to

try:
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import GroupKFold, StratifiedKFold
except ImportError:
from sklearn.cross_validation import GroupKFold, StratifiedKFold
from sklearn.utils.validation import NotFittedError

WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this makes sense, thank you. Not sure, but it feels more likely that they are going to change pandas.api.types than the top-level import of their types from pandas import CategoricalDtype. Especially, given that their current top-level init goes into pandas.core.api for CategoricalDtype.

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you very much!

@StrikerRUS StrikerRUS changed the title Faster categorical column names selection (#1) [python] Faster categorical column names selection Nov 11, 2021
@shiyu1994 shiyu1994 merged commit 6cbb358 into microsoft:master Nov 12, 2021
@StrikerRUS StrikerRUS mentioned this pull request Jan 6, 2022
13 tasks
@jameslamb jameslamb mentioned this pull request Oct 7, 2022
40 tasks
@boris-saivahc
Copy link

I'd like to ask about the reason behind the merge of this change without subsequent release. I am currently dealing with a large dataset that consists of multiple categorical features. However, the implementation in version 3.3.5 results in an unnecessary increase in memory usage. It would greatly benefit me to have this change included in the released version.

@jameslamb
Copy link
Collaborator

Subscribe to #5153 to be notified of the next release.

There's nothing specific to this change keeping it out of releases...in general we have some challenges with maintainer availability in this project that have led to such a long delay between releases. We're trying to get a release out in the next few months, sorry for the inconvenience.

@github-actions
Copy link

github-actions bot commented Sep 6, 2023

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants