[python] Faster categorical column names selection #4787

Merged
merged 5 commits into from
Nov 12, 2021
Changes from 4 commits
5 changes: 3 additions & 2 deletions python-package/lightgbm/basic.py
@@ -17,7 +17,8 @@
import numpy as np
import scipy.sparse

-from .compat import PANDAS_INSTALLED, concat, dt_DataTable, is_dtype_sparse, pd_DataFrame, pd_Series
+from .compat import (PANDAS_INSTALLED, concat, dt_DataTable, is_dtype_sparse, pd_CategoricalDtype, pd_DataFrame,
+                     pd_Series)
from .libpath import find_lib_path

ZERO_THRESHOLD = 1e-35
@@ -566,7 +567,7 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorica
             raise ValueError('Input data must be 2 dimensional and non empty.')
         if feature_name == 'auto' or feature_name is None:
             data = data.rename(columns=str)
-        cat_cols = list(data.select_dtypes(include=['category']).columns)
+        cat_cols = [col for col, dtype in zip(data.columns, data.dtypes) if isinstance(dtype, pd_CategoricalDtype)]
         cat_cols_not_ordered = [col for col in cat_cols if not data[col].cat.ordered]
         if pandas_categorical is None:  # train dataset
             pandas_categorical = [list(data[col].cat.categories) for col in cat_cols]
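
To make the change to cat_cols concrete, here is a minimal sketch comparing the two selection strategies on a toy frame. The helper names and the example DataFrame are hypothetical, and the sketch imports CategoricalDtype straight from pandas rather than through lightgbm's compat module. It only checks that both strategies pick the same columns; it does not measure the speedup the PR title refers to.

```python
import pandas as pd
from pandas import CategoricalDtype


def cat_cols_select_dtypes(df):
    """Old approach: build a filtered sub-frame, then read its column names."""
    return list(df.select_dtypes(include=['category']).columns)


def cat_cols_isinstance(df):
    """New approach: check each column's dtype object directly."""
    return [col for col, dtype in zip(df.columns, df.dtypes) if isinstance(dtype, CategoricalDtype)]


# Hypothetical toy frame: two categorical columns (one ordered) plus int and float columns.
df = pd.DataFrame({
    'a': pd.Categorical(['x', 'y', 'x']),
    'b': [1, 2, 3],
    'c': pd.Categorical(['lo', 'hi', 'lo'], categories=['lo', 'hi'], ordered=True),
    'd': [0.1, 0.2, 0.3],
})

old_cols = cat_cols_select_dtypes(df)
new_cols = cat_cols_isinstance(df)
print(old_cols, new_cols)
```

Both helpers should return the same list in column order, which is what the rest of _data_from_pandas relies on when it indexes data[col].cat for each name.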
7 changes: 7 additions & 0 deletions python-package/lightgbm/compat.py
@@ -6,6 +6,7 @@
     from pandas import DataFrame as pd_DataFrame
     from pandas import Series as pd_Series
     from pandas import concat
+    from pandas.api.types import CategoricalDtype as pd_CategoricalDtype
Collaborator

@StrikerRUS StrikerRUS Nov 11, 2021

It's fine that this import works with the latest version. But would it make sense to also try an import like from pandas import CategoricalDtype as pd_CategoricalDtype, in case they change the internal structure of their modules in the future?
Refer to

try:
    from sklearn.exceptions import NotFittedError
    from sklearn.model_selection import GroupKFold, StratifiedKFold
except ImportError:
    from sklearn.cross_validation import GroupKFold, StratifiedKFold
    from sklearn.utils.validation import NotFittedError

WDYT?

Contributor Author

Yes, this makes sense, thank you. I'm not sure, but it feels more likely that they will change pandas.api.types than the top-level import, from pandas import CategoricalDtype, especially given that their current top-level __init__ pulls CategoricalDtype from pandas.core.api.
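
One way the suggestion in this thread could look is sketched below. This is an illustration only, not necessarily the code that was merged: the ordering (top-level import first, pandas.api.types as the fallback) is an assumption based on the reply above, and the outer try/except mirrors the existing structure of compat.py so that the sketch also runs when pandas is not installed.

```python
try:
    try:
        # Assumed preferred location: the top-level pandas namespace.
        from pandas import CategoricalDtype as pd_CategoricalDtype
    except ImportError:
        # Assumed fallback: the location used in the diff under review.
        from pandas.api.types import CategoricalDtype as pd_CategoricalDtype
    PANDAS_INSTALLED = True
except ImportError:
    # Neither import worked, most likely because pandas is missing entirely.
    PANDAS_INSTALLED = False

    class pd_CategoricalDtype:  # type: ignore
        """Dummy class for pandas.CategoricalDtype."""

        def __init__(self, *args, **kwargs):
            pass

print(PANDAS_INSTALLED, pd_CategoricalDtype)
```

Either way, callers always get a class named pd_CategoricalDtype that is safe to pass to isinstance, regardless of which import path succeeded.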

     from pandas.api.types import is_sparse as is_dtype_sparse
     PANDAS_INSTALLED = True
 except ImportError:
@@ -23,6 +24,12 @@ class pd_DataFrame: # type: ignore
         def __init__(self, *args, **kwargs):
             pass

+    class pd_CategoricalDtype:
+        """Dummy class for pandas.CategoricalDtype."""
+
+        def __init__(self, *args, **kwargs):
+            pass
+
     concat = None
     is_dtype_sparse = None
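
A short sketch of why the fallback here is a dummy class rather than None: isinstance needs a type as its second argument, so a placeholder class keeps dtype checks valid when pandas is absent. The helper categorical_positions and the sample inputs are hypothetical and exist only for this illustration.

```python
class pd_CategoricalDtype:
    """Dummy class for pandas.CategoricalDtype (same shape as in the hunk above)."""

    def __init__(self, *args, **kwargs):
        pass


def categorical_positions(dtypes):
    """Hypothetical helper: indices of the entries that are categorical dtypes."""
    return [i for i, dtype in enumerate(dtypes) if isinstance(dtype, pd_CategoricalDtype)]


# Plain Python objects stand in for non-categorical dtypes: nothing matches, nothing raises.
no_match = categorical_positions([int, 'float64', object()])

# An instance of the dummy class is the only thing that can match.
one_match = categorical_positions([int, pd_CategoricalDtype()])

# With None as the placeholder, the same check would fail instead of returning False.
try:
    isinstance('float64', None)
    none_is_usable = True
except TypeError:
    none_is_usable = False

print(no_match, one_match, none_is_usable)
```

concat and is_dtype_sparse are callables rather than types, which is presumably why None is an acceptable placeholder for them.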
