[python-package] support pandas 2.0 #5739

Open
jameslamb opened this issue Feb 24, 2023 · 7 comments

@jameslamb
Collaborator

Summary

This week, pandas developers announced a release candidate for the next major release, v2.0.

https://twitter.com/pandas_dev/status/1628159973988483074?s=20

We are happy to announce the release candidate of pandas 2.0.0.

It can be installed from our conda-forge and PyPI packages via mamba, conda or pip, for example:

mamba install -c conda-forge/label/pandas_rc pandas==2.0.0rc0
python -m pip install --upgrade --pre pandas==2.0.0rc0

I'm going to test lightgbm's latest release (v3.3.5) and latest source (f975d3f).

Opening this issue to publicize the fact that I'm doing that work, and to link to from any PRs generated as a result of that testing.

@jameslamb
Collaborator Author

jameslamb commented Apr 7, 2023

As reported in #5823, lightgbm 3.3.5 may not support pandas 2.0.

From that report:

"When trying to import lightgbm with pandas 2.0 installed, I got the error below. However, after downgrading to pandas 1.5.3, the problem was solved."
AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'

I can investigate this tonight or tomorrow.

@jameslamb
Collaborator Author

@fermiyon I haven't been able to reproduce the error you mentioned yet. I suspect it's coming from one of lightgbm's dependencies and not lightgbm itself.

I can see that lightgbm doesn't directly reference pandas.core.strings.StringMethods.

# check on latest master
git grep StringMethod

# check on latest release (v3.3.5)
git checkout v3.3.5
git grep StringMethod

Could you please share some more details with us?

  • a full stack trace of the error you encountered
  • the results of pip freeze (or conda env export if using conda)
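
For anyone who'd rather paste a short version report alongside the full pip freeze output, here is a minimal sketch using only the standard library. The package list is an assumption based on the libraries named in this thread, and `installed_versions` is a hypothetical helper name, not something lightgbm provides.

```python
import platform
from importlib import metadata

# Packages mentioned in this thread (an assumption); extend as needed.
PACKAGES = ("lightgbm", "pandas", "dask", "distributed", "pyarrow")


def installed_versions(names=PACKAGES):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print(f"python=={platform.python_version()}")
    for name, version in installed_versions().items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

This doesn't replace the full stack trace or the complete pip freeze / conda env export output; it only shows the handful of versions most relevant here.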

@jameslamb
Collaborator Author

Alright, I was able to test lightgbm more thoroughly against pandas 2.0 tonight.

@fermiyon, I believe the error you've reported is coming from dask.dataframe.

short summary

lightgbm v3.3.5 (the latest version on PyPI) and the latest dev version here on GitHub (638014d) are both compatible with pandas 2.0.

dask<2023.2.0 is not compatible with pandas 2.0

If you encounter the following error when importing lightgbm...

AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'

... either remove dask from your environment ...

pip uninstall --yes \
    dask

... or upgrade it and distributed to a newer version

pip install \
    'dask>=2023.3.2' \
    'distributed>=2023.3.2'
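
To check whether an environment falls into the combination described above, something like the following could work. It is a rough sketch: the only boundaries it encodes are the ones stated in this comment (pandas 2.0 or newer together with dask older than 2023.2.0), and `release_tuple` and `dask_blocks_lightgbm_import` are hypothetical helper names.

```python
import re


def release_tuple(version):
    """Leading numeric part of a version string, e.g. '2.0.0rc0' -> (2, 0, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def dask_blocks_lightgbm_import(pandas_version, dask_version):
    """True for the combination reported as broken in this thread.

    Assumes the boundary stated above: pandas >= 2.0 with dask < 2023.2.0.
    Pre-release suffixes (like 'rc0') are ignored, so 2.0.0rc0 counts as 2.0.
    Pass dask_version=None when dask is not installed.
    """
    if dask_version is None:
        return False
    return (
        release_tuple(pandas_version) >= (2, 0)
        and release_tuple(dask_version) < (2023, 2, 0)
    )
```

For example, `dask_blocks_lightgbm_import("2.0.0", "2022.12.1")` is True, while the same call with dask "2023.3.2" is False.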

Cause for that incompatibility:

How I tested this

investigation
docker build -t test-lgb:local -<<EOF
FROM python:3.10
RUN apt-get update \
  && apt-get install -y \
         build-essential \
         cmake \
         graphviz \
  && pip install --upgrade pip \
  && pip install \
         cloudpickle \
         graphviz \
         matplotlib \
         numpy \
         'pandas>=2.0' \
         psutil \
         pytest \
         scikit-learn \
         scipy
EOF

v3.3.5 of the Python package (no Dask)

# test v3.3.5 of the Python package
git checkout v3.3.5
docker run \
    --rm \
    -v $(pwd):/opt/LightGBM \
    -w /opt/LightGBM \
    -it test-lgb:local \
    bash -c "
        pip install 'lightgbm==3.3.5' && \
        pytest tests/python_package_test
        "
=== 246 passed, 28 skipped, 2 xfailed, 362 warnings in 65.38s (0:01:05) ===

v3.3.5 of the Python package (with latest Dask)

git checkout v3.3.5
docker run \
    --rm \
    -v $(pwd):/opt/LightGBM \
    -w /opt/LightGBM \
    -it test-lgb:local \
    bash -c "
        pip install 'dask==2023.3.2' 'distributed==2023.3.2.1' 'lightgbm==3.3.5' && \
        pytest tests/python_package_test
        "
=== 571 passed, 53 skipped, 2 xfailed, 644 warnings in 667.78s (0:11:07) ===

v3.3.5 of the Python package (with Dask 2022.12.1)

git checkout v3.3.5
docker run \
    --rm \
    -v $(pwd):/opt/LightGBM \
    -w /opt/LightGBM \
    -it test-lgb:local \
    bash -c "
        pip install 'dask==2022.12.1' 'distributed==2022.12.1' 'lightgbm==3.3.5' && \
        pytest tests/python_package_test
        "
_________________________ ERROR collecting tests/python_package_test/test_utilities.py _________________________
tests/python_package_test/test_utilities.py:6: in <module>
    import lightgbm as lgb
/usr/local/lib/python3.10/site-packages/lightgbm/__init__.py:8: in <module>
    from .basic import Booster, Dataset, Sequence, register_logger
/usr/local/lib/python3.10/site-packages/lightgbm/basic.py:20: in <module>
    from .compat import PANDAS_INSTALLED, concat, dt_DataTable, is_dtype_sparse, pd_DataFrame, pd_Series
/usr/local/lib/python3.10/site-packages/lightgbm/compat.py:130: in <module>
    from dask.dataframe import DataFrame as dask_DataFrame
/usr/local/lib/python3.10/site-packages/dask/dataframe/__init__.py:4: in <module>
    from dask.dataframe import backends, dispatch, rolling
/usr/local/lib/python3.10/site-packages/dask/dataframe/backends.py:22: in <module>
    from dask.dataframe.core import DataFrame, Index, Scalar, Series, _Frame
/usr/local/lib/python3.10/site-packages/dask/dataframe/core.py:35: in <module>
    from dask.dataframe import methods
/usr/local/lib/python3.10/site-packages/dask/dataframe/methods.py:22: in <module>
    from dask.dataframe.utils import is_dataframe_like, is_index_like, is_series_like
/usr/local/lib/python3.10/site-packages/dask/dataframe/utils.py:19: in <module>
    from dask.dataframe import (  # noqa: F401 register pandas extension types
/usr/local/lib/python3.10/site-packages/dask/dataframe/_dtypes.py:4: in <module>
    from dask.dataframe.extensions import make_array_nonempty, make_scalar
/usr/local/lib/python3.10/site-packages/dask/dataframe/extensions.py:6: in <module>
    from dask.dataframe.accessor import (
/usr/local/lib/python3.10/site-packages/dask/dataframe/accessor.py:190: in <module>
    class StringAccessor(Accessor):
/usr/local/lib/python3.10/site-packages/dask/dataframe/accessor.py:276: in StringAccessor
    pd.core.strings.StringMethods,
E   AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'

ERROR tests/python_package_test/test_basic.py - AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
ERROR tests/python_package_test/test_consistency.py - AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
ERROR tests/python_package_test/test_dask.py - AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
ERROR tests/python_package_test/test_dual.py - AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
ERROR tests/python_package_test/test_engine.py - AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
ERROR tests/python_package_test/test_plotting.py - AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
ERROR tests/python_package_test/test_sklearn.py - AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
ERROR tests/python_package_test/test_utilities.py - AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
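
The traceback above shows the AttributeError surfacing from lightgbm/compat.py while it imports dask.dataframe. One plausible reading, assuming the optional import is guarded only by `except ImportError` (the traceback doesn't show the guard itself), is that an installed-but-broken dependency slips past the guard. The sketch below illustrates that general Python behavior with hypothetical stand-in functions; it is not lightgbm's or dask's actual code.

```python
import types


def import_optional_dependency():
    """Hypothetical stand-in for importing a package whose module body
    reads an attribute that no longer exists in one of its dependencies."""
    fake_strings_module = types.SimpleNamespace()  # has no StringMethods
    return fake_strings_module.StringMethods  # raises AttributeError


def narrow_guard():
    """Treats only a missing package as 'not installed'."""
    try:
        import_optional_dependency()
        return True
    except ImportError:
        return False


def broad_guard():
    """Also treats an AttributeError during import as 'not usable'."""
    try:
        import_optional_dependency()
        return True
    except (ImportError, AttributeError):
        return False
```

Calling `narrow_guard()` lets the AttributeError propagate to the caller, while `broad_guard()` returns False. Whether a broader guard is desirable is a separate design question, since it can hide real problems in the environment.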

latest master (no Dask)

Before running tests, I compiled the C++ library once so that the repeated re-installations of the Python package would be fast.

git checkout master
docker run \
    --rm \
    -v $(pwd):/opt/LightGBM \
    -w /opt/LightGBM \
    -it test-lgb:local \
    bash -c "rm -f ./lib_lightgbm.so && rm -rf ./build && mkdir ./build && cd ./build && cmake .. && make -j2"

Then ran the tests.

docker run \
    --rm \
    -v $(pwd):/opt/LightGBM \
    -w /opt/LightGBM \
    -it test-lgb:local \
    bash -c "
        cd ./python-package && \
        python setup.py install --precompile && \
        cd .. && \
        pytest tests/python_package_test
        "
=== 517 passed, 33 skipped, 2 xfailed, 122 warnings in 290.56s (0:04:50) ===

latest master (with latest Dask)

docker run \
    --rm \
    -v $(pwd):/opt/LightGBM \
    -w /opt/LightGBM \
    -it test-lgb:local \
    bash -c "
        cd ./python-package && \
        python setup.py install --precompile && \
        cd .. && \
        pip install 'dask==2023.3.2' 'distributed==2023.3.2.1' && \
        pytest tests/python_package_test
        "
=== 874 passed, 43 skipped, 2 xfailed, 435 warnings in 944.61s (0:15:44) ===

latest master (with Dask 2022.12.1)

docker run \
    --rm \
    -v $(pwd):/opt/LightGBM \
    -w /opt/LightGBM \
    -it test-lgb:local \
    bash -c "
        cd ./python-package && \
        python setup.py install --precompile && \
        cd .. && \
        pip install 'dask==2022.12.1' 'distributed==2022.12.1' && \
        pytest tests/python_package_test
        "
_________________________ ERROR collecting tests/python_package_test/test_utilities.py _________________________
tests/python_package_test/test_utilities.py:7: in <module>
    import lightgbm as lgb
/usr/local/lib/python3.10/site-packages/lightgbm/__init__.py:8: in <module>
    from .basic import Booster, Dataset, Sequence, register_logger
/usr/local/lib/python3.10/site-packages/lightgbm/basic.py:20: in <module>
    from .compat import PANDAS_INSTALLED, concat, dt_DataTable, pd_CategoricalDtype, pd_DataFrame, pd_Series
/usr/local/lib/python3.10/site-packages/lightgbm/compat.py:145: in <module>
    from dask.dataframe import DataFrame as dask_DataFrame
/usr/local/lib/python3.10/site-packages/dask/dataframe/__init__.py:4: in <module>
    from dask.dataframe import backends, dispatch, rolling
/usr/local/lib/python3.10/site-packages/dask/dataframe/backends.py:22: in <module>
    from dask.dataframe.core import DataFrame, Index, Scalar, Series, _Frame
/usr/local/lib/python3.10/site-packages/dask/dataframe/core.py:35: in <module>
    from dask.dataframe import methods
/usr/local/lib/python3.10/site-packages/dask/dataframe/methods.py:22: in <module>
    from dask.dataframe.utils import is_dataframe_like, is_index_like, is_series_like
/usr/local/lib/python3.10/site-packages/dask/dataframe/utils.py:19: in <module>
    from dask.dataframe import (  # noqa: F401 register pandas extension types
/usr/local/lib/python3.10/site-packages/dask/dataframe/_dtypes.py:4: in <module>
    from dask.dataframe.extensions import make_array_nonempty, make_scalar
/usr/local/lib/python3.10/site-packages/dask/dataframe/extensions.py:6: in <module>
    from dask.dataframe.accessor import (
/usr/local/lib/python3.10/site-packages/dask/dataframe/accessor.py:190: in <module>
    class StringAccessor(Accessor):
/usr/local/lib/python3.10/site-packages/dask/dataframe/accessor.py:276: in StringAccessor
    pd.core.strings.StringMethods,
E   AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'

=========================================== short test summary info ============================================
ERROR tests/python_package_test/test_basic.py - AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
ERROR tests/python_package_test/test_callback.py - AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
ERROR tests/python_package_test/test_consistency.py - AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
ERROR tests/python_package_test/test_dask.py - AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
ERROR tests/python_package_test/test_dual.py - AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
ERROR tests/python_package_test/test_engine.py - AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
ERROR tests/python_package_test/test_plotting.py - AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
ERROR tests/python_package_test/test_sklearn.py - AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
ERROR tests/python_package_test/test_utilities.py - AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'

@fermiyon

fermiyon commented Apr 8, 2023

short summary

Upgrading Dask to version 2023.3.2 solved the issue.

Solution

Upgrading Dask to version 2023.3.2 resolved the issue, as @jameslamb pointed out. I ran the following command in my terminal to upgrade Dask:

pip install --upgrade dask

After upgrading, I re-ran my code and the error was resolved.

I was using a conda 23.3.1 environment with Python 3.10.9, lightgbm 3.3.5, and pandas upgraded to 2.0.0.

@jameslamb
Collaborator Author

thanks for confirming @fermiyon !

Alright then, based on the evidence I provided above, I think lightgbm (both v3.3.5 and latest master) is compatible with pandas 2.0.0, and I'm going to close this.

Anyone finding this issue from search who thinks they've found evidence of an incompatibility, please comment here with a reproducible example.

@fermiyon

fermiyon commented Apr 12, 2023

short summary

It seems to me that LightGBM 3.3.5 doesn't support the pyarrow-backed data types available in pandas 2.0.

Code

import lightgbm as lgb
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier

# Load the Iris dataset as a pandas dataframe
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names, dtype='double[pyarrow]')
y = pd.Series(iris.target, dtype='int64[pyarrow]')

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set the hyperparameters for the model
params = {
    'objective': 'multiclass',
    'num_class': 3,
    'metric': 'multi_logloss',
    'verbose': 0,
    'random_state': 42
}

# Create the LGBMClassifier object and fit it to the training data
model = LGBMClassifier(**params)
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Calculate the accuracy score of the model
accuracy = accuracy_score(y_test, y_pred)

print('Accuracy:', accuracy)

Error

ValueError                                Traceback (most recent call last)
Cell In[23], line 27
     25 # Create the LGBMClassifier object and fit it to the training data
     26 model = LGBMClassifier(**params)
---> 27 model.fit(X_train, y_train)
     29 # Make predictions on the testing data
     30 y_pred = model.predict(X_test)

File [~/anaconda3/lib/python3.10/site-packages/lightgbm/sklearn.py:967], in LGBMClassifier.fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
    964         else:
    965             valid_sets[i] = (valid_x, self._le.transform(valid_y))
--> 967 super().fit(X, _y, sample_weight=sample_weight, init_score=init_score, eval_set=valid_sets,
    968             eval_names=eval_names, eval_sample_weight=eval_sample_weight,
    969             eval_class_weight=eval_class_weight, eval_init_score=eval_init_score,
    970             eval_metric=eval_metric, early_stopping_rounds=early_stopping_rounds,
    971             verbose=verbose, feature_name=feature_name, categorical_feature=categorical_feature,
    972             callbacks=callbacks, init_model=init_model)
    973 return self

File [~/anaconda3/lib/python3.10/site-packages/lightgbm/sklearn.py:748], in LGBMModel.fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
    745 evals_result = {}
    746 callbacks.append(record_evaluation(evals_result))
--> 748 self._Booster = train(
    749     params=params,
...
    597 data = data.values
    598 if data.dtype != np.float32 and data.dtype != np.float64:

ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in the following fields: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm)

Environment

Python version: 3.10
Pandas version: 2.0.0
OS: Mac OS Monterey 12.5
sklearn version: 1.2.2
lightgbm version: 3.3.5
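
To see which columns are affected before calling fit, a small check like this may help. It is a sketch that relies on pandas naming pyarrow-backed dtypes with a "[pyarrow]" suffix (for example "double[pyarrow]", as in the code above); `pyarrow_backed_columns` is a hypothetical helper, not part of lightgbm or pandas.

```python
def pyarrow_backed_columns(dtypes):
    """Names of columns whose dtype string ends in '[pyarrow]'.

    `dtypes` is anything with .items() yielding (column, dtype) pairs,
    such as a plain dict or a pandas DataFrame's `.dtypes`.
    """
    return [
        str(column)
        for column, dtype in dtypes.items()
        if str(dtype).endswith("[pyarrow]")
    ]


# Illustrative dtype mapping using the column names from the error above,
# plus one hypothetical NumPy-backed column.
example_dtypes = {
    "sepal length (cm)": "double[pyarrow]",
    "sepal width (cm)": "double[pyarrow]",
    "row id": "int64",
}
print(pyarrow_backed_columns(example_dtypes))
```

With a real frame this would be called as `pyarrow_backed_columns(X.dtypes)`. Casting the listed columns to NumPy-backed dtypes before fitting (for example with `X.astype("float64")`) is one possible workaround, though it has not been verified in this thread.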

@isque03

isque03 commented Jul 30, 2023

Same issue continues in LightGBM 4.0.0

Python 3.9.12 | Pandas 2.0.3 | Numpy 1.22.4 | LightGBM 4.0.0

==
File ~/miniconda3/lib/python3.9/site-packages/lightgbm/basic.py:3096, in Booster.__init__(self, params, train_set, model_file, model_str)
3089 self.set_network(
3090 machines=machines,
3091 local_listen_port=params["local_listen_port"],
3092 listen_time_out=params.get("time_out", 120),
3093 num_machines=params["num_machines"]
3094 )
3095 # construct booster object
-> 3096 train_set.construct()
3097 # copy the parameters from train_set
...
--> 661 raise ValueError('pandas dtypes must be int, float or bool.\n'
662 f'Fields with bad pandas dtypes: {", ".join(bad_pandas_dtypes)}')

ValueError: pandas dtypes must be int, float or bool.
Fields with bad pandas dtypes: total: double[pyarrow], amount: double[pyarrow], qty: double[pyarrow], percent: double[pyarrow]
