DMatrix fails on Pandas' nullable floats #8213

Closed
Ark-kun opened this issue Aug 31, 2022 · 1 comment · Fixed by #8262

Ark-kun commented Aug 31, 2022

XGBoost: 1.6.1
Pandas: 1.4.1

Final training data information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   class                   10000 non-null  Int64   
 1   trip_seconds            9996 non-null   Int64   
 2   trip_miles              10000 non-null  Float64 
 3   pickup_community_area   9418 non-null   Int64   
 4   dropoff_community_area  9267 non-null   Int64   
 5   fare                    10000 non-null  Float64 
 6   tolls                   8920 non-null   Int64   
 7   extras                  10000 non-null  Float64 
 8   company                 10000 non-null  category
dtypes: Float64(3), Int64(5), category(1)
memory usage: 714.4 KB
Traceback (most recent call last):
  File "/tmp/tmp.ap5suk4c6n", line 116, in <module>
    _outputs = xgboost_train(**_parsed_args)
  File "/tmp/tmp.ap5suk4c6n", line 67, in xgboost_train
    training_data = xgboost.DMatrix(
  File "/usr/local/lib/python3.10/site-packages/xgboost/core.py", line 532, in inner_f
    return f(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/xgboost/core.py", line 643, in __init__
    handle, feature_names, feature_types = dispatch_data_backend(
  File "/usr/local/lib/python3.10/site-packages/xgboost/data.py", line 896, in dispatch_data_backend
    return _from_pandas_df(data, enable_categorical, missing, threads,
  File "/usr/local/lib/python3.10/site-packages/xgboost/data.py", line 345, in _from_pandas_df
    data, feature_names, feature_types = _transform_pandas_df(
  File "/usr/local/lib/python3.10/site-packages/xgboost/data.py", line 283, in _transform_pandas_df
    _invalid_dataframe_dtype(data)
  File "/usr/local/lib/python3.10/site-packages/xgboost/data.py", line 247, in _invalid_dataframe_dtype
    raise ValueError(msg)
ValueError: DataFrame.dtypes for data must be int, float, bool or category.  When
categorical type is supplied, DMatrix parameter `enable_categorical` must
be set to `True`. Invalid columns:trip_miles, fare, extras, company

#7760 only added support for nullable integers and booleans, but not floats.
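
Until nullable floats are supported, one possible workaround is to cast the `Float64` columns to plain NumPy floats before building the `DMatrix`. The sketch below is only an illustration: `cast_nullable_floats` is a hypothetical helper name, the sample frame is made up, and the assumption that the converted frame is then accepted by `xgboost.DMatrix` rests on the error message above listing `float` as a valid dtype (the `DMatrix` call itself is not run here).

```python
import numpy as np
import pandas as pd


def cast_nullable_floats(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with nullable Float32/Float64 columns cast to NumPy floats.

    Hypothetical helper, not part of xgboost or pandas.
    pd.NA entries become np.nan; all other columns are left untouched.
    """
    out = df.copy()
    for name, dtype in df.dtypes.items():
        if isinstance(dtype, pd.Float64Dtype):
            out[name] = df[name].to_numpy(dtype="float64", na_value=np.nan)
        elif isinstance(dtype, pd.Float32Dtype):
            out[name] = df[name].to_numpy(dtype="float32", na_value=np.nan)
    return out


# Made-up sample data that mirrors the dtypes in the report.
sample = pd.DataFrame(
    {
        "fare": pd.array([3.25, None, 7.5], dtype="Float64"),
        "tolls": pd.array([0, None, 2], dtype="Int64"),
        "company": pd.Categorical(["a", "b", "a"]),
    }
)
converted = cast_nullable_floats(sample)
print(converted.dtypes)
# Assumption: `converted` could now be passed on, e.g.
# xgboost.DMatrix(converted, enable_categorical=True)
```

The nullable integer columns are deliberately left alone, since #7760 is reported to cover them already.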

P.S. `_invalid_dataframe_dtype` also incorrectly lists the categorical column (`company`) as invalid, even though `enable_categorical` is set to `True`.
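
To make the expected behavior concrete, here is a rough sketch of a dtype check that only reports categorical columns when `enable_categorical` is off. It is a hypothetical re-implementation for illustration, not xgboost's actual code; the function name `unsupported_columns` and the sample frame are assumptions.

```python
import pandas as pd


def unsupported_columns(df: pd.DataFrame, enable_categorical: bool) -> list:
    """List columns a numeric-only consumer would reject (hypothetical helper)."""
    bad = []
    for name, dtype in df.dtypes.items():
        if isinstance(dtype, pd.CategoricalDtype):
            # Categorical columns are a problem only when support is disabled.
            if not enable_categorical:
                bad.append(name)
        elif not pd.api.types.is_numeric_dtype(dtype):
            # Anything that is not int, float, or bool (nullable or not).
            bad.append(name)
    return bad


# Made-up frame: a nullable float, a categorical, and an object column.
frame = pd.DataFrame(
    {
        "fare": pd.array([1.5, None], dtype="Float64"),
        "company": pd.Categorical(["a", "b"]),
        "note": ["x", "y"],
    }
)
with_categorical = unsupported_columns(frame, enable_categorical=True)
without_categorical = unsupported_columns(frame, enable_categorical=False)
print(with_categorical)     # ['note']
print(without_categorical)  # ['company', 'note']
```

With a check along these lines, the error message in the report would have named only `trip_miles`, `fare`, and `extras`.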

Ark-kun added a commit to Ark-kun/pipeline_components that referenced this issue Aug 31, 2022

trivialfis (Member) commented

Thank you for raising the issue! Would you like to open a PR for the fix?
