BUG in GPU histogram #1003

lorenzoridolfi · 2017-10-20T19:03:23Z

Environment info

Operating System: Fedora 26
CPU: I5
GPU: NVidia GTX 1060
C++/Python/R version:
Python 3.6.2
Cuda 9.0

Error Message:

[LightGBM] [Info] Number of positive: 17355, number of negative: 458814
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476169, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.048936 secs. 9 sparse feature groups.
[LightGBM] [Info] Number of positive: 17355, number of negative: 458814
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476169, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.048049 secs. 9 sparse feature groups.
[LightGBM] [Info] Number of positive: 17355, number of negative: 458814
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476169, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.039569 secs. 9 sparse feature groups.
[LightGBM] [Info] Number of positive: 17355, number of negative: 458815
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476170, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.035209 secs. 9 sparse feature groups.
[LightGBM] [Info] Number of positive: 17356, number of negative: 458815
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476171, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.040315 secs. 9 sparse feature groups.
[LightGBM] [Fatal] Bug in GPU histogram! split 8211: 11359, smaller_leaf: 9610, larger_leaf: 9960

Traceback (most recent call last):
  File "lightgbm_param.py", line 127, in <module>
    main()
  File "lightgbm_param.py", line 79, in main
    categorical_feature=cat_index_2)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/lightgbm/engine.py", line 443, in cv
    cvfolds.update(fobj=fobj)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/lightgbm/engine.py", line 244, in handlerFunction
    ret.append(getattr(booster, name)(*args, **kwargs))
  File "/usr/local/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py", line 1436, in update
    ctypes.byref(is_finished)))
  File "/usr/local/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py", line 48, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError())
lightgbm.basic.LightGBMError: b'Bug in GPU histogram! split 8211: 11359, smaller_leaf: 9610, larger_leaf: 9960\n'

Reproducible examples

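    import lightgbm as lgb

    # all_x, all_y and cat_index_2 are defined elsewhere in the
    # reporter's script (not shown here).
    params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'binary_logloss',
        'num_leaves': 31,
        'learning_rate': 0.005,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'verbose': 1,
        'device': 'gpu'
    }

    d_train = lgb.Dataset(all_x, label=all_y)

    # Cross-validation on the GPU; this is the call that raises the error.
    cv_results = lgb.cv(params,
                        d_train,
                        num_boost_round=700,
                        categorical_feature=cat_index_2)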


huanzhang12 · 2017-10-24T17:15:35Z

Thanks for reporting this problem! There might be a bug triggered by a race condition in the GPU code. I guess it is related to the feature_fraction and bagging_fraction parameters. Could you please change them to 1.0 and see which parameter causes the problem?

I would also really appreciate it if you could reproduce the problem on any public dataset, or share the dataset with me if it is not sensitive. This will greatly help me debug this issue. Thank you!
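
One way to organise the isolation test suggested above is to build one parameter set per suspect, changing a single sampling parameter to 1.0 each time. The sketch below only constructs the dictionaries; the helper name `isolation_variants` is made up for illustration, and each variant would still have to be passed to `lgb.cv` by hand.

```python
import copy

# Parameters copied from the report above.
BASE_PARAMS = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,
    'learning_rate': 0.005,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'verbose': 1,
    'device': 'gpu',
}


def isolation_variants(base, suspects=('feature_fraction', 'bagging_fraction')):
    """Return {label: params} with one suspect at a time set to 1.0, plus all at once.

    Hypothetical helper, not part of LightGBM.
    """
    variants = {}
    for name in suspects:
        params = copy.deepcopy(base)
        params[name] = 1.0
        variants['only_' + name + '_disabled'] = params
    everything = copy.deepcopy(base)
    for name in suspects:
        everything[name] = 1.0
    variants['all_disabled'] = everything
    return variants


for label, params in isolation_variants(BASE_PARAMS).items():
    print(label, params['feature_fraction'], params['bagging_fraction'])
```

According to the LightGBM parameter documentation, `bagging_fraction` only takes effect when `bagging_freq` is non-zero, so with the reported parameters it is possible that only `feature_fraction` changes the training behaviour.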

lorenzoridolfi · 2017-10-29T14:16:06Z

Hi, with these two parameters set to 1.0 the bug still happened, but it took several iterations to occur. With the old values the bug happened within very few iterations.

The source code is:
https://www.dropbox.com/s/bqj428pc5vwcpp9/lightgbm_param.py?dl=0

And the data files are:
https://www.dropbox.com/s/6lbpn54sdqn98kd/train.csv?dl=0
https://www.dropbox.com/s/lv8sam3tx415x62/test.csv?dl=0

Best Regards,
Lorenzo

huanzhang12 · 2017-11-02T04:44:51Z

@lorenzoridolfi Thank you for the detailed information on code and data! They are really helpful. I got a little bit busy recently but I will try to catch this bug as quickly as I can.

lorenzoridolfi · 2017-11-16T17:12:27Z

Any news about this bug? It's almost a month!

Thank you,
Lorenzo

Laurae2 · 2017-11-22T19:04:35Z

ping @huanzhang12 if you have any news

huanzhang12 · 2017-11-22T22:44:38Z

Sorry, I got crazily busy recently and did not get a chance to look into this bug. I will try to work on this during the Thanksgiving holiday. Thanks for your understanding!

mjaysonnn · 2017-11-23T11:04:24Z

Is this bug related to the bin size error? For example, when I use the GPU version of LightGBM, I get this error:

"bin size 16855 cannot run on GPU"

guolinke · 2017-12-14T09:38:24Z

@mjaysonnn
The GPU version cannot support categorical features with high cardinality.
You can work around it by splitting one categorical feature into multiple categorical features.
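
A minimal sketch of the splitting workaround, assuming the categories are already encoded as non-negative integers. It breaks each code into a "high" and a "low" digit so that each derived column has at most `base` distinct values. The function name and the choice of `base=256` are illustrative assumptions (256 simply mirrors the kernel size in the logs above); the actual limit of the GPU learner is not stated in this thread.

```python
def split_categorical(codes, base=256):
    """Split non-negative integer category codes into (high, low) columns.

    Hypothetical helper: `low` always has at most `base` distinct values,
    and `high` does too as long as every code is below base * base.
    """
    high, low = [], []
    for code in codes:
        if code < 0:
            raise ValueError('category codes must be non-negative')
        h, l = divmod(code, base)
        high.append(h)
        low.append(l)
    return high, low


codes = [0, 255, 256, 16854]
high, low = split_categorical(codes)
print(high)
print(low)
```

The two derived columns are not equivalent to the original feature from the model's point of view, so accuracy should be re-checked after the change.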

mjmckp · 2018-07-23T03:50:21Z

I am also getting this error, using the latest version of LightGBM:

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 165885
[LightGBM] [Info] Number of data: 4561756, number of used features: 658
[LightGBM] [Info] Using GPU Device: GeForce GTX 1080 Ti, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (2836.48 MB) transfered to GPU in 1.880998 secs. 7 sparse feature groups
[LightGBM] [Info] Start training from score 0.466854
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.932567 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.892417 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.904232 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.882715 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.908154 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.907432 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.875952 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.865907 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.862585 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.892193 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.891429 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.915810 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.896881 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.894748 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.934688 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.908344 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.879211 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.877866 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.889270 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.850099 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.931063 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.910245 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.856059 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.905356 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.881266 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.867991 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.876403 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.873414 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.899580 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.908517 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.881924 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.907277 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.874490 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.889323 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.890838 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.871581 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.875327 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.885586 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.895304 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.895696 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.935202 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.870557 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.865199 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.912166 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.891637 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.882426 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.894549 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.854855 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.863332 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.881228 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.875864 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.885681 secs. 7 sparse feature groups
[LightGBM] [Fatal] Bug in GPU histogram! split 139388: 62131, smaller_leaf: 62132, larger_leaf: 139387

huanzhang12 · 2018-07-23T10:41:24Z

@mjmckp Could you please provide the dataset and the python/shell script you used to reproduce this error? This will be really helpful for me to debug this issue.

I tried to reproduce the bug with the dataset and code provided by @lorenzoridolfi but I cannot reproduce it on three different machines. I tried different feature_fraction and bagging_fraction values but still cannot make the bug appear. @lorenzoridolfi Could you please try the latest LightGBM and see if you are still encountering the same error?

mjmckp · 2018-07-24T06:06:23Z

Self-contained repro here: https://www.dropbox.com/sh/9f9u7wm5ithfjbr/AADcQ6k8yDSkA3J3vYqsg4Hta?dl=0

Unzip the file dataset.zip and run lightgbm.exe config=repro.conf, console output is in output.txt.

I am running with:

LightGBM built from the current master branch (8ce2a232e907d518979e7105842ae575a7427377)
Windows 10 Professional
NVidia GTX 1080 Ti

huanzhang12 · 2018-07-24T09:50:26Z

@mjmckp Thank you for providing the dataset and config files! I still cannot reproduce this problem on AMD and NVIDIA GPUs on my machines. However, I did observe a GPU hang on an Intel integrated GPU, which was not tested thoroughly before.

There might be a bug with max_bin=255. Could you please try to use max_bin=63 and see if this bug still occurs (make sure the log says Compiling OpenCL Kernel with 64 bins). If it disappears, I will investigate the OpenCL kernel for 256 bins carefully.
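
The log check requested above can be automated with a small script. This is a sketch that scans captured console output for the kernel compilation message; the function names are made up for illustration, and the message format is taken from the logs quoted in this thread.

```python
import re

# Matches lines such as:
# [LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
KERNEL_RE = re.compile(r'Compiling OpenCL Kernel with (\d+) bins')


def kernel_bin_counts(log_text):
    """Return the bin count of every OpenCL kernel compilation found in the log."""
    return [int(m.group(1)) for m in KERNEL_RE.finditer(log_text)]


def uses_expected_kernel(log_text, expected_bins=64):
    """True only if at least one kernel was compiled and all used expected_bins."""
    counts = kernel_bin_counts(log_text)
    return bool(counts) and all(c == expected_bins for c in counts)


sample = (
    '[LightGBM] [Info] Using GPU Device: GeForce GTX 1080 Ti, Vendor: NVIDIA Corporation\n'
    '[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...\n'
)
print(kernel_bin_counts(sample))
print(uses_expected_kernel(sample))
```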

@mjmckp Another possibility is here: https://github.com/Microsoft/LightGBM/blob/master/src/treelearner/gpu_tree_learner.cpp#L119
If changing max_bin=63 does not work, could you please also try uncommenting this line (return 0;) to make GetNumWorkgroupsPerFeature return 0?

mjmckp · 2018-07-24T11:08:55Z

After setting max_bin=63 (both when creating the dataset and the trainer), I still get Compiling OpenCL Kernel with 256 bins..., how could this be?

huanzhang12 · 2018-07-24T11:10:22Z

@mjmckp you need to delete the binary training file and regenerate it using save_binary=true
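
A small sketch of the clean-up step described above, so that a changed max_bin is not silently ignored in favour of an older cached dataset. It assumes the binary cache sits next to the data file as `<data file>.bin`, which appears to be the default naming when save_binary=true is used; check your own directory before relying on it. The helper name is hypothetical.

```python
import os


def remove_stale_binary(data_path, suffix='.bin'):
    """Delete the cached binary dataset next to data_path, if there is one.

    Returns True when a file was removed, False when nothing was found.
    The '<data_path>.bin' naming is an assumption, not confirmed in this thread.
    """
    binary_path = data_path + suffix
    if os.path.isfile(binary_path):
        os.remove(binary_path)
        return True
    return False
```

Calling `remove_stale_binary(...)` with the training file path before re-running with the new max_bin and save_binary=true should force the dataset to be rebuilt.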

mjmckp · 2018-07-24T12:08:19Z

Ok, thanks. Setting max_bin=63 also fails with the same error. I have updated the dropbox directory above with two new files:

output2.txt: console output
dataset2.zip: new dataset saved with max_bin=63

Btw, when trying to debug this, I tried using LightGBM compiled with #define GPU_DEBUG_COMPARE uncommented in gpu_tree_learner.cpp, however this generates an access violation. I also tried setting #define GPU_DEBUG 4, however this generates some compile errors and also runtime errors after working around the compile errors...

mjmckp · 2018-07-25T03:28:41Z

I also tried altering GetNumWorkgroupsPerFeature to return 0, and got the same exception.

huanzhang12 · 2018-07-25T05:03:17Z

@mjmckp Thank you for providing the new dataset and trying to debug this problem! Unfortunately, I still cannot reproduce the problem with max_bin=63. However, I fixed the GPU debugging mechanism. You can apply the patch here:
https://gist.github.com/huanzhang12/f4f462c56b1920c8e59f3c729e124447
and then #define GPU_DEBUG_COMPARE should work.

huanzhang12 · 2018-07-25T10:56:56Z

@mjmckp you can also try this branch and see if it fixes it:
https://github.com/Microsoft/LightGBM/tree/gpu_fix
I added a few more boundary checks in the GPU code, but I am not sure if this is the problem.

mjmckp · 2018-07-25T11:34:53Z

Thanks. Btw, I added output.zip to the Dropbox directory which contains the console output when run with the patch you gave me (using the second data set with max_bin=63). It contains several failures.


huanzhang12 · 2018-07-25T20:07:23Z

@mjmckp Thank you for the very detailed debugging log! It seems some counter values are off by 1, however I still have no clue why this happens...

@mjmckp Is the error deterministic (occurring at the same iteration with the same wrong value each time), or is it random? Could you also try to reduce the dataset size and find a minimal dataset that can reproduce this error? Thanks!
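
To compare failures across runs, the four counters in the fatal message can be pulled out and checked mechanically. The sketch below assumes that the first pair ("split A: B") holds the per-side data counts the chosen split expected and that the second pair holds the counts the two leaves actually contain; that reading is an interpretation and is not confirmed in this thread. The function name is made up for illustration.

```python
import re

FATAL_RE = re.compile(
    r'Bug in GPU histogram! split (\d+): (\d+), '
    r'smaller_leaf: (\d+), larger_leaf: (\d+)'
)


def parse_histogram_bug(line):
    """Extract the four counters from a 'Bug in GPU histogram!' message.

    Returns None when the line does not contain the message.
    """
    match = FATAL_RE.search(line)
    if match is None:
        return None
    split_a, split_b, smaller, larger = (int(g) for g in match.groups())
    # Sort both pairs so that the smaller side is compared with the smaller side.
    split_counts = sorted((split_a, split_b))
    leaf_counts = sorted((smaller, larger))
    return {
        'split_counts': split_counts,
        'leaf_counts': leaf_counts,
        'totals_match': sum(split_counts) == sum(leaf_counts),
        'differences': [s - l for s, l in zip(split_counts, leaf_counts)],
    }


messages = [
    '[LightGBM] [Fatal] Bug in GPU histogram! split 139388: 62131, smaller_leaf: 62132, larger_leaf: 139387',
    '[LightGBM] [Fatal] Bug in GPU histogram! split 25: 31, smaller_leaf: 9, larger_leaf: 11',
]
for message in messages:
    print(parse_histogram_bug(message))
```

For the GTX 1080 Ti failure quoted in this thread, both pairs sum to 201519 and each side is off by exactly one. For the CI failure the totals differ (56 versus 20), which hints that the two reports may not share a cause.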

mjmckp · 2018-07-25T20:12:01Z

I ran it again using a build from the gpu_fix branch, which fails almost immediately (instead of after a while like before). The output is in output3.txt in the dropbox folder.

mjmckp · 2018-07-25T22:43:04Z

The file outputs.zip in the Dropbox directory contains the console output from 3 identical runs, using LightGBM compiled from the gpu_fix branch. A diff on the files shows that the program always fails at the same point, however there are small numerical differences in the calculations leading up to this point.
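
The comparison of repeated runs described above can be sketched as a small script that reports the first line where two console logs diverge. Transfer timings differ on every run, so the sketch masks them before comparing; the helper names and the masking rule are illustrative assumptions based on the log format quoted in this thread.

```python
import re
from itertools import zip_longest

# Timings such as "in 0.880998 secs" change on every run, so mask them.
TIMING_RE = re.compile(r'in \d+\.\d+ secs')


def normalise(line):
    """Strip the trailing newline and replace run-specific timings."""
    return TIMING_RE.sub('in <t> secs', line.rstrip('\n'))


def first_divergence(lines_a, lines_b):
    """Return (line_number, line_a, line_b) for the first difference, or None.

    If one log is shorter, the missing side is reported as None.
    """
    for number, (a, b) in enumerate(zip_longest(lines_a, lines_b), start=1):
        na = None if a is None else normalise(a)
        nb = None if b is None else normalise(b)
        if na != nb:
            return number, a, b
    return None


run_a = [
    '[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.932567 secs. 7 sparse feature groups',
    '[LightGBM] [Info] Start training from score 0.466854',
]
run_b = [
    '[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.892417 secs. 7 sparse feature groups',
    '[LightGBM] [Info] Start training from score 0.466855',
]
print(first_divergence(run_a, run_b))
```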

huanzhang12 · 2018-07-25T23:20:15Z

@mjmckp I found that my fix actually introduces another bug, and I just fixed that in the gpu_fix branch.
Could you please re-run training and collect console outputs? Thanks!

Laurae2 · 2018-08-15T18:01:54Z

@mjmckp any news?

mjmckp · 2018-08-28T22:48:35Z

@huanzhang12 It turns out this was an issue with a faulty GPU, this issue can be closed now IMO

huanzhang12 · 2018-09-16T08:38:00Z

@mjmckp Thank you for reporting back that the issue was actually caused by a faulty GPU! LightGBM seems to be a good candidate for GPU stability testing :)
@lorenzoridolfi Are you still encountering this issue? Could you try replacing the GPU and see if it still occurs?

jjdelvalle · 2018-09-19T13:18:01Z

@guolinke You mentioned in this issue that high cardinality variables are an issue for GPUs. Is there a way LightGBM could display which variable specifically is giving it problems? Alternatively, how does one check the cardinality of variables? I'm unsure what is meant by that... simply the number of unique categorical values?

guolinke · 2018-09-19T15:44:22Z

@clinchergt yeah, it is the number of unique categorical values.
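
A minimal, standard-library-only sketch of that check: count the distinct values per column and list the columns above a threshold. The helper names and the `limit=256` default are illustrative assumptions (the exact limit of the GPU learner is not stated in this thread), and only `None` is treated as missing here. With pandas, `df[col].nunique()` gives the same kind of count.

```python
def cardinality(columns):
    """Map each column name to its number of distinct non-missing values."""
    return {
        name: len({value for value in values if value is not None})
        for name, values in columns.items()
    }


def over_limit(columns, limit=256):
    """Return the sorted names of columns with more than `limit` distinct values."""
    return sorted(name for name, n in cardinality(columns).items() if n > limit)


data = {
    'city': ['a', 'b', 'a', None],
    'user_id': list(range(1000)),
}
print(cardinality(data))
print(over_limit(data))
```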

jjdelvalle · 2018-09-19T16:02:30Z

@guolinke How is the number of bins determined? Is it directly correlated with the unique categorical values? How can I determine how many bins a specific variable is gonna need?

StrikerRUS · 2018-12-04T12:31:11Z

@huanzhang12 What is the fate of the gpu_fix branch? Can this issue be closed?

StrikerRUS · 2019-01-28T13:24:42Z

@huanzhang12 It seems that someone removed the gpu_fix branch...

huanzhang12 · 2019-01-30T09:25:01Z

@StrikerRUS Yes, that branch should be deleted. This issue can now be closed. If a new problem arises, a new issue can be opened.

StrikerRUS · 2019-05-09T21:38:12Z

@huanzhang12 I've just caught the same error in our CI docker. Just switched compiler from gcc to clang here

LightGBM/.vsts-ci.yml

Line 13 in abbbbd7

COMPILER: gcc

Docker:
https://github.com/microsoft/LightGBM/blob/40e3048f6185bb8f3f50bd9fe7275cf514b03b16/.ci/dockers/ubuntu-14.04/Dockerfile

https://hub.docker.com/r/lightgbm/vsts-agent
Steps to reproduce:

LightGBM/.vsts-ci.yml

Lines 40 to 57 in 40e3048

    
    steps:
    - script: |
        echo "##vso[task.setvariable variable=HOME_DIRECTORY]$AGENT_HOMEDIRECTORY"
        echo "##vso[task.setvariable variable=BUILD_DIRECTORY]$BUILD_SOURCESDIRECTORY"
        echo "##vso[task.setvariable variable=OS_NAME]linux"
        echo "##vso[task.setvariable variable=AZURE]true"
        echo "##vso[task.setvariable variable=LGB_VER]$(head -n 1 VERSION.txt)"
        echo "##vso[task.prependpath]$CONDA/bin"
        AMDAPPSDK_PATH=$BUILD_SOURCESDIRECTORY/AMDAPPSDK
        echo "##vso[task.setvariable variable=AMDAPPSDK_PATH]$AMDAPPSDK_PATH"
        LD_LIBRARY_PATH=$AMDAPPSDK_PATH/lib/x86_64:$LD_LIBRARY_PATH
        echo "##vso[task.setvariable variable=LD_LIBRARY_PATH]$LD_LIBRARY_PATH"
        echo "##vso[task.setvariable variable=OPENCL_VENDOR_PATH]$AMDAPPSDK_PATH/etc/OpenCL/vendors"
      displayName: 'Set variables'
    - bash: $(Build.SourcesDirectory)/.ci/setup.sh
      displayName: Setup
    - bash: $(Build.SourcesDirectory)/.ci/test.sh
      displayName: Test

Logs can be found here: https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=2045
Essential part of the log:

============================= test session starts ==============================
platform linux -- Python 3.6.8, pytest-4.4.1, py-1.8.0, pluggy-0.9.0
rootdir: /__w/1/s
collected 77 items

../tests/c_api_test/test_.py ..                                          [  2%]
../tests/python_package_test/test_basic.py ........F...                  [ 18%]
../tests/python_package_test/test_consistency.py ....                    [ 23%]
../tests/python_package_test/test_engine.py ............................ [ 59%]
.......                                                                  [ 68%]
../tests/python_package_test/test_plotting.py .....                      [ 75%]
../tests/python_package_test/test_sklearn.py ...................         [100%]

=================================== FAILURES ===================================
____________________ TestBasic.test_cegb_scaling_equalities ____________________

self = <test_basic.TestBasic testMethod=test_cegb_scaling_equalities>

    def test_cegb_scaling_equalities(self):
        X = np.random.random((1000, 5))
        X[:, [1, 3]] = 0
        y = np.random.random(1000)
        names = ['col_%d' % i for i in range(5)]
        ds = lgb.Dataset(X, feature_name=names).construct()
        ds.set_label(y)
        # Compare pairs of penalties, to ensure scaling works as intended
        pairs = [({'cegb_penalty_feature_coupled': [1, 2, 1, 2, 1]},
                  {'cegb_penalty_feature_coupled': [0.5, 1, 0.5, 1, 0.5], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_feature_lazy': [0.01, 0.02, 0.03, 0.04, 0.05]},
                  {'cegb_penalty_feature_lazy': [0.005, 0.01, 0.015, 0.02, 0.025], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_split': 1},
                  {'cegb_penalty_split': 2, 'cegb_tradeoff': 0.5})]
        for (p1, p2) in pairs:
            booster1 = lgb.Booster(train_set=ds, params=p1)
            booster2 = lgb.Booster(train_set=ds, params=p2)
            for k in range(10):
>               booster1.update()

../tests/python_package_test/test_basic.py:268: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/vsts_azpcontainer/.local/lib/python3.6/site-packages/lightgbm/basic.py:1885: in update
    ctypes.byref(is_finished)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ret = -1

    def _safe_call(ret):
        """Check the return value from C API call.
    
        Parameters
        ----------
        ret : int
            The return value from C API calls.
        """
        if ret != 0:
>           raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
E           lightgbm.basic.LightGBMError: Bug in GPU histogram! split 25: 31, smaller_leaf: 9, larger_leaf: 11

/home/vsts_azpcontainer/.local/lib/python3.6/site-packages/lightgbm/basic.py:47: LightGBMError
----------------------------- Captured stdout call -----------------------------
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000079 secs. 0 sparse feature groups
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000077 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.512702
[LightGBM] [Info] Start training from score 0.512702
----------------------------- Captured stderr call -----------------------------
[LightGBM] [Fatal] Bug in GPU histogram! split 25: 31, smaller_leaf: 9, larger_leaf: 11

Re-running the CI job made the error vanish. Strange...

StrikerRUS · 2019-05-14T22:40:42Z

Caught this error again today, but with gcc this time:

============================= test session starts ==============================
platform linux -- Python 3.6.8, pytest-4.4.2, py-1.8.0, pluggy-0.11.0
rootdir: /__w/1/s
collected 77 items

../tests/c_api_test/test_.py ..                                          [  2%]
../tests/python_package_test/test_basic.py ........F...                  [ 18%]
../tests/python_package_test/test_consistency.py ....                    [ 23%]
../tests/python_package_test/test_engine.py ............................ [ 59%]
.......                                                                  [ 68%]
../tests/python_package_test/test_plotting.py .....                      [ 75%]
../tests/python_package_test/test_sklearn.py ...................         [100%]

=================================== FAILURES ===================================
____________________ TestBasic.test_cegb_scaling_equalities ____________________

self = <test_basic.TestBasic testMethod=test_cegb_scaling_equalities>

    def test_cegb_scaling_equalities(self):
        X = np.random.random((1000, 5))
        X[:, [1, 3]] = 0
        y = np.random.random(1000)
        names = ['col_%d' % i for i in range(5)]
        ds = lgb.Dataset(X, feature_name=names).construct()
        ds.set_label(y)
        # Compare pairs of penalties, to ensure scaling works as intended
        pairs = [({'cegb_penalty_feature_coupled': [1, 2, 1, 2, 1]},
                  {'cegb_penalty_feature_coupled': [0.5, 1, 0.5, 1, 0.5], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_feature_lazy': [0.01, 0.02, 0.03, 0.04, 0.05]},
                  {'cegb_penalty_feature_lazy': [0.005, 0.01, 0.015, 0.02, 0.025], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_split': 1},
                  {'cegb_penalty_split': 2, 'cegb_tradeoff': 0.5})]
        for (p1, p2) in pairs:
            booster1 = lgb.Booster(train_set=ds, params=p1)
            booster2 = lgb.Booster(train_set=ds, params=p2)
            for k in range(10):
>               booster1.update()

../tests/python_package_test/test_basic.py:268: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/vsts_azpcontainer/.local/lib/python3.6/site-packages/lightgbm/basic.py:1885: in update
    ctypes.byref(is_finished)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ret = -1

    def _safe_call(ret):
        """Check the return value from C API call.
    
        Parameters
        ----------
        ret : int
            The return value from C API calls.
        """
        if ret != 0:
>           raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
E           lightgbm.basic.LightGBMError: Bug in GPU histogram! split 21: 22, smaller_leaf: 12, larger_leaf: 10

/home/vsts_azpcontainer/.local/lib/python3.6/site-packages/lightgbm/basic.py:47: LightGBMError
----------------------------- Captured stdout call -----------------------------
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000191 secs. 0 sparse feature groups
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000092 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.509289
----------------------------- Captured stderr call -----------------------------
[LightGBM] [Fatal] Bug in GPU histogram! split 21: 22, smaller_leaf: 12, larger_leaf: 10

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=2107

ashrith · 2019-05-24T03:59:23Z

@huanzhang12 Hi Huan, I am getting the same error when I run LightGBM on an NVIDIA RTX 2080 Ti. The error is as follows:

➜  higgs /home/bartha/LightGBM/lightgbm config=lightgbm_gpu.conf data=higgs.train valid=higgs.test objective=binary metric=auc save_binary=true
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Saving data to binary file higgs.train.bin
[LightGBM] [Info] Saving data to binary file higgs.test.bin
[LightGBM] [Info] Finished loading data in 13.653178 seconds
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Number of positive: 5564616, number of negative: 4935384
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1535
[LightGBM] [Info] Number of data: 10500000, number of used features: 28
[LightGBM] [Info] Using requested OpenCL platform 0 device 0
[LightGBM] [Info] Using GPU Device: GeForce RTX 2080 Ti, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 64 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 28 dense feature groups (280.38 MB) transferred to GPU in 0.290521 secs. 0 sparse feature groups
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.529963 -> initscore=0.119997
[LightGBM] [Info] Start training from score 0.119997
[LightGBM] [Fatal] Bug in GPU histogram! split 1157984: 721078, smaller_leaf: 721073, larger_leaf: 1157989

Met Exceptions:
Bug in GPU histogram! split 1157984: 721078, smaller_leaf: 721073, larger_leaf: 1157989

Please let me know if you need more information; I would be happy to help.
I am using the Higgs dataset, and my config is as follows:

verbosity = 2
max_bin = 63
num_leaves = 255
num_iterations = 50
learning_rate = 0.1
tree_learner = serial
task = train
is_training_metric = false
min_data_in_leaf = 1
min_sum_hessian_in_leaf = 100
ndcg_eval_at = 1,3,5,10
sparse_threshold = 1.0
device = gpu
gpu_platform_id = 0
gpu_device_id = 0
num_threads=32

It works perfectly fine when I run on a CPU, but fails on GPUs.
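
The four numbers in the fatal message from the report above can be compared mechanically. The sketch below is only a reading aid: `parse_histogram_error` is a hypothetical helper, not part of LightGBM, and reading the first pair as the counts recorded for the split and the second pair as the sizes of the resulting leaves is an assumption, since the message itself does not define them.

```python
import re

# Matches the text of the "[Fatal]" line quoted in this thread.
_PATTERN = re.compile(
    r"Bug in GPU histogram! split (\d+): (\d+), "
    r"smaller_leaf: (\d+), larger_leaf: (\d+)"
)


def parse_histogram_error(message):
    """Extract the four counts from the fatal message, or return None."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    first, second, smaller, larger = map(int, match.groups())
    return {
        "split": (first, second),        # assumed: counts recorded for the split
        "leaves": (smaller, larger),     # assumed: sizes of the two leaves
        "split_total": first + second,
        "leaf_total": smaller + larger,
    }


info = parse_histogram_error(
    "[LightGBM] [Fatal] Bug in GPU histogram! split 1157984: 721078, "
    "smaller_leaf: 721073, larger_leaf: 1157989"
)
print(info["split_total"], info["leaf_total"])
```

For the message in this report the two totals are equal (1879062 each), while the smaller and larger counts each differ by 5 between the two pairs. For the CI failure `split 21: 22, smaller_leaf: 12, larger_leaf: 10` the totals do not match (43 versus 22), so the two reports may not share the same pattern.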

StrikerRUS · 2019-06-17T11:50:39Z

Another one: https://lightgbm-ci.visualstudio.com/lightgbm-ci/_build/results?buildId=2380

And again in test_cegb_scaling_equalities test.

____________________ TestBasic.test_cegb_scaling_equalities ____________________

self = <test_basic.TestBasic testMethod=test_cegb_scaling_equalities>

    def test_cegb_scaling_equalities(self):
        X = np.random.random((1000, 5))
        X[:, [1, 3]] = 0
        y = np.random.random(1000)
        names = ['col_%d' % i for i in range(5)]
        ds = lgb.Dataset(X, feature_name=names).construct()
        ds.set_label(y)
        # Compare pairs of penalties, to ensure scaling works as intended
        pairs = [({'cegb_penalty_feature_coupled': [1, 2, 1, 2, 1]},
                  {'cegb_penalty_feature_coupled': [0.5, 1, 0.5, 1, 0.5], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_feature_lazy': [0.01, 0.02, 0.03, 0.04, 0.05]},
                  {'cegb_penalty_feature_lazy': [0.005, 0.01, 0.015, 0.02, 0.025], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_split': 1},
                  {'cegb_penalty_split': 2, 'cegb_tradeoff': 0.5})]
        for (p1, p2) in pairs:
            booster1 = lgb.Booster(train_set=ds, params=p1)
            booster2 = lgb.Booster(train_set=ds, params=p2)
            for k in range(10):
>               booster1.update()

../tests/python_package_test/test_basic.py:268: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/vsts_azpcontainer/.local/lib/python3.6/site-packages/lightgbm/basic.py:1896: in update
    ctypes.byref(is_finished)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ret = -1

    def _safe_call(ret):
        """Check the return value from C API call.
    
        Parameters
        ----------
        ret : int
            The return value from C API calls.
        """
        if ret != 0:
>           raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
E           lightgbm.basic.LightGBMError: Bug in GPU histogram! split 20: 20, smaller_leaf: 13, larger_leaf: 19

/home/vsts_azpcontainer/.local/lib/python3.6/site-packages/lightgbm/basic.py:47: LightGBMError
----------------------------- Captured stdout call -----------------------------
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000043 secs. 0 sparse feature groups
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000075 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.499788
[LightGBM] [Info] Start training from score 0.499788
----------------------------- Captured stderr call -----------------------------
[LightGBM] [Fatal] Bug in GPU histogram! split 20: 20, smaller_leaf: 13, larger_leaf: 19

@huanzhang12 Can you please take a look at that test?

LightGBM/tests/python_package_test/test_basic.py

Line 250 in 21e356d

def test_cegb_scaling_equalities(self):

huanzhang12 · 2019-06-17T11:56:08Z

It is weird that such a simple test fails, especially since it never failed before. I will take a look at this, but my schedule has been very busy recently, so I probably cannot fix it immediately.

StrikerRUS · 2019-06-17T13:48:38Z

@huanzhang12 Thanks a lot! It's quite weird that the bug happens very rarely but always in the same test. CEGB and the corresponding failing test were introduced in #2014.
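
Because the test draws its data with unseeded `np.random.random`, a failing run cannot be replayed as is. One possible way to pin down a rare failure, sketched here under assumptions, is to rerun the test body with explicit seeds and record which seeds raise. `find_failing_seeds` and `make_cegb_data` are hypothetical helpers, not part of the LightGBM test suite.

```python
import numpy as np


def find_failing_seeds(run_once, seeds):
    """Call run_once(rng) once per seed; return [(seed, message)] for runs that raised."""
    failures = []
    for seed in seeds:
        rng = np.random.RandomState(seed)
        try:
            run_once(rng)
        except Exception as exc:  # LightGBMError is a subclass of Exception
            failures.append((seed, str(exc)))
    return failures


def make_cegb_data(rng, n_rows=1000, n_cols=5):
    """Same data shape as the failing test, but drawn from a seeded generator."""
    X = rng.random_sample((n_rows, n_cols))
    X[:, [1, 3]] = 0  # the test zeroes these two columns
    y = rng.random_sample(n_rows)
    return X, y
```

Inside `run_once`, one would build the `lgb.Dataset` from `make_cegb_data(rng)` and call `booster.update()` as the test does. That part needs a GPU-enabled build and is left out here.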

StrikerRUS · 2019-08-29T09:26:13Z

Happened again yesterday after a long break.
https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=2821

=================================== FAILURES ===================================
____________________ TestBasic.test_cegb_scaling_equalities ____________________

self = <test_basic.TestBasic testMethod=test_cegb_scaling_equalities>

    def test_cegb_scaling_equalities(self):
        X = np.random.random((1000, 5))
        X[:, [1, 3]] = 0
        y = np.random.random(1000)
        names = ['col_%d' % i for i in range(5)]
        ds = lgb.Dataset(X, feature_name=names).construct()
        ds.set_label(y)
        # Compare pairs of penalties, to ensure scaling works as intended
        pairs = [({'cegb_penalty_feature_coupled': [1, 2, 1, 2, 1]},
                  {'cegb_penalty_feature_coupled': [0.5, 1, 0.5, 1, 0.5], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_feature_lazy': [0.01, 0.02, 0.03, 0.04, 0.05]},
                  {'cegb_penalty_feature_lazy': [0.005, 0.01, 0.015, 0.02, 0.025], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_split': 1},
                  {'cegb_penalty_split': 2, 'cegb_tradeoff': 0.5})]
        for (p1, p2) in pairs:
            booster1 = lgb.Booster(train_set=ds, params=p1)
            booster2 = lgb.Booster(train_set=ds, params=p2)
            for k in range(10):
>               booster1.update()

../tests/python_package_test/test_basic.py:268: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/vsts_azpcontainer/.local/lib/python3.6/site-packages/lightgbm/basic.py:1926: in update
    ctypes.byref(is_finished)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ret = -1

    def _safe_call(ret):
        """Check the return value from C API call.
    
        Parameters
        ----------
        ret : int
            The return value from C API calls.
        """
        if ret != 0:
>           raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
E           lightgbm.basic.LightGBMError: Bug in GPU histogram! split 21: 27, smaller_leaf: 7, larger_leaf: 17

/home/vsts_azpcontainer/.local/lib/python3.6/site-packages/lightgbm/basic.py:47: LightGBMError
----------------------------- Captured stdout call -----------------------------
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000109 secs. 0 sparse feature groups
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000114 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.506630
----------------------------- Captured stderr call -----------------------------
[LightGBM] [Fatal] Bug in GPU histogram! split 21: 27, smaller_leaf: 7, larger_leaf: 17

StrikerRUS · 2019-10-01T23:09:50Z

@huanzhang12 Any success? Happened today one more time on Travis:

____________________ TestBasic.test_cegb_scaling_equalities ____________________
self = <test_basic.TestBasic testMethod=test_cegb_scaling_equalities>
    def test_cegb_scaling_equalities(self):
        X = np.random.random((1000, 5))
        X[:, [1, 3]] = 0
        y = np.random.random(1000)
        names = ['col_%d' % i for i in range(5)]
        ds = lgb.Dataset(X, feature_name=names).construct()
        ds.set_label(y)
        # Compare pairs of penalties, to ensure scaling works as intended
        pairs = [({'cegb_penalty_feature_coupled': [1, 2, 1, 2, 1]},
                  {'cegb_penalty_feature_coupled': [0.5, 1, 0.5, 1, 0.5], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_feature_lazy': [0.01, 0.02, 0.03, 0.04, 0.05]},
                  {'cegb_penalty_feature_lazy': [0.005, 0.01, 0.015, 0.02, 0.025], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_split': 1},
                  {'cegb_penalty_split': 2, 'cegb_tradeoff': 0.5})]
        for (p1, p2) in pairs:
            booster1 = lgb.Booster(train_set=ds, params=p1)
            booster2 = lgb.Booster(train_set=ds, params=p2)
            for k in range(10):
>               booster1.update()
../tests/python_package_test/test_basic.py:268: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../../.local/lib/python3.5/site-packages/lightgbm/basic.py:1926: in update
    ctypes.byref(is_finished)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
ret = -1
    def _safe_call(ret):
        """Check the return value from C API call.
    
        Parameters
        ----------
        ret : int
            The return value from C API calls.
        """
        if ret != 0:
>           raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
E           lightgbm.basic.LightGBMError: Bug in GPU histogram! split 24: 24, smaller_leaf: 8, larger_leaf: 12
../../../../.local/lib/python3.5/site-packages/lightgbm/basic.py:47: LightGBMError
----------------------------- Captured stdout call -----------------------------
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU @ 2.30GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000112 secs. 0 sparse feature groups
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU @ 2.30GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000114 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.489811
----------------------------- Captured stderr call -----------------------------
[LightGBM] [Fatal] Bug in GPU histogram! split 24: 24, smaller_leaf: 8, larger_leaf: 12

StrikerRUS · 2019-10-03T21:48:34Z

One more time on Travis:

____________________ TestBasic.test_cegb_scaling_equalities ____________________
self = <test_basic.TestBasic testMethod=test_cegb_scaling_equalities>
    def test_cegb_scaling_equalities(self):
        X = np.random.random((1000, 5))
        X[:, [1, 3]] = 0
        y = np.random.random(1000)
        names = ['col_%d' % i for i in range(5)]
        ds = lgb.Dataset(X, feature_name=names).construct()
        ds.set_label(y)
        # Compare pairs of penalties, to ensure scaling works as intended
        pairs = [({'cegb_penalty_feature_coupled': [1, 2, 1, 2, 1]},
                  {'cegb_penalty_feature_coupled': [0.5, 1, 0.5, 1, 0.5], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_feature_lazy': [0.01, 0.02, 0.03, 0.04, 0.05]},
                  {'cegb_penalty_feature_lazy': [0.005, 0.01, 0.015, 0.02, 0.025], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_split': 1},
                  {'cegb_penalty_split': 2, 'cegb_tradeoff': 0.5})]
        for (p1, p2) in pairs:
            booster1 = lgb.Booster(train_set=ds, params=p1)
            booster2 = lgb.Booster(train_set=ds, params=p2)
            for k in range(10):
>               booster1.update()
../tests/python_package_test/test_basic.py:268: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../../.local/lib/python3.6/site-packages/lightgbm/basic.py:1969: in update
    ctypes.byref(is_finished)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
ret = -1
    def _safe_call(ret):
        """Check the return value from C API call.
    
        Parameters
        ----------
        ret : int
            The return value from C API calls.
        """
        if ret != 0:
>           raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
E           lightgbm.basic.LightGBMError: Bug in GPU histogram! split 20: 28, smaller_leaf: 12, larger_leaf: 16
../../../../.local/lib/python3.6/site-packages/lightgbm/basic.py:47: LightGBMError
----------------------------- Captured stdout call -----------------------------
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU @ 2.30GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000111 secs. 0 sparse feature groups
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU @ 2.30GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000110 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.490812
[LightGBM] [Info] Start training from score 0.490812
----------------------------- Captured stderr call -----------------------------
[LightGBM] [Fatal] Bug in GPU histogram! split 20: 28, smaller_leaf: 12, larger_leaf: 16

StrikerRUS · 2019-10-03T21:49:37Z

Reopening, as this error has become quite frequent.

idudch · 2019-11-25T05:22:00Z

Quoting @StrikerRUS: "Reopening, as it error becomes quite frequent."
@StrikerRUS it is actually "a feature, not a bug". I found an explanation in #1116:
"GPU version cannot support categorical features with high cardinality.
You can fix it by split one categorical feature into multi categorical features."
I then noticed that one of my features had a cardinality of 780+. I dropped it and the model worked.
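
The workaround quoted from #1116 does not say how to split a feature. One possible way, assuming the categories are already encoded as non-negative integers, is to decompose each code into a quotient and a remainder. `split_categorical` and the choice of `base=32` are illustrative and not taken from LightGBM; whether the two derived features keep enough signal depends on the data.

```python
import numpy as np


def split_categorical(codes, base=32):
    """Split integer category codes into two lower-cardinality features.

    Returns (high, low) such that high * base + low == codes.
    """
    codes = np.asarray(codes)
    if not np.issubdtype(codes.dtype, np.integer):
        raise TypeError("codes must be integers")
    if base <= 0:
        raise ValueError("base must be positive")
    if np.any(codes < 0):
        raise ValueError("codes must be non-negative")
    high, low = np.divmod(codes, base)
    return high, low


# A feature with 780 categories becomes two features with 25 and 32 categories.
high, low = split_categorical(np.arange(780), base=32)
print(high.max() + 1, low.max() + 1)
```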

StrikerRUS · 2019-11-25T13:14:28Z

@Poltigo Thanks for your comment! But we are speaking about different error messages.

[LightGBM] [Fatal] Bug in GPU histogram! split 21: 27, smaller_leaf: 7, larger_leaf: 17

Our failing test is very simple and there are no categorical features there. Bin size here is OK for GPU learner.

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU @ 2.30GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12

pseudotensor · 2021-03-16T04:25:22Z

@Poltigo Exactly as @StrikerRUS said, I hit this randomly for no reason with categorical_features explicitly empty. It has nothing to do with that. The test that hit this had passed 1000 times before.

File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 794, in fit
    categorical_feature=categorical_feature, callbacks=callbacks, init_model=init_model)
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 637, in fit
    callbacks=callbacks, init_model=init_model)
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 230, in train
    booster = Booster(params=params, train_set=train_set)
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2104, in __init__
    ctypes.byref(self.handle)))
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 52, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: bin size 257 cannot run on GPU

The number of bins was 255, and there were explicitly no categorical features.

Numan100 · 2021-06-22T00:42:38Z

NV GTX 1050
Python 3.6.5
Cuda 11.2
Win 10-X64
vs 2019

My program couldn't even run the GPU version. PyCharm reported:
OSError: exception: access violation reading 0x0000000000000038
Any hints?

Code here:
import lightgbm as lgb
from sklearn.datasets import load_boston
data = lgb.Dataset(*load_boston(True))
lgb.train({'device':'gpu',},data)

Nicky-Jin · 2022-05-20T09:49:53Z

Guys, I hit the same problem and found that it came from my data. After removing the invalid data (NA, inf, null) and the features without variance, the model works well on GPU. (Mine is an RTX 3070.)
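
A minimal sketch of that kind of clean-up, assuming the features and labels are plain numeric NumPy arrays. `clean_features` is a hypothetical helper, not a LightGBM function; it drops rows containing NaN or infinite values and then drops columns that are constant.

```python
import numpy as np


def clean_features(X, y):
    """Drop rows with NaN/inf, then drop zero-variance columns.

    Returns (X_clean, y_clean, kept_columns_mask).
    """
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Keep only rows where every feature and the label are finite.
    row_ok = np.isfinite(X).all(axis=1) & np.isfinite(y)
    X, y = X[row_ok], y[row_ok]
    if X.shape[0] == 0:
        raise ValueError("no valid rows left after removing NaN/inf")
    # A column has zero variance when its minimum equals its maximum.
    col_ok = X.max(axis=0) != X.min(axis=0)
    return X[:, col_ok], y, col_ok


X_raw = np.array([[1.0, 5.0, np.nan],
                  [2.0, 5.0, 3.0],
                  [3.0, 5.0, 4.0],
                  [np.inf, 5.0, 1.0]])
y_raw = np.array([0.0, 1.0, 0.0, 1.0])
X_clean, y_clean, kept = clean_features(X_raw, y_raw)
print(X_clean.shape, kept)
```

Returning the column mask makes it possible to apply the same column selection to validation data later.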

Numan100 · 2022-06-06T03:05:12Z

Thanks a lot, but I could not run on the GPU at all. I suppose it fails one step earlier, before the data is loaded.


Laurae2 assigned huanzhang12 Oct 21, 2017

StrikerRUS added the bug label Oct 30, 2017

ewrfdv mentioned this issue Dec 14, 2017

bin size error #1116

Closed

StrikerRUS closed this as completed Jan 30, 2019

StrikerRUS mentioned this issue May 15, 2019

[python] added ability to pass first_metric_only in params #2175

Merged

StrikerRUS mentioned this issue Jun 17, 2019

[ci] added CODEOWNERS (fixes #2194) #2196

Merged

StrikerRUS reopened this Oct 3, 2019

StrikerRUS mentioned this issue May 11, 2020

v3.0.0rc1 #3071

Merged

guolinke mentioned this issue Aug 10, 2020

[WIP] next release (3.0.0) #3293

Closed

10 tasks
