
BUG in GPU histogram #1003

Open
lorenzoridolfi opened this issue Oct 20, 2017 · 48 comments

@lorenzoridolfi

Environment info

Operating System: Fedora 26
CPU: Intel i5
GPU: NVidia GTX 1060
C++/Python/R version:
Python 3.6.2
CUDA 9.0

Error Message:

[LightGBM] [Info] Number of positive: 17355, number of negative: 458814
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476169, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.048936 secs. 9 sparse feature groups.
[LightGBM] [Info] Number of positive: 17355, number of negative: 458814
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476169, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.048049 secs. 9 sparse feature groups.
[LightGBM] [Info] Number of positive: 17355, number of negative: 458814
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476169, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.039569 secs. 9 sparse feature groups.
[LightGBM] [Info] Number of positive: 17355, number of negative: 458815
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476170, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.035209 secs. 9 sparse feature groups.
[LightGBM] [Info] Number of positive: 17356, number of negative: 458815
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476171, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.040315 secs. 9 sparse feature groups.
[LightGBM] [Fatal] Bug in GPU histogram! split 8211: 11359, smaller_leaf: 9610, larger_leaf: 9960

Traceback (most recent call last):
  File "lightgbm_param.py", line 127, in <module>
    main()
  File "lightgbm_param.py", line 79, in main
    categorical_feature=cat_index_2)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/lightgbm/engine.py", line 443, in cv
    cvfolds.update(fobj=fobj)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/lightgbm/engine.py", line 244, in handlerFunction
    ret.append(getattr(booster, name)(*args, **kwargs))
  File "/usr/local/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py", line 1436, in update
    ctypes.byref(is_finished)))
  File "/usr/local/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py", line 48, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError())
lightgbm.basic.LightGBMError: b'Bug in GPU histogram! split 8211: 11359, smaller_leaf: 9610, larger_leaf: 9960\n'

Reproducible examples

import lightgbm as lgb

params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,
    'learning_rate': 0.005,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'verbose': 1,
    'device': 'gpu'
}

# all_x, all_y and cat_index_2 are defined earlier in the reporter's script
d_train = lgb.Dataset(all_x, label=all_y)

cv_results = lgb.cv(params,
                    d_train,
                    num_boost_round=700,
                    categorical_feature=cat_index_2)
@huanzhang12
Contributor

Thanks for reporting this problem! There might be a bug triggered by a race condition in the GPU code. I suspect it is related to the feature_fraction and bagging_fraction parameters. Could you please change them to 1.0 and see which parameter causes the problem?

I would also really appreciate it if you could reproduce the problem on a public dataset, or share the dataset with me if it is not sensitive. This would greatly help me debug the issue. Thank you!
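
For reference, a minimal sketch of the suggested check in Python, reusing the params dict and d_train from the report above (the names are the reporter's):

import lightgbm as lgb

# Suggested diagnostic: disable both subsampling knobs, then re-enable
# them one at a time to see which one triggers the GPU histogram error.
params_check = dict(params)
params_check['feature_fraction'] = 1.0  # no column subsampling
params_check['bagging_fraction'] = 1.0  # no row subsampling
lgb.cv(params_check, d_train, num_boost_round=700)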

@lorenzoridolfi
Author

lorenzoridolfi commented Oct 29, 2017

Hi, with these two parameters set to 1.0 the bug still happened, but it took several iterations to occur. With the old values the bug happened within very few iterations.

The source code is:
https://www.dropbox.com/s/bqj428pc5vwcpp9/lightgbm_param.py?dl=0

And the data files are:
https://www.dropbox.com/s/6lbpn54sdqn98kd/train.csv?dl=0
https://www.dropbox.com/s/lv8sam3tx415x62/test.csv?dl=0

Best Regards,
Lorenzo

@StrikerRUS added the bug label Oct 30, 2017
@huanzhang12
Contributor

@lorenzoridolfi Thank you for the detailed information on code and data! It is really helpful. I have been a little busy recently, but I will try to catch this bug as quickly as I can.

@lorenzoridolfi
Author

Any news about this bug? It's been almost a month!

Thank you,
Lorenzo

@Laurae2
Contributor

Laurae2 commented Nov 22, 2017

ping @huanzhang12 if you have any news

@huanzhang12
Contributor

Sorry, I have been crazily busy recently and did not get a chance to look into this bug. I will try to work on it during the Thanksgiving holiday. Thanks for your understanding!

@mjaysonnn

Is this bug related to the bin size error? For example, when I use the GPU version of LightGBM, a "bin size 16855 cannot run on GPU" error happens.

@ewrfdv mentioned this issue Dec 14, 2017
@guolinke
Collaborator

@mjaysonnn
The GPU version cannot support categorical features with high cardinality.
You can fix it by splitting one categorical feature into multiple categorical features.
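
For illustration, one possible way to do such a split, assuming a pandas DataFrame df with a high-cardinality column cat (all names here are hypothetical): decompose the category codes into two lower-cardinality "digit" columns.

import pandas as pd

# Hypothetical sketch: decompose one high-cardinality categorical column
# into two lower-cardinality ones (base-256 digits of the category code),
# so each derived column stays within the GPU bin limit.
codes = df['cat'].astype('category').cat.codes
df['cat_lo'] = (codes % 256).astype('category')   # low digit, at most 256 levels
df['cat_hi'] = (codes // 256).astype('category')  # high digit
df = df.drop(columns=['cat'])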

@mjmckp
Contributor

mjmckp commented Jul 23, 2018

I am also getting this error, using the latest version of LightGBM:

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 165885
[LightGBM] [Info] Number of data: 4561756, number of used features: 658
[LightGBM] [Info] Using GPU Device: GeForce GTX 1080 Ti, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (2836.48 MB) transfered to GPU in 1.880998 secs. 7 sparse feature groups
[LightGBM] [Info] Start training from score 0.466854
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.932567 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.892417 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.904232 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.882715 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.908154 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.907432 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.875952 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.865907 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.862585 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.892193 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.891429 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.915810 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.896881 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.894748 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.934688 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.908344 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.879211 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.877866 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.889270 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.850099 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.931063 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.910245 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.856059 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.905356 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.881266 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.867991 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.876403 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.873414 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.899580 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.908517 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.881924 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.907277 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.874490 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.889323 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.890838 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.871581 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.875327 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.885586 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.895304 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.895696 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.935202 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.870557 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.865199 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.912166 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.891637 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.882426 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.894549 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.854855 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.863332 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.881228 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.875864 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.885681 secs. 7 sparse feature groups
[LightGBM] [Fatal] Bug in GPU histogram! split 139388: 62131, smaller_leaf: 62132, larger_leaf: 139387

@huanzhang12
Contributor

@mjmckp Could you please provide the dataset and the python/shell script you used to reproduce this error? This will be really helpful for me to debug this issue.

I tried to reproduce the bug with the dataset and code provided by @lorenzoridolfi, but I cannot reproduce it on three different machines. I tried different feature_fraction and bagging_fraction values but still cannot make the bug appear. @lorenzoridolfi Could you please try the latest LightGBM and see if you are still encountering the same error?

@mjmckp
Contributor

mjmckp commented Jul 24, 2018

Self-contained repro here: https://www.dropbox.com/sh/9f9u7wm5ithfjbr/AADcQ6k8yDSkA3J3vYqsg4Hta?dl=0

Unzip the file dataset.zip and run lightgbm.exe config=repro.conf; console output is in output.txt.

I am running with:

  • LightGBM built from the current master branch (8ce2a232e907d518979e7105842ae575a7427377)
  • Windows 10 Professional
  • NVidia GTX 1080 Ti

@huanzhang12
Contributor

@mjmckp Thank you for providing the dataset and config files! I still cannot reproduce this problem on the AMD and NVIDIA GPUs on my machines. However, I did observe a GPU hang on an Intel integrated GPU, which had not been tested thoroughly before.

There might be a bug with max_bin=255. Could you please try max_bin=63 and see if this bug still occurs (make sure the log says Compiling OpenCL Kernel with 64 bins)? If it disappears, I will investigate the OpenCL kernel for 256 bins carefully.

@mjmckp Another possibility is here: https://github.com/Microsoft/LightGBM/blob/master/src/treelearner/gpu_tree_learner.cpp#L119
If changing max_bin=63 does not work, could you please also try uncommenting this line (return 0;) to make GetNumWorkgroupsPerFeature return 0?

@mjmckp
Contributor

mjmckp commented Jul 24, 2018

After setting max_bin=63 (both when creating the dataset and the trainer), I still get Compiling OpenCL Kernel with 256 bins... How could this be?

@huanzhang12
Contributor

@mjmckp You need to delete the binary training file and regenerate it using save_binary=true.
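
For the Python API, the equivalent steps might look like this (a sketch; the file names are assumptions). max_bin is baked into the Dataset when it is constructed, so the stale binary must be deleted before regenerating it:

import os
import lightgbm as lgb

# The old binary was built with the previous max_bin, so delete it first.
if os.path.exists('train.bin'):
    os.remove('train.bin')

ds = lgb.Dataset('train.csv', params={'max_bin': 63})
ds.save_binary('train.bin')  # later runs can load train.bin directly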

@mjmckp
Contributor

mjmckp commented Jul 24, 2018

Ok, thanks. Setting max_bin=63 also fails with the same error. I have updated the Dropbox directory above with two new files:

  • output2.txt: console output
  • dataset2.zip: new dataset saved with max_bin=63

Btw, when trying to debug this, I tried using LightGBM compiled with #define GPU_DEBUG_COMPARE uncommented in gpu_tree_learner.cpp; however, this generates an access violation. I also tried setting #define GPU_DEBUG 4, but this generates some compile errors, and runtime errors remain even after working around the compile errors...

@mjmckp
Contributor

mjmckp commented Jul 25, 2018

I also tried altering GetNumWorkgroupsPerFeature to return 0, and got the same exception.

@huanzhang12
Contributor

@mjmckp Thank you for providing the new dataset and trying to debug this problem! Unfortunately, I still cannot reproduce the problem with max_bin=64. However, I fixed the GPU debugging mechanism. You can apply the patch here:
https://gist.github.com/huanzhang12/f4f462c56b1920c8e59f3c729e124447
and then #define GPU_DEBUG_COMPARE should work.

@huanzhang12
Contributor

@mjmckp you can also try this branch and see if it fixes it:
https://github.com/Microsoft/LightGBM/tree/gpu_fix
I added a few more boundary checks in the GPU code, but I am not sure if this is the problem.

@mjmckp
Contributor

mjmckp commented Jul 25, 2018 via email

@huanzhang12
Contributor

@mjmckp Thank you for the very detailed debugging log! It seems some counter values are off by 1, but I still have no clue why this happens...

@mjmckp Is the error deterministic (does it occur at the same iteration with the same wrong value each time), or is it random? Could you also try to reduce the dataset size and find a minimal dataset that reproduces this error? Thanks!
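
One crude way to shrink the data, sketched here assuming the dataset is a CSV readable with pandas (file names hypothetical): write progressively smaller random subsets and re-run training on each until the error stops reproducing.

import pandas as pd

# Hypothetical sketch: emit progressively smaller random subsets; the
# smallest one that still triggers the error is a candidate minimal repro.
df = pd.read_csv('dataset.csv')
for frac in (0.5, 0.25, 0.125, 0.0625):
    df.sample(frac=frac, random_state=0).to_csv(
        'dataset_%g.csv' % frac, index=False)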

@mjmckp
Contributor

mjmckp commented Jul 25, 2018

I ran it again using a build from the gpu_fix branch, which fails almost immediately (instead of after a while like before). The output is in output3.txt in the dropbox folder.

@mjmckp
Contributor

mjmckp commented Jul 25, 2018

The file outputs.zip in the Dropbox directory contains the console output from 3 identical runs, using LightGBM compiled from the gpu_fix branch. A diff on the files shows that the program always fails at the same point; however, there are small numerical differences in the calculations leading up to that point.

@huanzhang12
Contributor

@mjmckp I found that my fix actually introduced another bug, and I have just fixed that in the gpu_fix branch.
Could you please re-run training and collect the console outputs? Thanks!

@Laurae2
Contributor

Laurae2 commented Aug 15, 2018

@mjmckp any news?

@mjmckp
Contributor

mjmckp commented Aug 28, 2018

@huanzhang12 It turns out this was an issue with a faulty GPU; this issue can be closed now, IMO.

@huanzhang12
Contributor

@mjmckp Thank you for reporting back that the issue was actually caused by a faulty GPU! LightGBM seems to be a good candidate for a GPU stability test :)
@lorenzoridolfi Are you still encountering this issue? Can you try replacing the GPU and see if the error still occurs?

@jjdelvalle

@guolinke You mentioned in this issue that high cardinality variables are an issue for GPUs. Is there a way LightGBM could display which variable specifically is giving it problems? Alternatively, how does one check the cardinality of variables? I'm unsure what is meant by that... simply the number of unique categorical values?

@guolinke
Collaborator

@clinchergt yeah, it is the number of unique categorical values.
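
For a pandas DataFrame, that check is a one-liner (a sketch; df is an assumption):

# Cardinality (number of unique values) per categorical column.
cardinality = df.select_dtypes(['object', 'category']).nunique()
print(cardinality.sort_values(ascending=False))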

@jjdelvalle

jjdelvalle commented Sep 19, 2018

@guolinke How is the number of bins determined? Is it directly correlated with the number of unique categorical values? How can I determine how many bins a specific variable is going to need?

@StrikerRUS
Collaborator

@huanzhang12 What is the fate of the gpu_fix branch? Can this issue be closed?

@StrikerRUS
Collaborator

@huanzhang12 Seems that someone removed the gpu_fix branch...

@huanzhang12
Contributor

@StrikerRUS Yes, that branch should be deleted. This issue can now be closed. If a new problem arises, a new issue can be opened.

@StrikerRUS
Collaborator

@huanzhang12 I've just caught the same error in our CI Docker, having just switched the compiler from gcc to clang here:

COMPILER: gcc

Docker:
https://github.com/microsoft/LightGBM/blob/40e3048f6185bb8f3f50bd9fe7275cf514b03b16/.ci/dockers/ubuntu-14.04/Dockerfile

https://hub.docker.com/r/lightgbm/vsts-agent
Steps to reproduce:

LightGBM/.vsts-ci.yml

Lines 40 to 57 in 40e3048

steps:
- script: |
    echo "##vso[task.setvariable variable=HOME_DIRECTORY]$AGENT_HOMEDIRECTORY"
    echo "##vso[task.setvariable variable=BUILD_DIRECTORY]$BUILD_SOURCESDIRECTORY"
    echo "##vso[task.setvariable variable=OS_NAME]linux"
    echo "##vso[task.setvariable variable=AZURE]true"
    echo "##vso[task.setvariable variable=LGB_VER]$(head -n 1 VERSION.txt)"
    echo "##vso[task.prependpath]$CONDA/bin"
    AMDAPPSDK_PATH=$BUILD_SOURCESDIRECTORY/AMDAPPSDK
    echo "##vso[task.setvariable variable=AMDAPPSDK_PATH]$AMDAPPSDK_PATH"
    LD_LIBRARY_PATH=$AMDAPPSDK_PATH/lib/x86_64:$LD_LIBRARY_PATH
    echo "##vso[task.setvariable variable=LD_LIBRARY_PATH]$LD_LIBRARY_PATH"
    echo "##vso[task.setvariable variable=OPENCL_VENDOR_PATH]$AMDAPPSDK_PATH/etc/OpenCL/vendors"
  displayName: 'Set variables'
- bash: $(Build.SourcesDirectory)/.ci/setup.sh
  displayName: Setup
- bash: $(Build.SourcesDirectory)/.ci/test.sh
  displayName: Test

Logs can be found here: https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=2045
Essential part of the log:

============================= test session starts ==============================
platform linux -- Python 3.6.8, pytest-4.4.1, py-1.8.0, pluggy-0.9.0
rootdir: /__w/1/s
collected 77 items

../tests/c_api_test/test_.py ..                                          [  2%]
../tests/python_package_test/test_basic.py ........F...                  [ 18%]
../tests/python_package_test/test_consistency.py ....                    [ 23%]
../tests/python_package_test/test_engine.py ............................ [ 59%]
.......                                                                  [ 68%]
../tests/python_package_test/test_plotting.py .....                      [ 75%]
../tests/python_package_test/test_sklearn.py ...................         [100%]

=================================== FAILURES ===================================
____________________ TestBasic.test_cegb_scaling_equalities ____________________

self = <test_basic.TestBasic testMethod=test_cegb_scaling_equalities>

    def test_cegb_scaling_equalities(self):
        X = np.random.random((1000, 5))
        X[:, [1, 3]] = 0
        y = np.random.random(1000)
        names = ['col_%d' % i for i in range(5)]
        ds = lgb.Dataset(X, feature_name=names).construct()
        ds.set_label(y)
        # Compare pairs of penalties, to ensure scaling works as intended
        pairs = [({'cegb_penalty_feature_coupled': [1, 2, 1, 2, 1]},
                  {'cegb_penalty_feature_coupled': [0.5, 1, 0.5, 1, 0.5], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_feature_lazy': [0.01, 0.02, 0.03, 0.04, 0.05]},
                  {'cegb_penalty_feature_lazy': [0.005, 0.01, 0.015, 0.02, 0.025], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_split': 1},
                  {'cegb_penalty_split': 2, 'cegb_tradeoff': 0.5})]
        for (p1, p2) in pairs:
            booster1 = lgb.Booster(train_set=ds, params=p1)
            booster2 = lgb.Booster(train_set=ds, params=p2)
            for k in range(10):
>               booster1.update()

../tests/python_package_test/test_basic.py:268: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/vsts_azpcontainer/.local/lib/python3.6/site-packages/lightgbm/basic.py:1885: in update
    ctypes.byref(is_finished)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ret = -1

    def _safe_call(ret):
        """Check the return value from C API call.
    
        Parameters
        ----------
        ret : int
            The return value from C API calls.
        """
        if ret != 0:
>           raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
E           lightgbm.basic.LightGBMError: Bug in GPU histogram! split 25: 31, smaller_leaf: 9, larger_leaf: 11

/home/vsts_azpcontainer/.local/lib/python3.6/site-packages/lightgbm/basic.py:47: LightGBMError
----------------------------- Captured stdout call -----------------------------
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000079 secs. 0 sparse feature groups
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000077 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.512702
[LightGBM] [Info] Start training from score 0.512702
----------------------------- Captured stderr call -----------------------------
[LightGBM] [Fatal] Bug in GPU histogram! split 25: 31, smaller_leaf: 9, larger_leaf: 11

Re-running the CI job made the error vanish. Strange...

@StrikerRUS
Collaborator

Caught this error again today, but with gcc this time:

============================= test session starts ==============================
platform linux -- Python 3.6.8, pytest-4.4.2, py-1.8.0, pluggy-0.11.0
rootdir: /__w/1/s
collected 77 items

../tests/c_api_test/test_.py ..                                          [  2%]
../tests/python_package_test/test_basic.py ........F...                  [ 18%]
../tests/python_package_test/test_consistency.py ....                    [ 23%]
../tests/python_package_test/test_engine.py ............................ [ 59%]
.......                                                                  [ 68%]
../tests/python_package_test/test_plotting.py .....                      [ 75%]
../tests/python_package_test/test_sklearn.py ...................         [100%]

=================================== FAILURES ===================================
____________________ TestBasic.test_cegb_scaling_equalities ____________________

self = <test_basic.TestBasic testMethod=test_cegb_scaling_equalities>

    def test_cegb_scaling_equalities(self):
        X = np.random.random((1000, 5))
        X[:, [1, 3]] = 0
        y = np.random.random(1000)
        names = ['col_%d' % i for i in range(5)]
        ds = lgb.Dataset(X, feature_name=names).construct()
        ds.set_label(y)
        # Compare pairs of penalties, to ensure scaling works as intended
        pairs = [({'cegb_penalty_feature_coupled': [1, 2, 1, 2, 1]},
                  {'cegb_penalty_feature_coupled': [0.5, 1, 0.5, 1, 0.5], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_feature_lazy': [0.01, 0.02, 0.03, 0.04, 0.05]},
                  {'cegb_penalty_feature_lazy': [0.005, 0.01, 0.015, 0.02, 0.025], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_split': 1},
                  {'cegb_penalty_split': 2, 'cegb_tradeoff': 0.5})]
        for (p1, p2) in pairs:
            booster1 = lgb.Booster(train_set=ds, params=p1)
            booster2 = lgb.Booster(train_set=ds, params=p2)
            for k in range(10):
>               booster1.update()

../tests/python_package_test/test_basic.py:268: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/vsts_azpcontainer/.local/lib/python3.6/site-packages/lightgbm/basic.py:1885: in update
    ctypes.byref(is_finished)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ret = -1

    def _safe_call(ret):
        """Check the return value from C API call.
    
        Parameters
        ----------
        ret : int
            The return value from C API calls.
        """
        if ret != 0:
>           raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
E           lightgbm.basic.LightGBMError: Bug in GPU histogram! split 21: 22, smaller_leaf: 12, larger_leaf: 10

/home/vsts_azpcontainer/.local/lib/python3.6/site-packages/lightgbm/basic.py:47: LightGBMError
----------------------------- Captured stdout call -----------------------------
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000191 secs. 0 sparse feature groups
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000092 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.509289
----------------------------- Captured stderr call -----------------------------
[LightGBM] [Fatal] Bug in GPU histogram! split 21: 22, smaller_leaf: 12, larger_leaf: 10

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=2107

@ashrith

ashrith commented May 24, 2019

@huanzhang12 Hi Huan, I am getting the same error when I run LightGBM on an NVIDIA 2080 Ti. The following is the error:

➜  higgs /home/bartha/LightGBM/lightgbm config=lightgbm_gpu.conf data=higgs.train valid=higgs.test objective=binary metric=auc save_binary=true
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Saving data to binary file higgs.train.bin
[LightGBM] [Info] Saving data to binary file higgs.test.bin
[LightGBM] [Info] Finished loading data in 13.653178 seconds
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Number of positive: 5564616, number of negative: 4935384
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1535
[LightGBM] [Info] Number of data: 10500000, number of used features: 28
[LightGBM] [Info] Using requested OpenCL platform 0 device 0
[LightGBM] [Info] Using GPU Device: GeForce RTX 2080 Ti, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 64 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 28 dense feature groups (280.38 MB) transferred to GPU in 0.290521 secs. 0 sparse feature groups
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.529963 -> initscore=0.119997
[LightGBM] [Info] Start training from score 0.119997
[LightGBM] [Fatal] Bug in GPU histogram! split 1157984: 721078, smaller_leaf: 721073, larger_leaf: 1157989

Met Exceptions:
Bug in GPU histogram! split 1157984: 721078, smaller_leaf: 721073, larger_leaf: 1157989

Please let me know if you need more information; I would be happy to help.
The dataset I am using is the HIGGS dataset, and the following is my config:

verbosity = 2
max_bin = 63
num_leaves = 255
num_iterations = 50
learning_rate = 0.1
tree_learner = serial
task = train
is_training_metric = false
min_data_in_leaf = 1
min_sum_hessian_in_leaf = 100
ndcg_eval_at = 1,3,5,10
sparse_threshold = 1.0
device = gpu
gpu_platform_id = 0
gpu_device_id = 0
num_threads=32

It works perfectly fine when I run on the CPU, but fails on the GPU.

@StrikerRUS
Collaborator

Another one: https://lightgbm-ci.visualstudio.com/lightgbm-ci/_build/results?buildId=2380

And again in the test_cegb_scaling_equalities test.

____________________ TestBasic.test_cegb_scaling_equalities ____________________

self = <test_basic.TestBasic testMethod=test_cegb_scaling_equalities>

    def test_cegb_scaling_equalities(self):
        X = np.random.random((1000, 5))
        X[:, [1, 3]] = 0
        y = np.random.random(1000)
        names = ['col_%d' % i for i in range(5)]
        ds = lgb.Dataset(X, feature_name=names).construct()
        ds.set_label(y)
        # Compare pairs of penalties, to ensure scaling works as intended
        pairs = [({'cegb_penalty_feature_coupled': [1, 2, 1, 2, 1]},
                  {'cegb_penalty_feature_coupled': [0.5, 1, 0.5, 1, 0.5], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_feature_lazy': [0.01, 0.02, 0.03, 0.04, 0.05]},
                  {'cegb_penalty_feature_lazy': [0.005, 0.01, 0.015, 0.02, 0.025], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_split': 1},
                  {'cegb_penalty_split': 2, 'cegb_tradeoff': 0.5})]
        for (p1, p2) in pairs:
            booster1 = lgb.Booster(train_set=ds, params=p1)
            booster2 = lgb.Booster(train_set=ds, params=p2)
            for k in range(10):
>               booster1.update()

../tests/python_package_test/test_basic.py:268: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/vsts_azpcontainer/.local/lib/python3.6/site-packages/lightgbm/basic.py:1896: in update
    ctypes.byref(is_finished)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ret = -1

    def _safe_call(ret):
        """Check the return value from C API call.
    
        Parameters
        ----------
        ret : int
            The return value from C API calls.
        """
        if ret != 0:
>           raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
E           lightgbm.basic.LightGBMError: Bug in GPU histogram! split 20: 20, smaller_leaf: 13, larger_leaf: 19

/home/vsts_azpcontainer/.local/lib/python3.6/site-packages/lightgbm/basic.py:47: LightGBMError
----------------------------- Captured stdout call -----------------------------
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000043 secs. 0 sparse feature groups
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000075 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.499788
[LightGBM] [Info] Start training from score 0.499788
----------------------------- Captured stderr call -----------------------------
[LightGBM] [Fatal] Bug in GPU histogram! split 20: 20, smaller_leaf: 13, larger_leaf: 19

@huanzhang12 Can you please take a look at that test?

def test_cegb_scaling_equalities(self):

@huanzhang12
Contributor

It is weird that such a simple test fails, especially since it never failed before. I will take a look at this, but I have a very busy schedule at the moment, so I probably cannot fix it immediately.

@StrikerRUS
Collaborator

@huanzhang12 Thanks a lot! It's quite weird that the bug happens very rarely but always in the same test. CEGB and the corresponding failing test were introduced in #2014.

@StrikerRUS
Collaborator

Happened again yesterday after a long break.
https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=2821

=================================== FAILURES ===================================
____________________ TestBasic.test_cegb_scaling_equalities ____________________

self = <test_basic.TestBasic testMethod=test_cegb_scaling_equalities>

    def test_cegb_scaling_equalities(self):
        X = np.random.random((1000, 5))
        X[:, [1, 3]] = 0
        y = np.random.random(1000)
        names = ['col_%d' % i for i in range(5)]
        ds = lgb.Dataset(X, feature_name=names).construct()
        ds.set_label(y)
        # Compare pairs of penalties, to ensure scaling works as intended
        pairs = [({'cegb_penalty_feature_coupled': [1, 2, 1, 2, 1]},
                  {'cegb_penalty_feature_coupled': [0.5, 1, 0.5, 1, 0.5], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_feature_lazy': [0.01, 0.02, 0.03, 0.04, 0.05]},
                  {'cegb_penalty_feature_lazy': [0.005, 0.01, 0.015, 0.02, 0.025], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_split': 1},
                  {'cegb_penalty_split': 2, 'cegb_tradeoff': 0.5})]
        for (p1, p2) in pairs:
            booster1 = lgb.Booster(train_set=ds, params=p1)
            booster2 = lgb.Booster(train_set=ds, params=p2)
            for k in range(10):
>               booster1.update()

../tests/python_package_test/test_basic.py:268: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/vsts_azpcontainer/.local/lib/python3.6/site-packages/lightgbm/basic.py:1926: in update
    ctypes.byref(is_finished)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ret = -1

    def _safe_call(ret):
        """Check the return value from C API call.
    
        Parameters
        ----------
        ret : int
            The return value from C API calls.
        """
        if ret != 0:
>           raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
E           lightgbm.basic.LightGBMError: Bug in GPU histogram! split 21: 27, smaller_leaf: 7, larger_leaf: 17

/home/vsts_azpcontainer/.local/lib/python3.6/site-packages/lightgbm/basic.py:47: LightGBMError
----------------------------- Captured stdout call -----------------------------
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000109 secs. 0 sparse feature groups
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000114 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.506630
----------------------------- Captured stderr call -----------------------------
[LightGBM] [Fatal] Bug in GPU histogram! split 21: 27, smaller_leaf: 7, larger_leaf: 17

@StrikerRUS
Collaborator

@huanzhang12 Any success? It happened one more time today on Travis:

____________________ TestBasic.test_cegb_scaling_equalities ____________________
self = <test_basic.TestBasic testMethod=test_cegb_scaling_equalities>
    def test_cegb_scaling_equalities(self):
        X = np.random.random((1000, 5))
        X[:, [1, 3]] = 0
        y = np.random.random(1000)
        names = ['col_%d' % i for i in range(5)]
        ds = lgb.Dataset(X, feature_name=names).construct()
        ds.set_label(y)
        # Compare pairs of penalties, to ensure scaling works as intended
        pairs = [({'cegb_penalty_feature_coupled': [1, 2, 1, 2, 1]},
                  {'cegb_penalty_feature_coupled': [0.5, 1, 0.5, 1, 0.5], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_feature_lazy': [0.01, 0.02, 0.03, 0.04, 0.05]},
                  {'cegb_penalty_feature_lazy': [0.005, 0.01, 0.015, 0.02, 0.025], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_split': 1},
                  {'cegb_penalty_split': 2, 'cegb_tradeoff': 0.5})]
        for (p1, p2) in pairs:
            booster1 = lgb.Booster(train_set=ds, params=p1)
            booster2 = lgb.Booster(train_set=ds, params=p2)
            for k in range(10):
>               booster1.update()
../tests/python_package_test/test_basic.py:268: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../../.local/lib/python3.5/site-packages/lightgbm/basic.py:1926: in update
    ctypes.byref(is_finished)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
ret = -1
    def _safe_call(ret):
        """Check the return value from C API call.
    
        Parameters
        ----------
        ret : int
            The return value from C API calls.
        """
        if ret != 0:
>           raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
E           lightgbm.basic.LightGBMError: Bug in GPU histogram! split 24: 24, smaller_leaf: 8, larger_leaf: 12
../../../../.local/lib/python3.5/site-packages/lightgbm/basic.py:47: LightGBMError
----------------------------- Captured stdout call -----------------------------
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU @ 2.30GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000112 secs. 0 sparse feature groups
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU @ 2.30GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000114 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.489811
----------------------------- Captured stderr call -----------------------------
[LightGBM] [Fatal] Bug in GPU histogram! split 24: 24, smaller_leaf: 8, larger_leaf: 12

@StrikerRUS
Collaborator

One more time on Travis:

____________________ TestBasic.test_cegb_scaling_equalities ____________________
self = <test_basic.TestBasic testMethod=test_cegb_scaling_equalities>
    def test_cegb_scaling_equalities(self):
        X = np.random.random((1000, 5))
        X[:, [1, 3]] = 0
        y = np.random.random(1000)
        names = ['col_%d' % i for i in range(5)]
        ds = lgb.Dataset(X, feature_name=names).construct()
        ds.set_label(y)
        # Compare pairs of penalties, to ensure scaling works as intended
        pairs = [({'cegb_penalty_feature_coupled': [1, 2, 1, 2, 1]},
                  {'cegb_penalty_feature_coupled': [0.5, 1, 0.5, 1, 0.5], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_feature_lazy': [0.01, 0.02, 0.03, 0.04, 0.05]},
                  {'cegb_penalty_feature_lazy': [0.005, 0.01, 0.015, 0.02, 0.025], 'cegb_tradeoff': 2}),
                 ({'cegb_penalty_split': 1},
                  {'cegb_penalty_split': 2, 'cegb_tradeoff': 0.5})]
        for (p1, p2) in pairs:
            booster1 = lgb.Booster(train_set=ds, params=p1)
            booster2 = lgb.Booster(train_set=ds, params=p2)
            for k in range(10):
>               booster1.update()
../tests/python_package_test/test_basic.py:268: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../../.local/lib/python3.6/site-packages/lightgbm/basic.py:1969: in update
    ctypes.byref(is_finished)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
ret = -1
    def _safe_call(ret):
        """Check the return value from C API call.
    
        Parameters
        ----------
        ret : int
            The return value from C API calls.
        """
        if ret != 0:
>           raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
E           lightgbm.basic.LightGBMError: Bug in GPU histogram! split 20: 28, smaller_leaf: 12, larger_leaf: 16
../../../../.local/lib/python3.6/site-packages/lightgbm/basic.py:47: LightGBMError
----------------------------- Captured stdout call -----------------------------
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU @ 2.30GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000111 secs. 0 sparse feature groups
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU @ 2.30GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 3 dense feature groups (0.00 MB) transferred to GPU in 0.000110 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.490812
[LightGBM] [Info] Start training from score 0.490812
----------------------------- Captured stderr call -----------------------------
[LightGBM] [Fatal] Bug in GPU histogram! split 20: 28, smaller_leaf: 12, larger_leaf: 16

@StrikerRUS
Collaborator

Reopening, as this error is becoming quite frequent.

@StrikerRUS reopened this Oct 3, 2019
@idudch

idudch commented Nov 25, 2019

"Reopening, as this error is becoming quite frequent."

@StrikerRUS it is actually 'a feature, not a bug'. I found an explanation here: #1116

"The GPU version cannot support categorical features with high cardinality. You can fix it by splitting one categorical feature into multiple categorical features."

I noticed that one of my features had a cardinality of 780+. I dropped it and the model worked.

@StrikerRUS
Collaborator

@Poltigo Thanks for your comment! But we are speaking about different error messages.

[LightGBM] [Fatal] Bug in GPU histogram! split 21: 27, smaller_leaf: 7, larger_leaf: 17

Our failing test is very simple, and there are no categorical features in it. The bin size here is OK for the GPU learner.

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data: 1000, number of used features: 3
[LightGBM] [Info] Using GPU Device: Intel(R) Xeon(R) CPU @ 2.30GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12

@pseudotensor

pseudotensor commented Mar 16, 2021

@Poltigo Exactly as @StrikerRUS said, I hit this randomly for no reason with categorical_features explicitly empty; it has nothing to do with that. The test that hit this had passed 1000 times before.

File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 794, in fit
    categorical_feature=categorical_feature, callbacks=callbacks, init_model=init_model)
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 637, in fit
    callbacks=callbacks, init_model=init_model)
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 230, in train
    booster = Booster(params=params, train_set=train_set)
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2104, in __init__
    ctypes.byref(self.handle)))
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 52, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: bin size 257 cannot run on GPU

The number of bins was 255, and no categorical features were explicitly chosen.

@Numan100

NVIDIA GTX 1050
Python 3.6.5
CUDA 11.2
Win 10 x64
VS 2019

My program couldn't even run the GPU version; PyCharm reported:
OSError: exception: access violation reading 0x0000000000000038
Any hints?

Code:
import lightgbm as lgb
from sklearn.datasets import load_boston

# load_boston(True) returns (X, y); they become the data and label.
data = lgb.Dataset(*load_boston(True))
lgb.train({'device': 'gpu'}, data)

@Nicky-Jin

Guys, I hit the same problem and found that it resulted from my data. After removing the invalid data (NA, inf, null) and the features without variance, the model works well on the GPU. (Mine is an RTX 3070.)
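
A sketch of that cleanup for a pandas DataFrame df (the variable name is an assumption):

import numpy as np

# Drop rows containing NA/inf and columns with zero variance,
# matching the cleanup described above.
df = df.replace([np.inf, -np.inf], np.nan).dropna()
df = df.loc[:, df.nunique() > 1]  # remove constant (zero-variance) columns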

@Numan100

Numan100 commented Jun 6, 2022 via email
