LightGBM vs XGBoost accuracy/speed #3417
@kaz-Anova recently pointed out that XGBoost is falling behind LightGBM in accuracy on recent Kaggle competitions.
This refers to the "tree_method":"hist" algorithm. If anyone has time, it would be nice to figure out the root cause of this.
Speed is also a priority, but I think less so than accuracy.
cc @hcho3

Comments
@kaz-Anova @RAMitchell Is the issue specific to CPU-hist, or do you see the same issue with GPU-hist as well?
This is re: the CPU version. Not sure about the GPU version yet. In my recent experiments, GPU hist commonly outperforms CPU hist by a very small amount.
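For anyone who wants to compare the two updaters directly, here is a minimal sketch (the synthetic dataset and parameter values are illustrative, not from the benchmarks discussed here, and `gpu_hist` assumes an XGBoost build with GPU support):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic data stands in for a real benchmark dataset.
X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

for method in ["hist", "gpu_hist"]:  # gpu_hist requires a CUDA-enabled build
    params = {"tree_method": method, "max_depth": 6, "eta": 0.1,
              "objective": "binary:logistic"}
    bst = xgb.train(params, dtrain, num_boost_round=200)
    preds = bst.predict(dtest)
    print(method, roc_auc_score(y_test, preds))
```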
@RAMitchell I am willing to investigate it if there is a reproducible example that demonstrates the lower model accuracy. At any rate, I wrote the CPU-hist code when I was pretty new to XGBoost, so I'd like to come back to it and make it better. (One of the glaring shortcomings is that it doesn't support distributed training yet.)
@hcho3 Is there any recent plan to support distributed training?
@CodingCat Not that I am aware of. I have had some people inquire about a distributed hist updater. How important do you think it is? There are some commonalities between 'approx' and 'hist', one of which being that quantiles are used as split candidates. The major difference is that 'hist' starts by quantizing the data matrix, which enables some optimizations. EDIT: If distributed 'hist' is deemed to be important, I can bring it up to my manager to carve out time to have it implemented.
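To illustrate that quantization step, here is a rough sketch, not XGBoost internals (the helper names `quantile_cuts` and `quantize` are made up for illustration), of mapping a raw feature column to bin indices via quantile cut points:

```python
import numpy as np

def quantile_cuts(col, max_bins=256):
    # Candidate split points taken at (approximate) quantiles of the feature.
    quantiles = np.linspace(0, 1, max_bins + 1)[1:-1]
    return np.unique(np.quantile(col, quantiles))

def quantize(col, cuts):
    # 'hist' replaces raw feature values with bin indices once, up front,
    # so each boosting iteration only accumulates gradient histograms per bin.
    return np.searchsorted(cuts, col, side="right")

rng = np.random.default_rng(0)
col = rng.normal(size=10_000)
cuts = quantile_cuts(col)
binned = quantize(col, cuts)
print(binned.min(), binned.max(), len(cuts))
```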
Regarding the importance of hist: I would say that in many companies, like my current employer, distributed training is the major use case, and having a faster algorithm is definitely helpful for those users.
@hcho3 Here are a few data points from my recent experiments (https://github.com/RAMitchell/GBM-Benchmarks).
@RAMitchell Thanks for posting the benchmarks. I will take a look at them. As for the hist algorithm, yes, I'll try to get dev time for either myself or someone else. @CodingCat Can we arrange an in-person meeting within the next two weeks? (I am currently in Seattle.) I'd like to hear more about your thoughts on future priorities for XGBoost development. The intern I am mentoring would like to meet you as well. If you are available for a meeting, please e-mail me at chohyu01 (at) cs.washington.edu.
+1 for this feature @hcho3 |
@hcho3 Also, something to take into account: the xgboost CPU histogram is slow mainly because it uses 64 threads (32 physical cores). One thread is sometimes faster in my benchmarks with very similar parameters, even accounting for the frequency advantage. As an example: 500 iterations on Bosch (depth 6) on an i7-7700K at 4.5 GHz with more data (1 million rows), vs. the reported 810 seconds with 64 threads (60% of the approx. 1.2 million rows).
This is still a problem because the default uses all CPU threads. We could internally limit the number of threads used by the hist algorithm as a quick fix, but it would be nicer to get to the root of the problem.
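In the meantime, a user-side workaround is to cap the thread count via the `nthread` parameter; a minimal sketch (the random dataset and the value 8 are only illustrative):

```python
import numpy as np
import xgboost as xgb

# Small random dataset just to make the snippet self-contained.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = (X[:, 0] + rng.normal(size=10_000) > 0).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "tree_method": "hist",
    "max_depth": 6,
    "objective": "binary:logistic",
    "nthread": 8,  # cap threads instead of defaulting to all hardware threads
}
bst = xgb.train(params, dtrain, num_boost_round=500)
```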
Closing this, since XGBoost has progressed substantially in terms of performance: #3810, szilard/GBM-perf#41. As for accuracy, there are several factors involved.
Also, XGBoost has regained mind share; see this Twitter poll. XGBoost has state-of-the-art performance on GPU, and it has cutting-edge integration with Dask.
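For reference, a minimal sketch of what that Dask integration looks like from the user side (assuming a recent xgboost with dask and dask.distributed installed; the local cluster and random data are stand-ins for a real deployment):

```python
import xgboost as xgb
from dask import array as da
from dask.distributed import Client, LocalCluster

# A local cluster stands in for a real multi-node deployment.
cluster = LocalCluster(n_workers=2)
client = Client(cluster)

X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = da.random.randint(0, 2, size=(100_000,), chunks=(10_000,))

dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(client,
                        {"tree_method": "hist", "objective": "binary:logistic"},
                        dtrain, num_boost_round=100)
booster = output["booster"]  # trained model; output["history"] holds eval logs
```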