LightGBM vs XGBoost accuracy/speed #3417
@kaz-Anova recently pointed out that XGBoost is falling behind LightGBM in accuracy on recent Kaggle competitions.
This refers to the "tree_method":"hist" algorithm. If anyone has time, it would be nice to figure out the root cause of this.
Speed is also a priority, but I think less so than accuracy.
cc @hcho3

Comments
@kaz-Anova @RAMitchell Is the issue specific to CPU-hist, or do you see the same issue with GPU-hist as well?
This is re: the CPU version. Not sure about the GPU version yet. In my recent experiments, GPU hist commonly outperforms CPU hist by a very small amount.
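For anyone who wants to compare the two updaters directly, here is a minimal sketch (the synthetic dataset and parameter values are illustrative, not from the benchmarks discussed here, and `gpu_hist` assumes an XGBoost build with GPU support):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic data stands in for a real benchmark dataset.
X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

for method in ["hist", "gpu_hist"]:  # gpu_hist requires a CUDA-enabled build
    params = {"tree_method": method, "max_depth": 6, "eta": 0.1,
              "objective": "binary:logistic"}
    bst = xgb.train(params, dtrain, num_boost_round=200)
    preds = bst.predict(dtest)
    print(method, roc_auc_score(y_test, preds))
```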
@RAMitchell I am willing to investigate it if there is a reproducible example that demonstrates the lower model accuracy. At any rate, I wrote the CPU-hist code when I was pretty new to XGBoost, so I'd like to come back to it and make it better. (One of the glaring shortcomings is that it doesn't support distributed training yet.)
@hcho3 Is there any recent plan to support distributed training?
@CodingCat Not that I am aware of. I have had some people inquire about a distributed hist updater. How important do you think it is? There are some commonalities between 'approx' and 'hist', one of which being that quantiles are used as split candidates. The major difference is that 'hist' starts by quantizing the data matrix, which enables some optimizations. EDIT: If distributed 'hist' is deemed to be important, I can bring it up to my manager to carve out time to have it implemented.
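To illustrate that quantization step, here is a rough sketch, not XGBoost internals (the helper names `quantile_cuts` and `quantize` are made up for illustration), of mapping a raw feature column to bin indices via quantile cut points:

```python
import numpy as np

def quantile_cuts(col, max_bins=256):
    # Candidate split points taken at (approximate) quantiles of the feature.
    quantiles = np.linspace(0, 1, max_bins + 1)[1:-1]
    return np.unique(np.quantile(col, quantiles))

def quantize(col, cuts):
    # 'hist' replaces raw feature values with bin indices once, up front,
    # so each boosting iteration only accumulates gradient histograms per bin.
    return np.searchsorted(cuts, col, side="right")

rng = np.random.default_rng(0)
col = rng.normal(size=10_000)
cuts = quantile_cuts(col)
binned = quantize(col, cuts)
print(binned.min(), binned.max(), len(cuts))
```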
Regarding the importance of hist: I would say that in many companies, like my current employer, distributed training is the major use case, and having a faster algorithm is definitely helpful for those users.
@hcho3 Here are a few data points from my recent experiments (https://github.com/RAMitchell/GBM-Benchmarks).
@RAMitchell Thanks for posting the benchmarks. I will take a look at them. As for the hist algorithm, yes, I'll try to get dev time for either myself or someone else. @CodingCat Can we arrange an in-person meeting within the next two weeks? (I am currently in Seattle.) I'd like to hear more about your thoughts on future priorities for XGBoost development. The intern I am mentoring would like to meet you as well. If you are available for a meeting, please e-mail me at chohyu01 (at) cs.washington.edu.
+1 for this feature @hcho3 |
@hcho3 Also, something to take into account: the xgboost CPU histogram is slow mainly because it uses 64 threads (32 physical cores). One thread is sometimes faster in my benchmarks with very similar parameters, even accounting for the frequency advantage. As an example: 500 iterations on Bosch (depth 6) on an i7-7700K at 4.5 GHz with more data (1 million rows), vs. the reported 810 seconds with 64 threads (60% of the approx. 1.2 million rows).
This is still a problem because the default uses all CPU threads. We could internally limit the number of threads used by the hist algorithm as a quick fix, but it would be nicer to get to the root of the problem.
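In the meantime, a user-side workaround is to cap the thread count via the `nthread` parameter; a minimal sketch (the random dataset and the value 8 are only illustrative):

```python
import numpy as np
import xgboost as xgb

# Small random dataset just to make the snippet self-contained.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = (X[:, 0] + rng.normal(size=10_000) > 0).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "tree_method": "hist",
    "max_depth": 6,
    "objective": "binary:logistic",
    "nthread": 8,  # cap threads instead of defaulting to all hardware threads
}
bst = xgb.train(params, dtrain, num_boost_round=500)
```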
Closing this, since XGBoost has progressed substantially in terms of performance: #3810, szilard/GBM-perf#41. As for accuracy, there are several factors involved.
Also, XGBoost has regained mind share; see this Twitter poll. XGBoost has state-of-the-art performance on GPU, and it has cutting-edge integration with Dask.
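For reference, a minimal sketch of what that Dask integration looks like from the user side (assuming a recent xgboost with dask and dask.distributed installed; the local cluster and random data are stand-ins for a real deployment):

```python
import xgboost as xgb
from dask import array as da
from dask.distributed import Client, LocalCluster

# A local cluster stands in for a real multi-node deployment.
cluster = LocalCluster(n_workers=2)
client = Client(cluster)

X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = da.random.randint(0, 2, size=(100_000,), chunks=(10_000,))

dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(client,
                        {"tree_method": "hist", "objective": "binary:logistic"},
                        dtrain, num_boost_round=100)
booster = output["booster"]  # trained model; output["history"] holds eval logs
```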