[EPIC] Allow failed worker retry in distributed training #4753
Comments
Looks interesting. ;-)
This looks interesting indeed. One question: what is "native distributed XGB training", and how can I use it? Is it the one that involves the Python package plus Dask?
If you run xgboost on more than one machine, you are already using rabit. We added the feature there and are tracking the in-progress changes to the xgb layer.
@chenqin, I understand that xgboost uses rabit as its AllReduce implementation. I still don't really know what native distributed XGB training is. I am new to the project and am still discovering the different ways to use xgboost, so this is an honest question.
In some use cases where users don't have complex feature-generation needs, they can launch the native xgboost worker (C++) directly, without a data-processing layer.
From the description in this thread, yes, dask-xgboost leverages rabit as well.
@chenqin It's epic :)
Hi there! Any update on this epic? Thanks.
Yes, it's progressing (slowly, because of the holiday season). The final piece, the XGBoost-Spark patch, is still missing; it will follow once the currently open PR has landed.
I don't think this is possible in the short term. Let's stick with the fail-all strategy for now. We (with @hcho3) are thinking about redesigning RABIT from scratch.
Fault recovery in native distributed XGB training (not xgb-spark) has been broken for more than a year. The community has seen various issues like dmlc/rabit#63.
design doc.pdf
PR candidate: "support bootstrap allreduce/broadcast" (rabit#98)
PR: "remove is_bootstrap parameter" (rabit#102), clean up rabit API
PR: "[rabit_bootstrap_cache ] failed xgb worker recover from other workers" (#4808), patch xgb worker
PR: "cover change with gtest unittests, clean up cmake script and code includes" (rabit#106), unit-test rabit, clean up makefile
Design: "[jvm-packages] retry-able xgboost4j-spark booster" (#4831)
Tracking backlog work from merged PRs
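To make the idea behind "failed xgb worker recover from other workers" a bit more concrete, here is a toy, in-process sketch. It is not rabit's implementation and does not reflect the design doc; `ToyWorker`, `allreduce_sum`, and `recover` are hypothetical names invented for illustration. It only shows the general principle: if every worker caches the results of past allreduce calls, a restarted worker can copy that cache from a surviving peer instead of forcing the whole job to fail.

```python
from typing import Dict, List


class ToyWorker:
    """Hypothetical worker that caches each allreduce result by sequence number."""

    def __init__(self, rank: int) -> None:
        self.rank = rank
        self.cache: Dict[int, float] = {}  # sequence number -> allreduce result
        self.seq = 0                       # index of the next allreduce call


def allreduce_sum(workers: List[ToyWorker], values: List[float]) -> float:
    """Sum one value per worker and record the result in every worker's cache."""
    total = sum(values)
    for w in workers:
        w.cache[w.seq] = total
        w.seq += 1
    return total


def recover(failed_rank: int, workers: List[ToyWorker]) -> ToyWorker:
    """Replace a failed worker with a fresh one seeded from a surviving peer.

    Assumes the list is indexed by rank and at least one peer survived.
    """
    peer = next(w for w in workers if w.rank != failed_rank)
    fresh = ToyWorker(failed_rank)
    fresh.cache = dict(peer.cache)  # copy, so later writes stay independent
    fresh.seq = peer.seq
    workers[failed_rank] = fresh
    return fresh


# Two rounds of allreduce, then worker 1 "fails" and is restarted.
workers = [ToyWorker(r) for r in range(3)]
first = allreduce_sum(workers, [1.0, 2.0, 3.0])
second = allreduce_sum(workers, [4.0, 5.0, 6.0])
restarted = recover(1, workers)
```

After `recover`, the restarted worker holds the same cached results and sequence number as its peers, so it could rejoin at the next allreduce call. Under a fail-all strategy, by contrast, every worker would be torn down and training restarted.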