
[EPIC] Allow failed worker retry in distributed training #4753

Closed · 6 of 10 tasks
chenqin opened this issue Aug 8, 2019 · 10 comments

Comments

@chenqin
Contributor

chenqin commented Aug 8, 2019

Fault recovery in native distributed XGB training (not xgb-spark) has been broken for more than a year. The community has seen various issues like dmlc/rabit#63.

Tracking backlog work from merged PRs.

@trivialfis
Member

Looks interesting. ;-)

@trams
Contributor

trams commented Aug 20, 2019

This looks interesting indeed. One question: what is "native distributed XGB training"? How can I use it? Is it the one that involves a Python package + dask?

@chenqin
Contributor Author

chenqin commented Aug 23, 2019

> This looks interesting indeed. One question: what is "native distributed XGB training"? How can I use it? Is it the one that involves a Python package + dask?

If you run xgboost on more than one machine, you are already using rabit. We added the feature there and are tracking the in-progress changes to the xgb layer.

https://www.slideshare.net/ChenQin1/scaling-xgboost
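
For illustration, a minimal per-worker sketch of what a rabit-backed worker launched by a dmlc tracker might look like (not from this thread; the file names, parameters, and use of the 0.90-era `xgboost.rabit` Python binding are assumptions, and later releases moved this API):

```python
# Hypothetical sketch: one worker in a rabit ring. A launcher such as
# dmlc-core's tracker is assumed to have set the DMLC_* environment
# variables before this script starts. File names are placeholders.
import xgboost as xgb

xgb.rabit.init()  # connect to the tracker using the DMLC_* environment variables
rank = xgb.rabit.get_rank()

# Each worker trains on its own data shard; rabit allreduces the gradient
# statistics so all workers build the same trees.
dtrain = xgb.DMatrix("train.part-%d.libsvm" % rank)
bst = xgb.train({"objective": "binary:logistic", "tree_method": "hist"},
                dtrain, num_boost_round=10)

if rank == 0:
    bst.save_model("model.bin")  # the boosters are identical, save one copy
xgb.rabit.finalize()
```

In this picture, "failed worker retry" (the subject of this epic) would mean a crashed worker can rejoin the ring and resume from the latest state instead of failing the whole job.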

@trams
Contributor

trams commented Aug 23, 2019

@chenqin, I understand that xgboost uses rabit as its AllReduce implementation. I still don't really know what native distributed XGB training is.
Does xgboost-spark run native distributed XGB training? If yes, then why do we add "native" here? What other distributed XGB training do we have?

I am new to the project and I am still discovering different ways to use xgboost, so this is an honest question.

@chenqin
Contributor Author

chenqin commented Aug 23, 2019

> @chenqin, I understand that xgboost uses rabit as its AllReduce implementation. I still don't really know what native distributed XGB training is.
> Does xgboost-spark run native distributed XGB training? If yes, then why do we add "native" here? What other distributed XGB training do we have?
>
> I am new to the project and I am still discovering different ways to use xgboost, so this is an honest question.

In some use cases where users don't have complex feature-generation needs, they can launch the native xgboost worker (C++) without a data processing framework: https://github.com/kubeflow/xgboost-operator

@chenqin
Contributor Author

chenqin commented Aug 29, 2019

From the description in that thread: yes, dask-xgboost leverages rabit as well.
#2032
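
To make that concrete, a minimal sketch of dask-based distributed training could look like the following (this uses the built-in `xgboost.dask` interface rather than the external dask-xgboost package discussed above; the cluster size and synthetic data are assumptions):

```python
# Hypothetical sketch: training over a local dask cluster. xgboost sets up a
# rabit ring among the dask workers under the hood, so each worker holds a
# partition of the data while the gradient statistics are allreduced.
from dask.distributed import Client, LocalCluster
import dask.array as da
import xgboost as xgb

client = Client(LocalCluster(n_workers=2))

# Synthetic, chunked data standing in for a real distributed dataset.
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = (da.random.random(100_000, chunks=10_000) > 0.5).astype("int")

dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "tree_method": "hist"},
    dtrain,
    num_boost_round=10,
)
booster = output["booster"]  # the trained model
```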

@chenqin changed the title from "Allow failed worker retry in distributed training" to "[EPIC] Allow failed worker retry in distributed training" on Sep 19, 2019
@hcho3
Collaborator

hcho3 commented Sep 19, 2019

@chenqin It's epic :)

@nateagr
Contributor

nateagr commented Dec 11, 2019

Hi there! Any update on this epic? Thanks.

@chenqin
Contributor Author

chenqin commented Dec 14, 2019

> Hi there! Any update on this epic? Thanks.

Yes, it's in progress (slowed by the holiday season). The final piece, a patch to XGBoost-Spark, is still missing; it will come after the currently open PR lands.

@trivialfis
Member

I don't think this is possible in the short term. Let's stick with the fail-all strategy for now. We (@hcho3 and I) are thinking about redesigning RABIT from scratch.
