[EPIC] Allow failed worker retry in distributed training #4753

chenqin · 2019-08-08T17:25:16Z

trivialfis · 2019-08-12T15:04:18Z

Looks interesting. ;-)

trams · 2019-08-20T21:07:00Z

This looks interesting indeed. One question: "What is native distributed XGB training?" How can I use it? Is it the one which involves a python package + dask?

chenqin · 2019-08-23T17:50:20Z

> This looks interesting indeed. One question: "What is native distributed XGB training?" How can I use it? Is it the one which involves a python package + dask?

If you run xgboost on more than one machine, you are already using rabit. We added the feature and are tracking the in-progress changes to the xgb layer.

https://www.slideshare.net/ChenQin1/scaling-xgboost

trams · 2019-08-23T18:44:58Z

@chenqin, I understand that xgboost uses rabit as its AllReduce implementation. I still don't really know what native distributed XGB training is.
Does xgboost-spark run native distributed XGB training? If yes, then why do we add "native" here? What other kinds of distributed XGB training do we have?

I am new to the project and I am still discovering different ways to use xgboost so this is an honest question.

chenqin · 2019-08-23T20:05:20Z

> @chenqin, I understand that xgboost uses rabit as its AllReduce implementation. I still don't really know what native distributed XGB training is.
> Does xgboost-spark run native distributed XGB training? If yes, then why do we add "native" here? What other kinds of distributed XGB training do we have?
>
> I am new to the project and I am still discovering different ways to use xgboost so this is an honest question.

In some use cases where users don't have complex feature-generation needs, they can launch the native XGBoost worker (C++) without a data processing framework. https://github.com/kubeflow/xgboost-operator

chenqin · 2019-08-29T05:59:35Z

From the description in this thread, yes, dask-xgboost leverages rabit as well.
#2032

hcho3 · 2019-09-19T17:49:04Z

@chenqin It's epic :)

nateagr · 2019-12-11T17:06:46Z

Hi there! Any update on this epic? Thanks.

chenqin · 2019-12-14T03:20:15Z

> Hi there! Any update on this epic? Thanks.

Yes, it's progressing (slowly, given the holiday season). We are still missing the final piece, a patch to XGBoost-spark, which will follow once the currently open PR lands.

trivialfis · 2020-09-27T08:16:09Z

I don't think this is possible in the short term. Let's stick with the fail-all strategy for now. @hcho3 and I are thinking about redesigning RABIT from scratch.

chenqin mentioned this issue Aug 8, 2019

[Roadmap] XGBoost 1.0.0 Roadmap #4680

Closed

9 tasks

chenqin added a commit to chenqin/xgboost that referenced this issue Aug 29, 2019

workaround booster save/load inconsistency, leave fix to item 3 in dm…

6e1bdca

…lc#4753

chenqin mentioned this issue Sep 4, 2019

[jvm-packages] retry-able xgboost4j-spark booster #4831

Closed

chenqin changed the title ~~Allow failed worker retry in distributed training~~ [EPIC] Allow failed worker retry in distributed training Sep 19, 2019

trivialfis closed this as completed Sep 27, 2020
