[FEA] [JVM-Packages] Add barrier execution mode for support #7835

wbo4958 · 2022-04-22T23:10:00Z

This feature request builds on #5625 and #4793.

Why we need barrier execution mode

The main reason is the Rabit hanging issue. For example:

Rabit.init() acts like a barrier: it waits for all other workers in order to synchronize. If one worker dies before calling Rabit.init(), the remaining workers hang.
During training, if one worker fails, it calls Rabit.finalize(); the other workers may still be waiting on an allreduce, and they then hang.
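
The first case can be sketched in plain Python. This is only an analogy, not Rabit or xgboost code: `threading.Barrier` stands in for Rabit.init(), and `run_workers` and its parameters are hypothetical names chosen for illustration. The timeout exists only so that the demo terminates; without it, the surviving workers would block forever, which is the hang described above.

```python
import threading


def run_workers(n_expected, n_alive, timeout=0.5):
    """Start n_alive workers that rendezvous on a barrier sized for n_expected.

    Returns a dict mapping each live worker's rank to what it observed.
    """
    barrier = threading.Barrier(n_expected)
    outcomes = {}

    def worker(rank):
        try:
            # Stand-in for Rabit.init(): block until all expected workers arrive.
            barrier.wait(timeout=timeout)
            outcomes[rank] = "synchronized"
        except threading.BrokenBarrierError:
            # Raised in every waiting thread once any wait() times out.
            outcomes[rank] = "timed out"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_alive)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes


if __name__ == "__main__":
    print(run_workers(3, 3))  # every worker arrives, so all pass the barrier
    print(run_workers(3, 2))  # one worker "died" before the rendezvous
```

With all three workers alive, each passes the barrier. With one worker missing, the other two are released only because of the artificial timeout.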

How does xgboost currently resolve this issue? It kills the SparkContext, which is a really bad user experience. Moreover, killing the SparkContext depends on a SparkListener, and in some cases xgboost does not receive the notification even though a task has failed (I don't know why). The killing thread is then never started, and the job hangs forever.

How xgboost can benefit from barrier execution mode

The xgboost tasks can be launched at once

Previously, if only some slots were available, the TaskScheduler would offer resources to some (not all) of the xgboost tasks, so some tasks could start early while others were still waiting for resources.

The barrier stage is aborted if one task fails

One feature of a barrier stage is that if one task fails, the Spark DAGScheduler/TaskScheduler kills all other running tasks, and this does not depend on any Spark listener. With this, xgboost will no longer hang forever. At the same time, we can get rid of the xgboost SparkContext killer.
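
The abort-on-failure behaviour can be illustrated with a small stdlib-only Python sketch. It is an analogy under stated assumptions, not Spark's implementation: the main thread plays the role of the scheduler, `threading.Barrier.abort()` stands in for killing the remaining tasks, and `run_barrier_stage` and `failing_rank` are hypothetical names.

```python
import threading
from concurrent.futures import FIRST_EXCEPTION, ThreadPoolExecutor, wait


def run_barrier_stage(n_tasks, failing_rank=None):
    """Run n_tasks that synchronize on a barrier; abort them all if one fails."""
    barrier = threading.Barrier(n_tasks)

    def task(rank):
        if rank == failing_rank:
            raise RuntimeError(f"task {rank} failed before synchronizing")
        # No timeout here: without an abort this wait would never return.
        barrier.wait()
        return "finished"

    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        futures = [pool.submit(task, r) for r in range(n_tasks)]
        # "Scheduler" role: return as soon as any task raises, or when all finish.
        done, _ = wait(futures, return_when=FIRST_EXCEPTION)
        if any(f.exception() is not None for f in done):
            barrier.abort()  # releases every waiter with BrokenBarrierError
        results = []
        for f in futures:
            try:
                results.append(f.result())
            except threading.BrokenBarrierError:
                # Must be caught first: it is a subclass of RuntimeError.
                results.append("aborted")
            except RuntimeError:
                results.append("failed")
    return results


if __name__ == "__main__":
    print(run_barrier_stage(3))
    print(run_barrier_stage(3, failing_rank=1))
```

The point of the sketch is that the healthy tasks are released by whoever supervises the stage, with no listener or notification that could be missed.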

Side effect

Spark does not support barrier mode and dynamic allocation at the same time. However, xgboost may already show some unexpected behavior when dynamic allocation is enabled, see #5625 (comment), so I think this limitation is acceptable.

wbo4958 changed the title ~~[FEA] Add barrier execution mode for support~~ [FEA] [JVM-Packages] Add barrier execution mode for support Apr 22, 2022

wbo4958 mentioned this issue Apr 22, 2022

[jvm-packages] bridge the gaps between jvm package and native xgboost #7802

Closed

34 tasks

trivialfis added the feature-request label Apr 23, 2022

wbo4958 mentioned this issue Apr 24, 2022

[Breaking][jvm-packages] add barrier execution mode #7836

Merged

trivialfis closed this as completed in #7836 Apr 25, 2022

trivialfis mentioned this issue Apr 28, 2022

[jvm-packages] Spark Barrier execution mode [RFC] #4793

Closed
