[FEA] [JVM-Packages] Add barrier execution mode for support #7835

Closed
wbo4958 opened this issue Apr 22, 2022 · 0 comments · Fixed by #7836


wbo4958 commented Apr 22, 2022

This feature request (FEA) builds on #5625 and #4793.

Why we need barrier execution mode

Rabit can hang, for example, in the following cases:

  1. Rabit.init() acts like a barrier: each worker waits there for all other workers to synchronize. If one worker dies before calling Rabit.init(), the remaining workers hang.
  2. During training, if one worker dies, it calls Rabit.finalize() while the other workers may still be waiting on an allreduce, so they hang.
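
The first case can be reproduced in miniature with plain Java. This is only a toy model under stated assumptions: `java.util.concurrent.CyclicBarrier` stands in for the rendezvous inside Rabit.init(), and the class name `RabitInitHangDemo` and its parameters are hypothetical, not part of Rabit or xgboost. One worker exits before reaching the barrier; the others only get unstuck because the sketch adds a timeout.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: CyclicBarrier stands in for the rendezvous in Rabit.init().
class RabitInitHangDemo {

    // Starts `workers` threads. The worker whose index equals `deadWorker`
    // exits before reaching the barrier (pass -1 so that no worker dies).
    // Returns how many workers gave up waiting at the barrier.
    static int run(int workers, int deadWorker, long timeoutMillis)
            throws InterruptedException {
        CyclicBarrier barrier = new CyclicBarrier(workers);
        AtomicInteger stuck = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            final int id = i;
            pool.submit(() -> {
                if (id == deadWorker) {
                    return; // simulated crash before the rendezvous
                }
                try {
                    // Without the timeout this call would block forever,
                    // because the dead worker never arrives.
                    barrier.await(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | BrokenBarrierException e) {
                    stuck.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return stuck.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("workers stuck at the barrier: " + run(4, 0, 200));
    }
}
```

With four workers and one simulated crash, every surviving worker ends up in the `catch` branch; with no crash, all of them pass the barrier.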

How xgboost currently resolves this issue: it kills the SparkContext, which is a really bad user experience. In addition, killing the SparkContext depends on a SparkListener. In some cases xgboost does not receive the notification even though a task has failed (I don't know why), so the killer thread is never started and the job hangs forever.

What xgboost gains from barrier execution mode:

  1. All xgboost tasks are launched at once.

Previously, when only some slots were free, the TaskScheduler offered resources to some (not all) of the xgboost tasks, so some tasks started early while others were still waiting for resources.

  2. The barrier stage is aborted if one task fails.

A key feature of a barrier stage is that if one task fails, the Spark DAG/TaskScheduler kills all other running tasks, and this does not depend on any Spark listener. With this, xgboost no longer hangs forever, and we can also get rid of the xgboost SparkContext killer.
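
The "one task fails, all others are killed" behaviour can be sketched as a toy model in plain Java. This is not Spark's scheduler code: the class `AbortAllOnFailureDemo` and its parameters are hypothetical, a thread pool stands in for the executors, and a long `Thread.sleep` stands in for a worker blocked on allreduce. The coordinator waits for the first task to finish and, if it failed, cancels every other task directly instead of relying on a listener callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: mimics "abort the whole stage when one task fails".
class AbortAllOnFailureDemo {

    // Runs `tasks` tasks; the task with index `failing` (0 <= failing < tasks)
    // throws once all others have started. Returns how many of the other
    // tasks were interrupted by the coordinator.
    static int run(int tasks, int failing) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        ExecutorCompletionService<Integer> done = new ExecutorCompletionService<>(pool);
        CountDownLatch othersStarted = new CountDownLatch(tasks - 1);
        AtomicInteger interrupted = new AtomicInteger();
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            final int id = i;
            futures.add(done.submit(() -> {
                if (id == failing) {
                    othersStarted.await(); // fail only after the rest are running
                    throw new IllegalStateException("task " + id + " failed");
                }
                othersStarted.countDown();
                try {
                    Thread.sleep(60_000); // stands in for blocking on allreduce
                } catch (InterruptedException e) {
                    interrupted.incrementAndGet(); // killed by the coordinator
                    throw e;
                }
                return id;
            }));
        }
        try {
            done.take().get(); // first task to finish, here the failing one
        } catch (ExecutionException e) {
            for (Future<Integer> f : futures) {
                f.cancel(true); // interrupt every task that is still running
            }
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return interrupted.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("tasks interrupted after one failure: " + run(4, 2));
    }
}
```

The point of the design is that cancellation is driven by whoever owns the task handles, so no task can be left waiting on a peer that will never arrive.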

Side effect

Spark does not support barrier execution mode and dynamic allocation at the same time. However, xgboost may already behave unexpectedly when dynamic allocation is enabled (see #5625 (comment)), so I think this limitation is acceptable.

wbo4958 changed the title from "[FEA] Add barrier execution mode for support" to "[FEA] [JVM-Packages] Add barrier execution mode for support" on Apr 22, 2022