Rabit.init() acts like a barrier: it waits for all other workers to synchronize. If one worker dies before calling Rabit.init(), the remaining workers hang.
During training, if one worker dies, it calls Rabit.finalize(), while the other workers may still be waiting on allreduce, so they hang.
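The hang can be pictured with a plain JVM barrier. The sketch below is only an analogy built on java.util.concurrent.CyclicBarrier; it does not use Rabit or Spark, and the class and method names are hypothetical. A barrier is sized for all workers, but one of them never arrives. A timeout is added so the demo terminates; without it the waiting threads would block indefinitely.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class: an analogy for the hang, not Rabit's implementation.
class BarrierHangDemo {
    // Sizes a barrier for `parties` workers but starts only `arriving` of them.
    // Returns how many of the started workers gave up waiting
    // (either by timing out or because the barrier was broken by a timeout).
    static int waitersThatGaveUp(int parties, int arriving, long timeoutMillis) {
        CyclicBarrier barrier = new CyclicBarrier(parties);
        AtomicInteger gaveUp = new AtomicInteger();
        Thread[] threads = new Thread[arriving];
        for (int i = 0; i < arriving; i++) {
            threads[i] = new Thread(() -> {
                try {
                    // Stand-in for a blocking synchronization point.
                    barrier.await(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | BrokenBarrierException e) {
                    gaveUp.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return gaveUp.get();
    }
}
```

With three parties and only two arriving workers, both waiters give up once the timeout expires; when all three arrive, none do.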
How does xgboost resolve this issue today? It kills the SparkContext, which is a really bad user experience. Moreover, killing the SparkContext depends on a SparkListener. In some cases xgboost does not receive the notification even though a task has failed (I don't know why), so the killing thread is never started and the job hangs forever.
What xgboost can gain from barrier execution mode:
The xgboost tasks can be launched all at once.
Previously, if slots were available, the task scheduler would offer resources to some (not all) xgboost tasks, so some tasks could start running early while others were still waiting for resources.
The barrier stage is aborted if one task fails.
One feature of a barrier stage is that if one task fails, Spark's DAG/TaskScheduler kills all other running tasks, and this does not depend on any Spark listener. With this, xgboost will no longer hang forever. At the same time, we can get rid of the xgboost SparkContext killer.
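The abort-on-failure behavior described above can be sketched with a plain thread pool. This is an analogy only: it uses java.util.concurrent, not Spark's scheduler, and the class and method names are hypothetical. The coordinator itself observes the first failure and interrupts every remaining task, so no separate listener is involved.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class: mimics "one task fails, all others are killed".
class AbortAllDemo {
    // Runs `tasks` tasks; task 0 fails at once, the others block.
    // Returns how many blocked tasks were interrupted after the failure.
    static int tasksKilledAfterFailure(int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        ExecutorCompletionService<Void> done = new ExecutorCompletionService<>(pool);
        AtomicInteger killed = new AtomicInteger();

        // The failing worker.
        done.submit(() -> {
            throw new IllegalStateException("worker 0 died");
        });
        // The remaining workers block, as if waiting on a collective operation.
        for (int i = 1; i < tasks; i++) {
            done.submit(() -> {
                try {
                    new CountDownLatch(1).await(); // blocks until interrupted
                } catch (InterruptedException e) {
                    killed.incrementAndGet();
                }
                return null;
            });
        }

        try {
            // Only the failed task can complete, so it is the first result.
            done.take().get();
        } catch (ExecutionException e) {
            pool.shutdownNow(); // interrupt every task that is still running
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }

        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return killed.get();
    }
}
```

In Spark itself, a barrier stage is requested through RDD.barrier(), and the scheduler takes the coordinator role shown here.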
Side effect
Spark does not support barrier mode and dynamic allocation at the same time. However, xgboost may already show unexpected behavior when dynamic allocation is enabled, see #5625 (comment). So I think this limitation is acceptable.
wbo4958 changed the title from "[FEA] Add barrier execution mode for support" to "[FEA] [JVM-Packages] Add barrier execution mode for support" on Apr 22, 2022
This FEA is based on #5625 and #4793.
Why we need barrier execution mode
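The motivation is the Rabit hanging issue: a worker that dies before Rabit.init(), or during training, can leave the other workers waiting forever.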