[WIP] Use Spark barrier mode to ensure all XGBoost workers are launched in parallel #5625

Closed
wants to merge 5 commits

Conversation

@WeichenXu123 (Contributor) commented May 1, 2020

Currently, xgboost4j-spark uses SparkParallelismTracker to wait and check whether there are enough task slots. This approach has two issues:

  • If the Spark cluster's executors are set to auto-scale mode, this approach won't work: when there are not enough workers, it does not trigger allocation of more executors.
  • Suppose two XGBoost training jobs start in parallel and there are not enough task slots to run both. Each job may allocate part of the slots and then both get stuck in a deadlock.

Spark 2.4 introduced barrier mode, which ensures that a Spark job stage launches all of its tasks in parallel. This addresses the issues above.

In this PR, I remove SparkParallelismTracker and update the code to use Spark barrier mode.

Note: we can no longer set the XGBoost parameter timeout_request_workers per job. Instead, we have to set spark.scheduler.barrier.maxConcurrentTasksCheck.interval and spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures in the Spark cluster config. The timeout to wait for enough workers is then spark.scheduler.barrier.maxConcurrentTasksCheck.interval * spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures.
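
For reference, a minimal sketch of what the barrier-mode pattern and the timeout configs look like; this is not the code in this PR, and the app name, numbers, and placeholder training step are made up:

```scala
import org.apache.spark.BarrierTaskContext
import org.apache.spark.sql.SparkSession

object BarrierModeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("barrier-mode-sketch")
      // The effective wait-for-enough-slots timeout is interval * maxFailures
      // (here 15s * 40 = 600s); both keys are cluster-level Spark configs.
      .config("spark.scheduler.barrier.maxConcurrentTasksCheck.interval", "15s")
      .config("spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures", "40")
      .getOrCreate()

    val numWorkers = 4
    val rdd = spark.sparkContext.parallelize(0 until numWorkers, numWorkers)

    // barrier() makes this a barrier stage: either all numWorkers tasks are
    // launched together, or the stage fails once the checks are exhausted.
    val gathered = rdd.barrier().mapPartitions { iter =>
      val ctx = BarrierTaskContext.get()
      // a real implementation would start the Rabit worker / XGBoost training here
      ctx.barrier() // block until every task in the stage reaches this point
      iter
    }.collect()

    println(gathered.mkString(", "))
    spark.stop()
  }
}
```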

I will add tests soon, but it is ready for a first-pass review.

@CodingCat (Member)

Are there any battle-tested cases of barrier execution in Spark?

I don't see enough ROI for such a fundamental change... (and I don't see changes covering the original behavior, e.g., stopping the application after a single worker fails).

@WeichenXu123 (Contributor, Author) commented May 2, 2020

@CodingCat

... stopping the application after a single worker fails

In barrier mode, if one worker fails, all the other workers are killed as well and the Spark job fails.

ROI for such a fundamental change?

The main benefit is the auto-scaling case: with the old approach, if there are not enough workers, nothing triggers the allocation of more executors. Barrier mode addresses this; it triggers the Spark cluster to create more executors.

What about keeping the old behavior, but adding an option to enable the new barrier mode?

@CodingCat (Member)

If the Spark cluster's executors are set to auto-scale mode, this approach won't work: when there are not enough workers, it does not trigger allocation of more executors.

You mean dynamic allocation? Then just check whether the user sets minExecutors in dynamic allocation to be no smaller than numWorkers.

Suppose two XGBoost training jobs start in parallel and there are not enough task slots to run both. Each job may allocate part of the slots and then both get stuck in a deadlock.

We never officially supported such cases (due to some issues in the Rabit layer)... even if we did support them, working around this only takes a few lines of detection:

with static allocation, when launching a training job, always check whether it is possible to get enough resources given the configured number of executors;

with dynamic allocation, I think we can do something similar (but I really don't think we should spend too much time on dynamic allocation... at Uber, dynamic allocation leads to many difficulties for tuning).
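
A rough sketch of the pre-flight check described above; the helper name and the way it combines the conf keys are assumptions, not code from xgboost4j-spark:

```scala
import org.apache.spark.sql.SparkSession

object ResourceCheckSketch {
  // Returns true if the configured cluster can, in principle, supply numWorkers task slots.
  def hasEnoughSlots(spark: SparkSession, numWorkers: Int): Boolean = {
    val conf = spark.sparkContext.getConf
    if (conf.getBoolean("spark.dynamicAllocation.enabled", false)) {
      // dynamic allocation: require minExecutors to cover the requested workers
      conf.getInt("spark.dynamicAllocation.minExecutors", 0) >= numWorkers
    } else {
      // static allocation: compare total task slots with the requested workers
      val executors = conf.getInt("spark.executor.instances", 1)
      val coresPerExecutor = conf.getInt("spark.executor.cores", 1)
      val cpusPerTask = conf.getInt("spark.task.cpus", 1)
      executors * (coresPerExecutor / cpusPerTask) >= numWorkers
    }
  }
}
```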

@WeichenXu123 (Contributor, Author)

@CodingCat

You mean dynamic allocation? Then just check whether the user sets minExecutors in dynamic allocation to be no smaller than numWorkers.

I don't think this addresses the issue. For example, if one XGBoost job occupies the current executors' resources and a new XGBoost job is started on the same Spark cluster, Spark should dynamically allocate more executors rather than wait for task slots on the current executors to become idle.

@trivialfis (Member)

@CodingCat

due to some issues in the Rabit layer

Could you please elaborate a bit on this?

@liangz1 left a comment

Some nits

@@ -548,9 +546,6 @@ object XGBoost extends Serializable {
// Train for every ${savingRound} rounds and save the partially completed booster
val tracker = startTracker(xgbExecParams.numWorkers, xgbExecParams.trackerConf)
val (booster, metrics) = try {
val parallelismTracker = new SparkParallelismTracker(sc,
xgbExecParams.timeoutRequestWorkers,

timeoutRequestWorkers can be removed from XGBoostExecutionParams.

logger.info(s"Rabit returns with exit code $trackerReturnVal")
val (booster, metrics) = postTrackerReturnProcessing(trackerReturnVal,

L665-692 private def postTrackerReturnProcessing(... is no longer used.

@WeichenXu123 (Contributor, Author)

@CodingCat

Suppose two XGBoost training jobs start in parallel and there are not enough task slots to run both. Each job may allocate part of the slots and then both get stuck in a deadlock.

We never officially supported such cases (due to some issues in the Rabit layer)... even if we did support them, working around this only takes a few lines of detection.

There is a scenario where Spark CrossValidator is used on an XGBoost estimator: it launches XGBoost jobs in parallel. If we do not address this issue, deadlock may happen. And seemingly there is no way to work around it; "adding several lines to detect" does not work, because if two threads check the resources and both see them as available, they then start their own XGBoost jobs in parallel, and the deadlock still happens (see the sketch below).
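
To illustrate that race, a hypothetical, self-contained simulation (none of these names or numbers come from this repo): both jobs pass the resource check, and each can end up holding only part of the slots it needs.

```scala
import java.util.concurrent.atomic.AtomicInteger

object SlotRaceSketch {
  // Hypothetical setup: 6 free task slots, two jobs that each need 4 workers,
  // and a check-then-launch pattern with no coordination between the jobs.
  private val freeSlots = new AtomicInteger(6)
  private val numWorkers = 4

  private def tryAcquireOneSlot(): Boolean = {
    val n = freeSlots.get()
    n > 0 && freeSlots.compareAndSet(n, n - 1)
  }

  private def launchXGBoostJob(id: Int): Unit = {
    var held = 0
    while (held < numWorkers) {          // the job only proceeds once all its workers hold a slot
      if (tryAcquireOneSlot()) held += 1 // when both jobs race, each can end up with only 3 slots
    }                                    // and neither ever leaves this loop
    println(s"job $id started with $numWorkers workers")
  }

  def main(args: Array[String]): Unit = {
    val jobs = (1 to 2).map { id =>
      new Thread(new Runnable {
        override def run(): Unit = {
          // both threads can observe 6 >= 4 here, so the check alone is not enough
          if (freeSlots.get() >= numWorkers) launchXGBoostJob(id)
        }
      })
    }
    jobs.foreach(_.start())
    jobs.foreach(_.join()) // never returns in the interleaving described above
  }
}
```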

And what is the issue you mentioned in "due to some issues in the Rabit layer"?

What do you think?
Thanks!
