[pyspark] Allow to avoid repartition #10408
Conversation
python-package/xgboost/spark/core.py
Outdated
```python
assert dataset._sc._jvm is not None
# Render the extended explain string so the optimized logical plan can be inspected.
query_plan = dataset._sc._jvm.PythonSQLUtils.explainString(
    dataset._jdf.queryExecution(), "extended"
)
# Locate the start of the optimized logical plan section.
start = query_plan.index("== Optimized Logical Plan ==")
start += len("== Optimized Logical Plan ==") + 1
# query_plan[start : start + len("Repartition")] == "Repartition"
```
Please remove the commented code.
Done.
doc/tutorials/spark_estimator.rst
Outdated
XGBoost needs to repartition the input dataset to num_workers partitions to ensure there will be num_workers training tasks running at the same time, but repartition is a costly operation. To avoid the repartition, users can set ``spark.sql.files.maxPartitionNum`` and ``spark.sql.files.minPartitionNum`` to num_workers.
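For reference, a minimal sketch of how a user could apply these settings, assuming Spark 3.5+ (where `spark.sql.files.maxPartitionNum` is available); the path, worker count, and label column are hypothetical placeholders:

```python
from pyspark.sql import SparkSession
from xgboost.spark import SparkXGBClassifier

num_workers = 4  # hypothetical worker count for this sketch

# Ask Spark to produce exactly num_workers file partitions so that
# xgboost pyspark can skip its internal repartition.
spark = (
    SparkSession.builder
    .config("spark.sql.files.maxPartitionNum", str(num_workers))
    .config("spark.sql.files.minPartitionNum", str(num_workers))
    .getOrCreate()
)

df = spark.read.parquet("/path/to/train.parquet")  # placeholder path
classifier = SparkXGBClassifier(num_workers=num_workers, label_col="label")
model = classifier.fit(df)
```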
Could you please elaborate on when a user might want to force repartitioning of the dataset as well?
Done.
Co-authored-by: Bobby Wang <[email protected]>
The current xgboost pyspark skips repartitioning the dataset only when the last operator of the input dataset is a repartition and the repartitioned number equals num_workers. Overall, this looks good. But we should also keep in mind that repartition is really expensive, especially in the GPU case, where the repartition can dominate most of the training time.
There is a real xgboost use case where data is read directly from files with Spark and then fit into the xgboost estimator. For this kind of case, we can make the partition number equal to num_workers by playing with some Spark configurations, and the data partitions should already be well balanced by Spark. So I think it's safe for xgboost to skip the repartition internally, which can really improve the whole xgboost end-to-end time.
On the other hand, xgboost already supports force_repartition, so if users really would like to enable the repartition, they can set force_repartition to true.
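As a small illustration of that last point, a hedged sketch of opting back into the shuffle with `force_repartition`; the worker count and label column are placeholders:

```python
from xgboost.spark import SparkXGBClassifier

# Force the repartition even when the input already has num_workers
# partitions, e.g. if the existing partitions might be skewed.
classifier = SparkXGBClassifier(
    num_workers=4,
    force_repartition=True,
    label_col="label",
)
model = classifier.fit(train_df)  # train_df: a pre-loaded Spark DataFrame
```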