
[pyspark] Allow to avoid repartition #10408

Merged · 4 commits · Jun 11, 2024
Conversation

@wbo4958 (Contributor) commented Jun 11, 2024

Currently, XGBoost PySpark skips repartitioning the dataset only when the last operator of the input dataset is a repartition and the repartitioned number equals num_workers. Overall this looks good, but we should keep in mind that repartition is really expensive, especially in the GPU case, where it can dominate most of the training time.

There is a real XGBoost use case that reads data from files directly with Spark and then fits it into the XGBoost estimator. For this kind of case, we can make the partition number equal to num_workers by tuning a few Spark configurations, and the data partitions should already be well balanced by Spark. So I think it's safe for XGBoost to skip the internal repartition, which can really improve the whole end-to-end XGBoost time.

On the other hand, XGBoost already supports force_repartition, so if the user really would like to enable repartitioning, they can set force_repartition to true.
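The skip/force decision described above can be sketched as a small predicate. This is a hypothetical helper for illustration, not the actual XGBoost-PySpark implementation; the function name is made up:

```python
def needs_repartition(num_partitions: int, num_workers: int,
                      force_repartition: bool) -> bool:
    """Decide whether the input dataset must be repartitioned.

    Sketch of the rule discussed in this PR: repartition is forced when
    the user asks for it, and otherwise skipped whenever the input
    already has exactly num_workers partitions.
    """
    if force_repartition:
        # User explicitly opted in, e.g. because partitions are skewed.
        return True
    # Skip the costly shuffle when the partition count already matches.
    return num_partitions != num_workers

# needs_repartition(8, 8, False)  -> False (skip the shuffle)
# needs_repartition(8, 8, True)   -> True  (user forced it)
# needs_repartition(16, 8, False) -> True  (counts differ, must shuffle)
```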

Comment on lines 709 to 716
assert dataset._sc._jvm is not None
query_plan = dataset._sc._jvm.PythonSQLUtils.explainString(
dataset._jdf.queryExecution(), "extended"
)
start = query_plan.index("== Optimized Logical Plan ==")
start += len("== Optimized Logical Plan ==") + 1

query_plan[start : start + len("Repartition")] == "Repartition"
Member: Please remove the commented code.

Contributor Author: Done.

XGBoost needs to repartition to the num_workers to ensure there will be num_workers training
tasks running at the same time, but repartition is a costly operation. To avoid repartition,
users can set ``spark.sql.files.maxPartitionNum`` and ``spark.sql.files.minPartitionNum``
to num_workers.
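As a sketch of the configuration the docs above describe (num_workers = 4 is an arbitrary example value; applying the settings to a real SparkSession builder is shown only in comments so the snippet stays self-contained):

```python
# Sketch: pin Spark's file-scan partition count to num_workers so that
# XGBoost's internal repartition becomes unnecessary.
num_workers = 4  # example value, match this to your SparkXGB num_workers

spark_conf = {
    # With both bounds equal to num_workers, Spark reads the input files
    # into exactly num_workers partitions (given enough splittable files).
    "spark.sql.files.maxPartitionNum": str(num_workers),
    "spark.sql.files.minPartitionNum": str(num_workers),
}

# Applied when building the session, e.g.:
#   builder = SparkSession.builder
#   for key, value in spark_conf.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```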
Member: Could you please elaborate on when a user might want to force repartitioning of the dataset as well?

Contributor Author: Done.

@trivialfis merged commit cf0c1d0 into dmlc:master Jun 11, 2024
29 of 31 checks passed
trivialfis pushed a commit to trivialfis/xgboost that referenced this pull request Jun 12, 2024
trivialfis added a commit that referenced this pull request Jun 13, 2024
@trivialfis mentioned this pull request Jun 13, 2024