[pyspark] Allow to avoid repartition #10408
Conversation
python-package/xgboost/spark/core.py
Outdated
```python
assert dataset._sc._jvm is not None
# Render the extended explain string so the optimized logical plan can be inspected.
query_plan = dataset._sc._jvm.PythonSQLUtils.explainString(
    dataset._jdf.queryExecution(), "extended"
)
# Locate the start of the optimized logical plan section.
start = query_plan.index("== Optimized Logical Plan ==")
start += len("== Optimized Logical Plan ==") + 1
# query_plan[start : start + len("Repartition")] == "Repartition"
```
Please remove the commented code.
Done.
doc/tutorials/spark_estimator.rst
Outdated
XGBoost needs to repartition the input dataset to num_workers partitions to ensure there will be num_workers training tasks running at the same time, but repartition is a costly operation. To avoid the repartition, users can set ``spark.sql.files.maxPartitionNum`` and ``spark.sql.files.minPartitionNum`` to num_workers.
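For reference, a minimal sketch of how a user could apply these settings, assuming Spark 3.5+ (where `spark.sql.files.maxPartitionNum` is available); the path, worker count, and label column are hypothetical placeholders:

```python
from pyspark.sql import SparkSession
from xgboost.spark import SparkXGBClassifier

num_workers = 4  # hypothetical worker count for this sketch

# Ask Spark to produce exactly num_workers file partitions so that
# xgboost pyspark can skip its internal repartition.
spark = (
    SparkSession.builder
    .config("spark.sql.files.maxPartitionNum", str(num_workers))
    .config("spark.sql.files.minPartitionNum", str(num_workers))
    .getOrCreate()
)

df = spark.read.parquet("/path/to/train.parquet")  # placeholder path
classifier = SparkXGBClassifier(num_workers=num_workers, label_col="label")
model = classifier.fit(df)
```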
Could you please elaborate on when a user might want to force repartitioning of the dataset as well?
Done.
Co-authored-by: Bobby Wang <[email protected]>
The current xgboost pyspark skips repartitioning the dataset only when the last operator of the input dataset is a repartition and the repartitioned number equals num_workers. Overall, this looks good. But we should also keep in mind that repartition is really expensive, especially in the GPU case, where the repartition can dominate most of the training time.
There is a real xgboost use case where data is read directly from files with Spark and then fit into the xgboost estimator. For this kind of case, we can make the partition number equal to num_workers by playing with some Spark configurations, and the data partitions should already be well balanced by Spark. So I think it's safe for xgboost to skip the repartition internally, which can really improve the whole xgboost end-to-end time.
On the other hand, xgboost already supports force_repartition, so if users really would like to enable the repartition, they can set force_repartition to true.
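As a small illustration of that last point, a hedged sketch of opting back into the shuffle with `force_repartition`; the worker count and label column are placeholders:

```python
from xgboost.spark import SparkXGBClassifier

# Force the repartition even when the input already has num_workers
# partitions, e.g. if the existing partitions might be skewed.
classifier = SparkXGBClassifier(
    num_workers=4,
    force_repartition=True,
    label_col="label",
)
model = classifier.fit(train_df)  # train_df: a pre-loaded Spark DataFrame
```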