[jvm-packages] add a new trainDistributed API for dataset #5950
Conversation
This PR adds a new trainDistributed API for Dataset and creates a TrainIntf trait with four different implementations (train for rank with eval, train for rank without eval, train for non-rank with eval, train for non-rank without eval). This PR does not change any logic of the training pipeline; it is essentially a code reorganization.
Hi @CodingCat @trivialfis @hcho3 @sperlingxx,

Since gpu_hist #5171 has been merged, the JVM packages can also be accelerated by GPUs in Spark 3. But looking at the whole XGBoost workflow (loading data, ETL, train), only the training step can currently be accelerated by GPUs. Training models or doing prediction is just a small portion of the whole work (especially now that training is GPU-accelerated). Most of the time is spent on data preparation (data loading, ETL), so data preparation may slow down the whole pipeline.

So the question is: can we also accelerate the ETL pipeline? As of Apache Spark 3.0, users can schedule GPU resources and replace the backend for many SQL and DataFrame operations so that they are accelerated using GPUs. It is the RAPIDS plugin / cuDF (https://github.com/NVIDIA/spark-rapids) that accelerates SQL/DataFrame. What I am trying to say is that we can leverage the RAPIDS plugin to accelerate XGBoost4J-Spark end to end, from data loading to training. The RAPIDS plugin loads data into GPU memory in columnar format and executes operations like join/filter/groupBy on GPUs, which is really fast. After ETL, the data is still stored in GPU memory, so we can build a device DMatrix in GPU memory; the XGBoost library already supports building a DMatrix through the cuDF array interface.

The RAPIDS plugin / cuDF is, as far as I know, developing rapidly, so I would not like to add it as a direct dependency. Instead, I'd like to add an XGBoost plugin mechanism to support this, split across two PRs.

The first PR (this PR) moves the RDD transforms close together, so the whole RDD transform can be replaced by the plugin. To achieve this:

1. Deprecated the trainDistributed for RDD:

```scala
/**
 * @return A tuple of the booster and the metrics used to build training summary
 */
@deprecated("use trainDistributed for Dataset instead of RDD", "1.2.0")
@throws(classOf[XGBoostError])
private[spark] def trainDistributed(
    trainingData: RDD[XGBLabeledPoint],
    params: Map[String, Any],
    hasGroup: Boolean = false,
    evalSetsMap: Map[String, RDD[XGBLabeledPoint]] = Map()):
  (Booster, Map[String, Array[Float]]) = {
```

2. Added a new trainDistributed for Dataset[_]:

```scala
/**
 * @return A tuple of the booster and the metrics used to build training summary
 */
@throws(classOf[XGBoostError])
private[spark] def trainDistributed(
    trainingData: Dataset[_],
    dsToRddParams: DataFrameToRDDParams,
    params: Map[String, Any],
    hasGroup: Boolean,
    evalData: Map[String, Dataset[_]]): (Booster, Map[String, Array[Float]]) = {
```

2.1. Added a function in trainDistributed to choose the implementation of train:

```scala
val trainImpl = chooseTrainImpl(hasGroup, xgbExecParams, evalData)
```

e.g., if an XGBoost plugin is detected, trainImpl points to the plugin implementation; otherwise it falls back to the default CPU pipeline.

2.2. Converted the Dataset to an RDD and built the Watches:

```scala
val boostersAndMetrics = trainImpl.datasetToRDD(trainingData, dsToRddParams).mapPartitions(
  iter => {
    val watches = trainImpl.buildWatches(iter)
    buildDistributedBooster(watches, xgbExecParams, rabitEnv,
      xgbExecParams.obj, xgbExecParams.eval, prevBooster)
  }).cache()
```

The code above is the main framework: trainingData is a Dataset[_], and the main framework does not care how trainImpl is implemented, whether by a plugin or by the default CPU implementation; it treats everything as a plugin. So both the plugin and the CPU pipeline should implement datasetToRDD and buildWatches. BTW, this PR mainly re-architects the CPU pipeline; it does not change the original logic.

The second PR should handle plugin detection and replacement. That should be simple.

Talk is cheap, please check the code. Many thanks for your precious time. Any suggestions and comments are appreciated. Thanks in advance.
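To make the contract concrete, here is a rough sketch of what the TrainIntf trait could look like, inferred from the snippets above. The method names datasetToRDD, buildWatches, and cleanUpDriver come from the PR, while the type parameter, the placeholder DataFrameToRDDParams fields, and the Watches stub are assumptions for illustration only, not the PR's actual definitions:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset

// Placeholder types standing in for the PR's parameter holder and for the
// Watches wrapper around the train/eval DMatrix set in xgboost4j-spark.
// The fields here are guesses, not the real definitions.
case class DataFrameToRDDParams(featureCols: Seq[String], labelCol: String)
trait Watches

// Rough sketch of the trait both the default CPU pipeline and a GPU plugin
// would implement; T is whatever per-partition element the implementation
// produces (rows for the CPU path, columnar batches for a GPU path).
trait TrainIntf[T] {
  // Turn the input Dataset into an RDD the implementation knows how to consume.
  def datasetToRDD(trainingData: Dataset[_], params: DataFrameToRDDParams): RDD[T]

  // Build the Watches (train/eval DMatrix set) inside a partition.
  def buildWatches(iter: Iterator[T]): Watches

  // Release any driver-side resources after training finishes.
  def cleanUpDriver(): Unit
}
```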
Why does this have to be done in XGBoost/XGBoost.scala? XGBoost.scala exposes an RDD-based API for XGBoostClassifier/Regressor to use, not for end users.
You can use various approaches to accelerate the construction of the DataFrame, pass the resulting DataFrame to XGBoostClassifier/Regressor, and then start training with GPU-HIST.
I don't understand why the data preprocessing should leak into XGBoost.scala and even XGBoost.
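For context, a minimal sketch of the flow described here, assuming the data file, column names, and parameter values below are hypothetical placeholders; only the DataFrame handed to XGBoostClassifier matters, however it was produced:

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// The training DataFrame can come from any pipeline, GPU-accelerated or not;
// "train.parquet" and the column names are illustrative assumptions.
val trainDf = spark.read.parquet("train.parquet")

val classifier = new XGBoostClassifier(Map(
  "tree_method" -> "gpu_hist",   // train with the GPU histogram algorithm
  "num_round"   -> 100,
  "num_workers" -> 2
))
  .setFeaturesCol("features")
  .setLabelCol("label")

val model = classifier.fit(trainDf)
```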
Hi @CodingCat, really thx for the review.
Yeah, it seems the RDD-based API of XGBoost.scala is only for RDD[XGBLabeledPoint], plus some RDD transforms that are also based on XGBLabeledPoint, and it finally builds a booster based on XGBLabeledPoint. So you can see XGBoost.scala is coupled with XGBLabeledPoint. The plugin hopes XGBoost.scala can focus on the role of building the DMatrix, passing the DMatrix to the XGBoost library to train, and finally returning the model. As to how to build the DMatrix, different plugins may have different implementations. For example, the CPU pipeline uses XGDMatrixCreateFromDataIter, which builds the DMatrix from an iterator, while the GPU pipeline may need XGBoosterPredictFromArrayInterfaceColumns, which is based on the array-interface JSON format, not LabeledPoint any more.
Consider the situation where we use spark-rapids (a Spark plugin that can accelerate most SQL/DataFrame operators) to implement the plugin: it loads the data into GPU memory in columnar format from the beginning and then performs operations like filter/agg/join in GPU memory. If we still use RDD[XGBLabeledPoint], we need to convert the columnar data to rows and construct LabeledPoints, which copies the data from GPU to CPU; then we build the DMatrix on the CPU and finally train on the GPU. But if we use the plugin way with XGBoosterPredictFromArrayInterfaceColumns to build the DMatrix, the data never falls back to the CPU and the DMatrix building also happens on the GPU, which is really fast.
@wbo4958 thanks for the explanation! Now I understand why you chose to have this PR. Regarding your last comparison:
If we use RDD[XGBLabeledPoint], the SQL/DataFrame work is not necessarily on the CPU. As I said, you can build the DataFrame with GPUs, and then only the DMatrix construction would be on the CPU (though I agree that there is an extra memory-copy overhead).
Why don't we just add a new API accepting RDD[DMatrix] in XGBoost.scala? We can do some type checking in XGBoost.scala: if an RDD[DMatrix] is passed in, we can skip all the transformations in the current implementation and go directly to the training phase. This way we leave complete flexibility to different hardware vendors to provide their own implementations of the data transformers; for XGBoost, we only care about the DMatrix.
In XGBoostClassifier/XGBoostRegressor, it seems we can also accept Dataset[DMatrix] and skip all the logic that currently exists just for the CPU path?
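A minimal sketch of the RDD[DMatrix] entry-point idea, assuming a hypothetical method name trainDistributedFromDMatrix and one pre-built DMatrix per partition; Rabit setup/teardown and booster collection are omitted, so this only illustrates how a DMatrix-based path could skip the LabeledPoint transformations:

```scala
import ml.dmlc.xgboost4j.scala.{Booster, DMatrix, XGBoost => SXGBoost}
import org.apache.spark.rdd.RDD

// Hypothetical sketch: train directly from an RDD[DMatrix], skipping all
// LabeledPoint-based transformations. Each partition is expected to carry
// exactly one DMatrix pre-built by a vendor-specific pipeline.
private[spark] def trainDistributedFromDMatrix(
    trainingData: RDD[DMatrix],
    params: Map[String, Any],
    numRounds: Int): RDD[Booster] = {
  trainingData.mapPartitions { iter =>
    iter.map { dmatrix =>
      // Rabit init/shutdown and tracker wiring omitted for brevity.
      SXGBoost.train(dmatrix, params, numRounds)
    }
  }
}
```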
```scala
    trainImpl.cleanUpDriver()
  }
}
```
A lot of the code here is just copy-pasted from the original method?
At the very least we should provide some abstraction to avoid duplicated code.
After checking the PR, I think there is some lack of understanding of the architecture of XGBoost-Spark. I would like to highlight the design principle of XGBoost.scala: this file should be kept very low level; it integrates Spark and XGBoost's JVM binding through their lowest-level interfaces. Many years ago, when we started this project, that was RDD[XGBoostLabeledPoint], which was the only option at the moment. Now that new hardware is surging up, this level of abstraction appears incomplete. Following this path, we should either extend XGBoostLabeledPoint to cover device memory, which doesn't seem to be a good solution for many reasons, or provide integration with RDD[DMatrix], since DMatrix is the unified interface for interacting with XGBoost data. In the current PR, Dataset is a high-level abstraction of Spark's distributed structured data (higher level than RDD), which means placing more restrictions on how this file can be used.
Hi @CodingCat, many thx for your review in your spare time.
Thanks for sharing the original design principle; the design seems quite reasonable.
For your two concerns, I have my own explanation.
If XGBoost.scala behaves like a bridge, then the main framework of the bridge is converting the Dataset to an RDD, building the DMatrix from the RDD, and finally training. The bridge (XGBoost.scala) doesn't care how each step of that framework is implemented; it only cares that the framework works. That said, RDD[DMatrix] is a brilliant idea, and I have an initial PR for an RDD[DMatrix] implementation at #5972. There may be one issue: when building the DMatrix, the XGBoost native code may need to do some synchronization between different workers (right now the sync is not enabled, but I suppose it will be enabled soon; @trivialfis please correct me). So the DMatrix building should be put inside the Rabit environment, and in that case it seems we can't move the DMatrix building out of the Rabit environment.
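To make the constraint concrete, a minimal sketch assuming a hypothetical per-partition helper, a rabitEnv map handed out by the Rabit tracker, and an iterator of labeled points; the only point is that the DMatrix is constructed between Rabit.init and Rabit.shutdown:

```scala
import ml.dmlc.xgboost4j.java.Rabit
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost => SXGBoost}
import ml.dmlc.xgboost4j.LabeledPoint

import scala.collection.JavaConverters._

// Hypothetical per-partition training function: if DMatrix construction needs
// cross-worker synchronization, it has to happen after Rabit.init and before
// Rabit.shutdown, i.e. it cannot be pulled out of the Rabit environment.
def trainInPartition(
    rabitEnv: Map[String, String],          // assumed: provided by the Rabit tracker
    labeledPoints: Iterator[LabeledPoint],
    params: Map[String, Any],
    numRounds: Int) = {
  Rabit.init(rabitEnv.asJava)
  try {
    // DMatrix construction inside the Rabit env, so any sync can take place.
    val trainMat = new DMatrix(labeledPoints)
    SXGBoost.train(trainMat, params, numRounds)
  } finally {
    Rabit.shutdown()
  }
}
```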
This can be closed since we are in favor of #5972?
Closing as the latest contribution is at #5972.