Difference in accuracy between distributed scala.spark.xgboost4j and single-node xgboost #5977

Closed
PivovarA opened this issue Aug 4, 2020 · 7 comments

@PivovarA

PivovarA commented Aug 4, 2020

I noticed a difference in accuracy between the distributed implementation of XGBoost and the single-node one.
Moreover, the difference in accuracy grows as the number of estimators increases.
I would like to clarify whether this behavior is expected and what its root cause could be.
I've prepared a simple reproducer just to show the difference and how the accuracy changes.

Steps to reproduce

>>> import pandas as pd
>>> pd.read_csv("/Users/aleksandr/Documents/xgboost_bench/boston.csv")
     Unnamed: 0     CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  target
0             0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98    24.0
1             1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14    21.6
2             2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03    34.7
3             3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94    33.4
4             4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33    36.2
..          ...      ...   ...    ...   ...    ...    ...   ...     ...  ...    ...      ...     ...    ...     ...
501         501  0.06263   0.0  11.93   0.0  0.573  6.593  69.1  2.4786  1.0  273.0     21.0  391.99   9.67    22.4
502         502  0.04527   0.0  11.93   0.0  0.573  6.120  76.7  2.2875  1.0  273.0     21.0  396.90   9.08    20.6
503         503  0.06076   0.0  11.93   0.0  0.573  6.976  91.0  2.1675  1.0  273.0     21.0  396.90   5.64    23.9
504         504  0.10959   0.0  11.93   0.0  0.573  6.794  89.3  2.3889  1.0  273.0     21.0  393.45   6.48    22.0
505         505  0.04741   0.0  11.93   0.0  0.573  6.030  80.8  2.5050  1.0  273.0     21.0  396.90   7.88    11.9

[506 rows x 15 columns]
>>> df = pd.read_csv("boston.csv")
>>> X = df[df.columns[:-1]]
>>> y = df["target"]
>>> import xgboost as xgb
>>> from xgboost.sklearn import XGBRegressor
>>> model = XGBRegressor(n_estimators=20, max_depth=10, tree_method="hist")
>>> model.fit(X, y)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=10,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=20, n_jobs=0, num_parallel_tree=1,
             objective='reg:squarederror', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='hist',
             validate_parameters=1, verbosity=None)
>>> from sklearn.metrics import mean_squared_error
>>> prediction = model.predict(X)
>>> mean_squared_error(y, prediction)
0.08756953747034339

I am using the simple Boston dataset. It can be loaded via

from sklearn.datasets import load_boston
X, y = load_boston(return_X_y=True)

And on scala:

scala>   val spark = SparkSession.builder().getOrCreate()
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@6c1e2161

scala>   import spark.implicits._
import spark.implicits._

scala>   val df = spark.read.format("csv").option("header", "true").load("file://boston.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string, CRIM: string ... 13 more fields]

scala>   df.show()
CSV file: file://boston.csv
+---+-------+----+-----+----+-----+-----+-----+------+---+-----+-------+------+-----+------+
|_c0|   CRIM|  ZN|INDUS|CHAS|  NOX|   RM|  AGE|   DIS|RAD|  TAX|PTRATIO|     B|LSTAT|target|
+---+-------+----+-----+----+-----+-----+-----+------+---+-----+-------+------+-----+------+
|  0|0.00632|18.0| 2.31| 0.0|0.538|6.575| 65.2|  4.09|1.0|296.0|   15.3| 396.9| 4.98|  24.0|
|  1|0.02731| 0.0| 7.07| 0.0|0.469|6.421| 78.9|4.9671|2.0|242.0|   17.8| 396.9| 9.14|  21.6|
|  2|0.02729| 0.0| 7.07| 0.0|0.469|7.185| 61.1|4.9671|2.0|242.0|   17.8|392.83| 4.03|  34.7|
|  3|0.03237| 0.0| 2.18| 0.0|0.458|6.998| 45.8|6.0622|3.0|222.0|   18.7|394.63| 2.94|  33.4|
|  4|0.06905| 0.0| 2.18| 0.0|0.458|7.147| 54.2|6.0622|3.0|222.0|   18.7| 396.9| 5.33|  36.2|
|  5|0.02985| 0.0| 2.18| 0.0|0.458| 6.43| 58.7|6.0622|3.0|222.0|   18.7|394.12| 5.21|  28.7|
|  6|0.08829|12.5| 7.87| 0.0|0.524|6.012| 66.6|5.5605|5.0|311.0|   15.2| 395.6|12.43|  22.9|
|  7|0.14455|12.5| 7.87| 0.0|0.524|6.172| 96.1|5.9505|5.0|311.0|   15.2| 396.9|19.15|  27.1|
|  8|0.21124|12.5| 7.87| 0.0|0.524|5.631|100.0|6.0821|5.0|311.0|   15.2|386.63|29.93|  16.5|
|  9|0.17004|12.5| 7.87| 0.0|0.524|6.004| 85.9|6.5921|5.0|311.0|   15.2|386.71| 17.1|  18.9|
| 10|0.22489|12.5| 7.87| 0.0|0.524|6.377| 94.3|6.3467|5.0|311.0|   15.2|392.52|20.45|  15.0|
| 11|0.11747|12.5| 7.87| 0.0|0.524|6.009| 82.9|6.2267|5.0|311.0|   15.2| 396.9|13.27|  18.9|
| 12|0.09378|12.5| 7.87| 0.0|0.524|5.889| 39.0|5.4509|5.0|311.0|   15.2| 390.5|15.71|  21.7|
| 13|0.62976| 0.0| 8.14| 0.0|0.538|5.949| 61.8|4.7075|4.0|307.0|   21.0| 396.9| 8.26|  20.4|
| 14|0.63796| 0.0| 8.14| 0.0|0.538|6.096| 84.5|4.4619|4.0|307.0|   21.0|380.02|10.26|  18.2|
| 15|0.62739| 0.0| 8.14| 0.0|0.538|5.834| 56.5|4.4986|4.0|307.0|   21.0|395.62| 8.47|  19.9|
| 16|1.05393| 0.0| 8.14| 0.0|0.538|5.935| 29.3|4.4986|4.0|307.0|   21.0|386.85| 6.58|  23.1|
| 17| 0.7842| 0.0| 8.14| 0.0|0.538| 5.99| 81.7|4.2579|4.0|307.0|   21.0|386.75|14.67|  17.5|
| 18|0.80271| 0.0| 8.14| 0.0|0.538|5.456| 36.6|3.7965|4.0|307.0|   21.0|288.99|11.69|  20.2|
| 19| 0.7258| 0.0| 8.14| 0.0|0.538|5.727| 69.5|3.7965|4.0|307.0|   21.0|390.95|11.28|  18.2|
+---+-------+----+-----+----+-----+-----+-----+------+---+-----+-------+------+-----+------+
only showing top 20 rows

scala>   val result = df.select(df.columns.map(c=> col(c).cast(DoubleType)): _*).drop("_c0")
result: org.apache.spark.sql.DataFrame = [CRIM: double, ZN: double ... 12 more fields]

scala>   val nFeatures = result.columns.size - 1
nFeatures: Int = 13

scala>   val vectorAssembler = new VectorAssembler().setInputCols(result.columns.take(nFeatures)).setOutputCol("features")
vectorAssembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_98e500861f3f

scala>   val xgbInput = vectorAssembler.transform(result).select("features", result.columns.last)
xgbInput: org.apache.spark.sql.DataFrame = [features: vector, target: double]

scala> println(xgbInput.select("features").first)
[[0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98]]

scala>   val xgbParam = Map(
     |     "tree_method" -> "hist",
     |      "max_depth" -> 10,
     |     "num_round" -> 20,
     |     "num_workers" -> 1
     |   )
xgbParam: scala.collection.immutable.Map[String,Any] = Map(tree_method -> hist, max_depth -> 10, num_round -> 20, num_workers -> 1)

scala> val xgbRegressor = new XGBoostRegressor(xgbParam).setFeaturesCol("features").setLabelCol(result.columns.last)
xgbRegressor: ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor = xgbr_1a5c56ecd942

scala> val xgbRegressionModel = xgbRegressor.fit(xgbInput)
xgbRegressionModel: ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel = xgbr_1a5c56ecd942

scala> val res = xgbRegressionModel.transform(xgbInput)
res: org.apache.spark.sql.DataFrame = [features: vector, target: double ... 1 more field]

scala> val res_rdd = res.select("target", "prediction").rdd
res_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[41] at rdd at <console>:41

scala> val valuesAndPreds = res_rdd.map { point =>
     | val array = point.toSeq.toArray.map(_.asInstanceOf[Double])
     | (array(1), array(0))
     | }
valuesAndPreds: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[42] at map at <console>:41

scala> import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.mllib.evaluation.RegressionMetrics

scala> val metrics = new RegressionMetrics(valuesAndPreds)
metrics: org.apache.spark.mllib.evaluation.RegressionMetrics = org.apache.spark.mllib.evaluation.RegressionMetrics@9992404

scala> metrics.meanSquaredError
res3: Double = 0.11138415118383496

If I increase the number of estimators to 30, I receive the following results:
0.022200668623300408 for distributed
0.009170220935654705 for single-node

I also tried saving the prediction result from Scala and reading it in Python for a more accurate comparison. The result is the same.

In a real project we have about 6000 estimators, and in that case the difference is more than 10x.

Environment info

XGBoost version: 1.1.0
git clone --recursive https://github.com/dmlc/xgboost
git checkout release_1.1.0
cd xgboost/jvm-packages/
mvn clean -DskipTests install package

@PivovarA
Author

Are there any updates here? Maybe I should provide some additional information?

@trivialfis
Member

There are some floating point differences that cannot be prevented. For example, with sketching we need to merge sketches from different workers, which can introduce additional differences. The same goes for reducing histograms across workers, which generates floating point error because floating point summation is not associative. Right now we test these kinds of things with a tolerance. For a large cluster I think the floating point difference can be larger. With Dask now being supported we are adding more and more tests for it, but so far no obvious bug has been found.
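
To see the non-associativity in isolation, here is a toy illustration (nothing xgboost-specific): summing the same values in a different order already changes the result, and merging per-worker histograms reorders the summation in exactly this way.

// Toy example: IEEE 754 addition is not associative, so the reduction order matters.
val vals = Array(1e16, 1.0, -1e16, 1.0)
println(vals.sum)                          // 1.0 -- the 1.0 added to 1e16 is rounded away
println(Array(1e16, -1e16, 1.0, 1.0).sum)  // 2.0 -- same values, different order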

@FelixYBW
Contributor

FelixYBW commented Aug 27, 2020

I did some tests; here are the results:

  • The default values of max_bin and tree_method are different: single node is 256 and "tree", Spark is 16 and "auto". @trivialfis @CodingCat are they on purpose? (A sketch of setting max_bin explicitly on the Spark side follows at the end of this comment.)
  • Using the same parameters and making sure the dataset row order is the same, single node, single-rank Spark, and single-rank dmlc_submit generate exactly the same model. Accuracy is 0.08576235795386919 in all cases.
  • With two ranks, how the dataset is split impacts the accuracy a lot. How the data is ordered also has an impact. In my tests I got:
    -- 0.11525561936939305 (253 rows and 253 rows, random order)
    -- 0.09395430881665474 (253 and 253 rows, ordered by index)
    -- 0.10249318660825679 (252 rows and 254 rows)
    -- 0.08658100097273445 (254 rows and 252 rows)
  • Using 2 ranks, I still can't get exactly the same model between dmlc_submit and Spark. I made sure the input DMatrix is the same, so it looks like some configs are still different.
    -- Spark: 0.08658100097273445
    -- dmlc-submit: 0.07866148830213669

The parameters I used:
param = { 'max_depth': 10,
'tree_method': 'hist',
'n_jobs': 1,
'max_bin':256,
'verbosity': 1,
'skip_drop': 0,
'normalize_type':'tree'
}
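
Until the defaults are aligned, max_bin also has to be set explicitly on the Spark side to match the single-node value. A minimal sketch based on the reproducer above, assuming the map-style constructor accepts max_bin like the other keys:

// Sketch: mirror the single-node histogram resolution on the Spark side.
val xgbParam = Map(
  "tree_method" -> "hist",
  "max_depth" -> 10,
  "max_bin" -> 256,   // Spark defaulted to 16; single node defaults to 256
  "num_round" -> 20,
  "num_workers" -> 1
)
val xgbRegressor = new XGBoostRegressor(xgbParam)
  .setFeaturesCol("features")
  .setLabelCol("target")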

@trivialfis
Member

I'm sure 16 bins is not an optimal parameter ...

@PivovarA
Author

Different default parameter values between languages and modes can lead to serious problems: an unpredictable increase in model size and different accuracy, as in this case, for example.
Are there any good reasons why these defaults cannot be aligned?

@FelixYBW
Contributor

I compared all the params. Spark doesn't support every param the C++ core has, but all the shared ones have the same default value except max_bin.
tree_method defaults to auto in both cases.

Created PR 6066 to fix max_bin.

@hcho3
Collaborator

hcho3 commented Sep 8, 2020

The default for the parameter maxBins was revised in Spark to be consistent with C++: #6066

@hcho3 hcho3 closed this as completed Sep 8, 2020