Difference in accuracy between distributed scala.spark.xgboost4j and single-node xgboost #5977
Comments
Are there any updates here? Maybe I should provide some additional information?
There are some floating point differences that cannot be prevented. For example, in sketching we need to merge sketches from different workers, which can introduce additional differences. The same applies to reducing histograms across workers, which generates floating point error because floating point summation is not associative. Right now we test these kinds of things with a tolerance. For a large cluster I think the floating point difference can be larger. With dask now being supported, we are adding more and more tests on it, but so far no obvious bug has been found.
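To see why cross-worker reduction alone can change results, here is a small, self-contained Scala illustration (not XGBoost code) of floating point summation being non-associative:

```scala
object FloatAssocDemo extends App {
  val a = 1e16f
  val b = -1e16f
  val c = 1.0f

  // Grouping the same three addends differently changes the result,
  // because c is absorbed by the large magnitude of b in the second case.
  println((a + b) + c) // 1.0
  println(a + (b + c)) // 0.0
}
```

When histograms are summed across a different number of workers, the grouping of additions changes in exactly this way, so small deviations are expected even with identical data and parameters.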
I did some tests; here are the results:
The parameters I used:
I'm sure 16 bins is not an optimal parameter ...
Different default parameter values across languages and modes can lead to serious problems: an unpredictable increase in model size and different accuracy, as in this case.
I compared all params. Spark doesn't support all the params the native (C++) core has. All shared params have the same default value except max_bin. Created PR 6066 to fix max_bin.
The default for the parameter max_bin is 256 in the native library but 16 in XGBoost4J-Spark, which matches the 16 bins observed above.
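Until the defaults are aligned, one workaround is to pin max_bin explicitly on both sides. A minimal sketch of the Spark side, assuming xgboost4j-spark accepts the underscore-style key in its parameter map (256 is the native default):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor

// Pin max_bin to the native default (256) instead of relying on the
// library default, so both runs build histograms over the same bins.
val alignedParams = Map(
  "objective" -> "reg:squarederror",
  "max_bin" -> 256,
  "num_round" -> 30,
  "num_workers" -> 2
)
val regressor = new XGBoostRegressor(alignedParams)
  .setFeaturesCol("features")
  .setLabelCol("label")
```

Setting the same value in the single-node run's parameter map keeps the sketch boundaries identical, removing this particular source of divergence.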
I noticed a difference in accuracy between the distributed implementation of XGBoost and the single-node one.
Moreover, the difference in accuracy grows as the number of estimators increases.
I would like to clarify whether this behavior is expected and what its root cause could be.
I've prepared a simple reproducer just to show the difference and how the accuracy changes.
Steps to reproduce
I am using the simple Boston dataset. It can be downloaded via:
And on Scala:
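A minimal sketch of the distributed run (file and column names here are placeholders; I assume the CSV has a header and the target column is named medv, as in the usual Boston data):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("xgb-repro").getOrCreate()

// "boston.csv" and "medv" are placeholders for however the dataset
// was materialized on disk.
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("boston.csv")

val assembled = new VectorAssembler()
  .setInputCols(raw.columns.filter(_ != "medv"))
  .setOutputCol("features")
  .transform(raw)
  .withColumnRenamed("medv", "label")

val Array(train, test) = assembled.randomSplit(Array(0.8, 0.2), seed = 42)

// num_round plays the role of the number of estimators.
val model = new XGBoostRegressor(Map(
  "objective" -> "reg:squarederror",
  "num_round" -> 10,
  "num_workers" -> 2
)).setFeaturesCol("features").setLabelCol("label").fit(train)

val error = new RegressionEvaluator()
  .setMetricName("rmse")
  .evaluate(model.transform(test))
println(s"distributed error = $error")
```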
If I increase the number of estimators to 30, I receive the following results:
0.022200668623300408 for distributed
0.009170220935654705 for single-node
I also tried saving the prediction results from Scala and reading them in Python for a more correct comparison. The result is the same.
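For reference, the single-node baseline can also be run from the JVM via the plain xgboost4j Scala bindings, which takes the Python/Scala boundary out of the comparison entirely. A sketch, assuming the same split was exported to LibSVM files (the file names are hypothetical):

```scala
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

// Hypothetical LibSVM exports of the same train/test split used above.
val dtrain = new DMatrix("boston_train.libsvm")
val dtest = new DMatrix("boston_test.libsvm")

val params = Map(
  "objective" -> "reg:squarederror",
  "eta" -> 0.3,
  "max_depth" -> 6
)

// Single-node training: no sketch merging or cross-worker histogram
// reduction is involved, so this serves as the baseline.
val booster = XGBoost.train(dtrain, params, round = 30)
val preds: Array[Array[Float]] = booster.predict(dtest)
```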
In a real project we have about 6000 estimators, and in that case we get a difference of more than 10 times.
Environment info
XGBoost version: 1.1.0
git clone --recursive https://github.com/dmlc/xgboost
git checkout release_1.1.0
cd xgboost/jvm-packages/
mvn clean -DskipTests install package