
trained distributed xgboost model is much bigger than python single node xgboost model with same hyper parameters #6044

Closed
dding3 opened this issue Aug 21, 2020 · 7 comments
dding3 commented Aug 21, 2020

Trained a distributed XGBoostRegressor with XGBoost4J-Spark as:

val xgbRf0 = new XGBoostRegressor()
xgbRf0.setNumRound(500)
xgbRf0.setMaxDepth(50)
xgbRf0.setNthread(20)
xgbRf0.setTreeMethod("hist")
xgbRf0.setSeed(2) 
xgbRf0.setEta(0.1)
xgbRf0.setMinChildWeight(1)
xgbRf0.setSubsample(0.8)
xgbRf0.setColsampleBytree(0.8)
xgbRf0.setGamma(0)
xgbRf0.setAlpha(0)
xgbRf0.setLambda(1)
xgbRf0.setNumWorkers(3)

Also trained a single-node Python XGBRegressor as:

    from xgboost import XGBRegressor

    xgb_rf0 = XGBRegressor(n_estimators=500, max_depth=50, n_jobs=-1, tree_method='hist',
                           random_state=2, learning_rate=0.1, min_child_weight=1,
                           seed=0,  # seed is a deprecated alias of random_state in the sklearn wrapper
                           subsample=0.8, colsample_bytree=0.8, gamma=0, reg_alpha=0,
                           reg_lambda=1, verbosity=0)

The two models are trained on the same training dataset, which has more than 6 million records. After saving the two models to disk, we found that the sizes differ substantially: about 2 GB for the distributed XGBoost model versus about 350 MB for the single-node Python XGBoost model.

XGBoost version is 0.9.
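For reference, the size comparison is done on the saved models; a minimal sketch of how the distributed model can be written out, continuing the Scala snippet above (the nativeBooster accessor and the path are assumptions based on XGBoost4J-Spark's API; trainDF is a placeholder):

// Fit the Spark regressor and save its native booster to disk so its
// file size can be compared against the single-node model.
val model = xgbRf0.fit(trainDF) // trainDF: the 6M-record training DataFrame
model.nativeBooster.saveModel("/tmp/distributed_model.bin") // placeholder path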

dding3 changed the title from "Saved distributed xgboost model is much bigger than python single node xgboost mode" to "trained distributed xgboost model is much bigger than python single node xgboost model with same hyper parameters" on Aug 21, 2020
trivialfis (Member) commented:

Can you dump out the trees and compare the difference? If you can use the 1.2 RC (#5970), you can also try the JSON output and compare things like the total number of trees.
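For reference, a minimal sketch of such a comparison with the XGBoost4J Scala API (the path is a placeholder, and getModelDump's exact signature may vary by version):

import ml.dmlc.xgboost4j.scala.XGBoost

// Load a saved booster and dump one string per tree; comparing dump
// lengths between the two models shows whether the tree counts differ.
val booster = XGBoost.loadModel("/tmp/distributed_model.bin") // placeholder path
val dump = booster.getModelDump() // text format by default; newer versions also accept "json"
println(s"number of trees: ${dump.length}")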

trivialfis (Member) commented Aug 21, 2020

Unrelated: I'm not sure why you would set max_depth to 50; that is up to 2^50 = 1,125,899,906,842,624 leaves. I don't think it makes sense to have something this huge for 6 million records of data.
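As a quick check of that figure (a binary tree of depth d has at most 2^d leaves):

// Maximum leaf count of a binary tree of depth d is 2^d.
for (d <- Seq(6, 10, 50)) println(s"max_depth = $d -> up to ${1L << d} leaves")
// max_depth = 50 -> up to 1125899906842624 leaves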

trivialfis (Member) commented Aug 21, 2020

I think there might be an integer overflow during training, as tree nodes are indexed by 32-bit integers...
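A back-of-the-envelope sketch of where that limit bites (a complete binary tree of depth d has 2^(d+1) - 1 nodes):

// Signed 32-bit node indices can address at most Int.MaxValue = 2^31 - 1
// nodes, which is exactly a complete binary tree of depth 30.
val depth = 31
val nodeCount = (1L << (depth + 1)) - 1
println(nodeCount > Int.MaxValue) // true: depth 31 already overflows a 32-bit index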

FelixYBW (Contributor) commented:

It's actually the same issue as #5977 and #6022. The root cause in all of them is the same question: why does distributed XGBoost generate a model so different (larger model size, lower accuracy) from the single-node XGBoost one?

trivialfis (Member) commented:

Ping @ShvetsKS @SmirnovEgorRu, would you please help take a look?

trivialfis self-assigned this on Aug 26, 2020
FelixYBW (Contributor) commented:

We have identified the root cause and solved the issue. There are two reasons why the Spark model is so different from the single-node model:

  • Spark's maxBins differs from the C++ max_bin. It's already fixed.
  • The training dataset is implicitly sorted by Spark in one operator.

A large max_depth may lead to a large model, but how deep the model actually grows is decided by the dataset itself; e.g., we eventually set max_depth to 100, but the final model's max depth is only 30~40.

dding3 (Author) commented Sep 1, 2020

Thank you for the investigation. We increased maxBins and repartitioned the DataFrame before feeding it into the distributed XGBoost model, and the model size is now on par with the single-node Python one.
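For later readers, a minimal sketch of that workaround, continuing the Scala snippet from the issue (trainDF is a placeholder, and setMaxBins availability depends on the XGBoost4J-Spark version):

// 1. Raise maxBins so the Spark side matches the core library's
//    histogram resolution (the C++ default for max_bin is 256).
xgbRf0.setMaxBins(256)

// 2. Repartition the training DataFrame so Spark's implicit sort is
//    shuffled away before the data reaches the workers.
val shuffledTrainDF = trainDF.repartition(3) // match setNumWorkers(3)
val model = xgbRf0.fit(shuffledTrainDF)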

dding3 closed this as completed on Sep 1, 2020