[R] Best iteration index from early stopping is discarded when model is saved to disk #5209
There shouldn't be any rounding error for either binary or JSON. Are you using dart?
No, I am not:
Can you provide us with dummy data where this phenomenon occurs?
Here you go (Quick & Dirty):
On my machine,
You see that the predictions sometimes differ slightly. Can you rerun the example code on your machine and see whether it has the same problem? Some extra info about R and xgboost:
Also note that:
Thanks for the script. Just an update: I ran it with saving to both JSON and binary files:

```r
xgboost::xgb.save(fit, 'booster.json')
fit.loaded <- xgboost::xgb.load('booster.json')
xgboost::xgb.save(fit.loaded, 'booster-1.json')
```

The hash values
Why close the issue without knowing the cause?
@trivialfis Did you get True for
I'll try to reproduce it myself.
Oh, sorry. The cause I found is the prediction cache. After loading the model, prediction values come from true prediction instead of the cached values:
So the prediction cache interacts with floating-point arithmetic in a destructive way?
@hcho3 It's a problem I found while implementing the new pickling method. I believe it plays a major role here. So first reduce the number of trees down to 1000 (which is still pretty large and should be enough for a demo). Then re-construct the DMatrix before prediction to get the cache out of the way:
It will pass the identical test. With more trees, the test still shows small differences (1e-7 for 2000 trees). But do we need to produce bit-by-bit identical results, even in a multi-threaded environment?
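For reference, a minimal sketch of the re-construction step described above, assuming fit, X, Y, and dtrain from @DavorJ's script (the exact code isn't shown in this thread):

```r
# Build a fresh DMatrix handle: the booster keeps a prediction cache per
# DMatrix it has seen during training, so a new handle forces real prediction.
dtrain.fresh <- xgboost::xgb.DMatrix(data = data.matrix(X), label = Y)
pred.fresh <- predict(fit, newdata = dtrain.fresh)

# Compare against the (potentially cached) predictions on the training handle.
pred.cached <- predict(fit, newdata = dtrain)
identical(pred.cached, pred.fresh)
```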
As floating-point summation is not associative, we can make it a to-do item to provide a strong guarantee on the order of computation, if that's desired.
Actually, making a strong guarantee about ordering won't be enough (it will help a lot, but there will still be discrepancies). A floating-point value in a CPU FPU register can have higher precision than when stored back to memory (hardware implementations can use higher precision for intermediate values; see https://en.wikipedia.org/wiki/Extended_precision). My point is that when the result for 1000 trees is exactly reproducible within 32-bit float, it's unlikely to be a programming bug.
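To illustrate the associativity point concretely, even plain double-precision arithmetic in R gives different bits depending on grouping:

```r
# Floating-point addition is not associative: the grouping of operands
# changes the rounding at each step, and hence the final bits.
a <- (0.1 + 0.2) + 0.3
b <- 0.1 + (0.2 + 0.3)
identical(a, b)            # FALSE
print(a - b, digits = 17)  # roughly 1.1e-16
```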
I agree that floating-point summation is not associative. I will run the script myself and see if the difference is small enough to attribute to floating-point arithmetic. In general, I usually use
Yup. Apologies for closing without detailed notes.
@hcho3 Any update?
I haven't gotten around to it yet. Let me take a look this week.
@trivialfis I managed to reproduce the bug. I ran the provided script and got the following.

Output from @DavorJ's script:

Output from the modified script, with
So something else must be going on. I also tried running a round-trip test:

```r
xgboost::xgb.save(fit, 'booster.raw')
fit.loaded <- xgboost::xgb.load('booster.raw')
xgboost::xgb.save(fit.loaded, 'booster.raw.roundtrip')
```

and the two binary files
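A sketch of one way to verify the round trip at the byte level, using tools::md5sum from base R (the comparison method is an assumption; the thread doesn't show how the files were compared):

```r
# Identical checksums mean xgb.save() after xgb.load() reproduced the file
# exactly, so any prediction difference cannot come from the serialized bytes.
hashes <- tools::md5sum(c('booster.raw', 'booster.raw.roundtrip'))
print(hashes)
unname(hashes[1]) == unname(hashes[2])
```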
Max diff between
A smaller example that runs faster:

```r
library(xgboost)

N <- 5000
set.seed(2020)

X <- data.frame('X1' = rnorm(N), 'X2' = runif(N), 'X3' = rpois(N, lambda = 1))
Y <- with(X, X1 + X2 - X3 + X1*X2^2 - ifelse(X1 > 0, 2, X3))

params <- list(objective = 'reg:squarederror',
               max_depth = 5, eta = 0.02, subsample = 0.5,
               base_score = median(Y))

dtrain <- xgboost::xgb.DMatrix(data = data.matrix(X), label = Y)

fit <- xgboost::xgb.train(
  params = params, data = dtrain,
  watchlist = list('train' = dtrain),
  nrounds = 10000, verbose = TRUE, print_every_n = 25,
  eval_metric = 'mae',
  early_stopping_rounds = 3, maximize = FALSE
)

pred <- stats::predict(fit, newdata = dtrain)
invisible(xgboost::xgb.save(fit, 'booster.raw'))
fit.loaded <- xgboost::xgb.load('booster.raw')
invisible(xgboost::xgb.save(fit.loaded, 'booster.raw.roundtrip'))
pred.loaded <- stats::predict(fit.loaded, newdata = dtrain)

identical(pred, pred.loaded)
pred[1:10]
pred.loaded[1:10]
max(abs(pred - pred.loaded))
sqrt(mean((Y - pred)^2))
sqrt(mean((Y - pred.loaded)^2))
```

Output:
Just tried doing one extra round trip, and now the predictions do not change any more.

```r
library(xgboost)

N <- 5000
set.seed(2020)

X <- data.frame('X1' = rnorm(N), 'X2' = runif(N), 'X3' = rpois(N, lambda = 1))
Y <- with(X, X1 + X2 - X3 + X1*X2^2 - ifelse(X1 > 0, 2, X3))

params <- list(objective = 'reg:squarederror',
               max_depth = 5, eta = 0.02, subsample = 0.5,
               base_score = median(Y))

dtrain <- xgboost::xgb.DMatrix(data = data.matrix(X), label = Y)

fit <- xgboost::xgb.train(
  params = params, data = dtrain,
  watchlist = list('train' = dtrain),
  nrounds = 10000, verbose = TRUE, print_every_n = 25,
  eval_metric = 'mae',
  early_stopping_rounds = 3, maximize = FALSE
)

pred <- stats::predict(fit, newdata = dtrain)
invisible(xgboost::xgb.save(fit, 'booster.raw'))
fit.loaded <- xgboost::xgb.load('booster.raw')
invisible(xgboost::xgb.save(fit.loaded, 'booster.raw.roundtrip'))
fit.loaded2 <- xgboost::xgb.load('booster.raw.roundtrip')
pred.loaded <- stats::predict(fit.loaded, newdata = dtrain)
pred.loaded2 <- stats::predict(fit.loaded2, newdata = dtrain)

identical(pred, pred.loaded)
identical(pred.loaded, pred.loaded2)
pred[1:10]
pred.loaded[1:10]
pred.loaded2[1:10]
max(abs(pred - pred.loaded))
max(abs(pred.loaded - pred.loaded2))
sqrt(mean((Y - pred)^2))
sqrt(mean((Y - pred.loaded)^2))
sqrt(mean((Y - pred.loaded2)^2))
```

Result:

So maybe the prediction cache is indeed a problem.
I re-ran the script with prediction caching disabled:

```diff
diff --git a/src/predictor/cpu_predictor.cc b/src/predictor/cpu_predictor.cc
index ebc15128..c40309bc 100644
--- a/src/predictor/cpu_predictor.cc
+++ b/src/predictor/cpu_predictor.cc
@@ -259,7 +259,7 @@ class CPUPredictor : public Predictor {
     // delta means {size of forest} * {number of newly accumulated layers}
     uint32_t delta = end_version - beg_version;
     CHECK_LE(delta, model.trees.size());
-    predts->Update(delta);
+    //predts->Update(delta);
     CHECK(out_preds->Size() == output_groups * dmat->Info().num_row_ ||
           out_preds->Size() == dmat->Info().num_row_);
```

(Disabling prediction caching results in very slow training.)

Output:

So the prediction cache is definitely NOT the cause of this bug.
Leaf predictions diverge too:
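A minimal sketch of such a leaf comparison, assuming fit, fit.loaded, and dtrain from the earlier example; with predleaf = TRUE, predict() returns one leaf index per (row, tree), so the column count itself shows how many trees each model evaluates:

```r
# Leaf indices per (row, tree). Differing dimensions or cells mean the two
# models route samples through different trees, not mere rounding noise.
leaf        <- predict(fit,        newdata = dtrain, predleaf = TRUE)
leaf.loaded <- predict(fit.loaded, newdata = dtrain, predleaf = TRUE)
dim(leaf)
dim(leaf.loaded)
if (all(dim(leaf) == dim(leaf.loaded))) {
  sum(leaf != leaf.loaded)  # number of diverging (row, tree) cells
}
```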
Output:
Mystery solved. I identified the true cause. When the model is saved to disk, information about early stopping is discarded. In the example, XGBoost runs 6381 boosting rounds and finds the best model at 6378 rounds. The model object in memory contains 6381 trees, not 6378 trees, since no tree is removed. There is an extra field
This extra field is silently discarded when we save the model to disk. So
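In the R package, the early-stopping metadata in question can be inspected directly on the in-memory booster; a sketch (best_iteration and best_ntree_limit are the fields xgb.train sets when early_stopping_rounds is used):

```r
# Present on the in-memory model returned by xgb.train:
fit$best_iteration    # index of the best boosting round
fit$best_ntree_limit  # tree limit that predict() uses by default

# Gone after a save/load round trip, so predict() falls back to all trees:
fit.loaded <- xgboost::xgb.load('booster.raw')
is.null(fit.loaded$best_ntree_limit)  # TRUE
```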
@trivialfis I am inclined to physically remove trees. If training stopped at 6381 rounds and the best iteration was at 6378 rounds, users will expect the final model to have 6378 trees.
@hcho3, nice find! Note also the documentation of
If I understand correctly, the documentation is not entirely correct? Based on the documentation, I was expecting that prediction already used all trees. Unfortunately I had not verified this.
@DavorJ When early stopping is activated, |
@trivialfis The situation is worse on the Python side, as the following script shows:

```python
import xgboost as xgb
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:squarederror'}

bst = xgb.train(params, dtrain, 100, [(dtrain, 'train'), (dtest, 'test')],
                early_stopping_rounds=5)

x = bst.predict(dtrain, pred_leaf=True)
x2 = bst.predict(dtrain, pred_leaf=True, ntree_limit=bst.best_iteration)
print(x.shape)
print(x2.shape)

pred = bst.predict(dtrain)
pred2 = bst.predict(dtrain, ntree_limit=bst.best_iteration)
print(np.max(np.abs(pred - pred2)))
```

Output:
Users will have to remember to fetch
We have two options for a fix:
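Until a fix lands, one possible workaround on the R side is to persist the best iteration alongside the model and pass it explicitly when predicting; a sketch:

```r
# Persist the metadata that xgb.save() drops.
best <- fit$best_ntree_limit
invisible(xgboost::xgb.save(fit, 'booster.raw'))
saveRDS(best, 'booster.best_ntree_limit.rds')

# Later: reload both pieces and cap prediction at the best iteration.
fit.loaded <- xgboost::xgb.load('booster.raw')
best <- readRDS('booster.best_ntree_limit.rds')
pred <- predict(fit.loaded, newdata = dtrain, ntreelimit = best)
```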
@hcho3 I have a half-baked idea about this, which is also related to our

Background

For a short recap of the issues we have with
For a short introduction of issues with forest,

Idea

I want to add a

Further

Also, I believe this is somehow connected to multi-target trees: if we can support multi-class multi-target trees in the future, there will be multiple ways to arrange trees, like using
Also #5531.
But the idea is at quite an early stage, so I didn't have the confidence to share it; now that we are on this issue, maybe I can get some input on it.
Given the 1.1 timeline, can we expand the documentation to clarify how users need to manually capture and use this best iteration in prediction?
@trivialfis Sounds interesting, so long as we are not further complicating configuration issues by doing this. Deleting the extra trees from the model, as suggested by @hcho3, is appealing, as we don't have to deal with inconsistencies from having an actual model length and a theoretical model length at the same time.
Original issue description (@DavorJ):

These values are predicted after xgboost::xgb.train:
247367.2 258693.3 149572.2 201675.8 250493.9 292349.2 414828.0 296503.2 260851.9 190413.3

These values are predicted after xgboost::xgb.save and xgboost::xgb.load of the previous model:
247508.8 258658.2 149252.1 201692.6 250458.1 292313.4 414787.2 296462.5 260879.0 190430.1

They are close, but not the same. The differences between these two predictions range from -1317.094 to 1088.859 on a set of 25k samples. Compared with the true labels, the MAE/RMSE of the two predictions do not differ much.

So I suspect that this has to do with rounding errors during load/save, since the MAE/RMSE do not differ as much. Still, I find this strange, since storing the model in binary should not introduce rounding errors?

Anyone have a clue?

PS: Uploading and documenting the training process does not seem important to me here. I could provide details if necessary, or make a simulation with dummy data to prove the point.