PR to allow dataframe as `validation` arg in xgboost #771
Commits:
- Preserve weights for internal validation set.
- Handle case where validation has additional columns; add error for when `y` is a vector and validation is a dataframe (fit_xy() issue).
- Merge … into xgb_validation
This PR will allow a dataframe to be passed as the `validation` arg to the xgboost engine.
Please let me know if your team is interested in merging. If so, I can add tests. Thanks! Examples below:

library(tidyverse)
library(tidymodels)
# formula interface
# Validation as dataframe
reg_fit <-
boost_tree(trees = 10, mode = "regression") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = mtcars[1:3,],
verbose = 1) %>%
fit(mpg ~ hp + disp + cyl, data = mtcars)
#> [1] validation-mae:14.428947
#> [2] validation-mae:11.075748
#> [3] validation-mae:8.496362
#> [4] validation-mae:5.707752
#> [5] validation-mae:4.324680
#> [6] validation-mae:3.336770
#> [7] validation-mae:1.835582
#> [8] validation-mae:1.294387
#> [9] validation-mae:0.872375
#> [10] validation-mae:0.673549
# formula interface
# Validation as prop
reg_fit <-
boost_tree(trees = 10, mode = "regression") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = 0.2,
verbose = 1) %>%
fit(mpg ~ ., data = mtcars)
#> [1] validation-mae:12.133571
#> [2] validation-mae:8.886966
#> [3] validation-mae:6.678021
#> [4] validation-mae:4.898591
#> [5] validation-mae:3.576521
#> [6] validation-mae:2.651404
#> [7] validation-mae:2.103294
#> [8] validation-mae:1.825942
#> [9] validation-mae:1.769561
#> [10] validation-mae:1.770711
# xy interface
# Validation as dataframe
# Correctly errors
res <- boost_tree(trees = 10, mode = "regression") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = mtcars,
verbose = 1) %>%
fit_xy(x = mtcars[,-1], y = mtcars$mpg)
#> Error in `parsnip::xgb_train()`:
#> ! `y` must be named when `validation` is a dataframe
# xy interface
# Validation as prop
reg_fit <-
boost_tree(trees = 10, mode = "regression") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = 0.2,
verbose = 1) %>%
fit_xy(x = mtcars[,-1], y = mtcars$mpg)
#> [1] validation-mae:14.356102
#> [2] validation-mae:10.397969
#> [3] validation-mae:7.662204
#> [4] validation-mae:5.727999
#> [5] validation-mae:4.343752
#> [6] validation-mae:3.546708
#> [7] validation-mae:3.135106
#> [8] validation-mae:2.820952
#> [9] validation-mae:2.541687
#> [10] validation-mae:2.407671
# workflow interface
# Validation as dataframe
car_rec <-
recipes::recipe(mpg ~ disp + hp + cyl, data = mtcars) |>
recipes::step_mutate(hp_time_disp = hp * disp) |>
recipes::update_role(cyl, new_role = 'id')
reg_fit <-
boost_tree(trees = 10, mode = "regression") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = recipes::prep(car_rec) |> recipes::bake(new_data = mtcars),
verbose = 1)
workflows::workflow(car_rec, reg_fit) |>
fit(data = mtcars)
#> [1] validation-mae:14.049658
#> [2] validation-mae:10.143387
#> [3] validation-mae:7.338691
#> [4] validation-mae:5.341355
#> [5] validation-mae:3.889570
#> [6] validation-mae:2.836564
#> [7] validation-mae:2.107599
#> [8] validation-mae:1.659913
#> [9] validation-mae:1.303094
#> [10] validation-mae:1.047100
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: boost_tree()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_mutate()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> ##### xgb.Booster
#> raw: 13.5 Kb
#> call:
#> xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0,
#> colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1,
#> subsample = 1), data = x$data, nrounds = 10, watchlist = x$watchlist,
#> verbose = 1, eval_metric = "mae", nthread = 1, objective = "reg:squarederror")
#> params (as set within xgb.train):
#> eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", eval_metric = "mae", nthread = "1", objective = "reg:squarederror", validate_parameters = "TRUE"
#> xgb.attributes:
#> niter
#> callbacks:
#> cb.print.evaluation(period = print_every_n)
#> cb.evaluation.log()
#> # of features: 3
#> niter: 10
#> nfeatures : 3
#> evaluation_log:
#> iter validation_mae
#> 1 14.049658
#> 2 10.143387
#> ---
#> 9 1.303094
#> 10 1.047100
# workflow interface
# Validation is missing predictors
# Errors correctly and shows missing predictors
car_rec <-
recipes::recipe(mpg ~ disp + hp + cyl, data = mtcars) |>
recipes::step_mutate(hp_time_disp = hp * disp) |>
recipes::update_role(cyl, new_role = 'id')
reg_fit <-
boost_tree(trees = 10, mode = "regression") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = mtcars,
verbose = 1)
workflows::workflow(car_rec, reg_fit) |>
fit(data = mtcars)
#> Error in `parsnip::xgb_train()`:
#> ! `validation` is missing column(s): `hp_time_disp`
##################### Classification
# Fit Interface
# Validation as Prop
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
verbose = 1,
validation = 0.1) |>
fit(Class ~ A + B, data = two_class_dat)
#> [1] validation-logloss:0.578083
#> [2] validation-logloss:0.510125
#> [3] validation-logloss:0.473173
#> [4] validation-logloss:0.458129
#> [5] validation-logloss:0.456636
#> [6] validation-logloss:0.453387
#> [7] validation-logloss:0.459217
#> [8] validation-logloss:0.460738
#> [9] validation-logloss:0.471640
#> [10] validation-logloss:0.478731
# Fit Interface
# Validation as dataframe
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
verbose = 1,
validation = two_class_dat[1:3,]) |>
fit(Class ~ A + B, data = two_class_dat)
#> [1] validation-logloss:0.635771
#> [2] validation-logloss:0.624781
#> [3] validation-logloss:0.627140
#> [4] validation-logloss:0.591926
#> [5] validation-logloss:0.589791
#> [6] validation-logloss:0.535597
#> [7] validation-logloss:0.526764
#> [8] validation-logloss:0.527316
#> [9] validation-logloss:0.521822
#> [10] validation-logloss:0.545214
# XY Interface
# Validation as Prop
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
verbose = 1,
validation = 0.1) |>
fit_xy(y = two_class_dat$Class, x = two_class_dat[,-3])
#> [1] validation-logloss:0.576877
#> [2] validation-logloss:0.515689
#> [3] validation-logloss:0.490827
#> [4] validation-logloss:0.478492
#> [5] validation-logloss:0.474726
#> [6] validation-logloss:0.472732
#> [7] validation-logloss:0.480608
#> [8] validation-logloss:0.487469
#> [9] validation-logloss:0.484555
#> [10] validation-logloss:0.483709
# XY Interface
# Validation as dataframe
# Correctly errors
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
verbose = 1,
validation = two_class_dat) |>
fit_xy(y = two_class_dat$Class, x = two_class_dat[,-3])
#> Error in `parsnip::xgb_train()`:
#> ! `y` must be named when `validation` is a dataframe
# Recipe Interface
# Validation as dataframe missing vars
class_rec <-
recipes::recipe(Class ~ A + B, data = two_class_dat) |>
recipes::step_mutate(a_b = A * B) |>
recipes::update_role(A, new_role = 'id')
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = two_class_dat,
verbose = 1)
workflows::workflow(class_rec, reg_fit) |>
fit(data = two_class_dat)
#> Error in `parsnip::xgb_train()`:
#> ! `validation` is missing column(s): `a_b`
# Recipe Interface
# Validation as prop
class_rec <-
recipes::recipe(Class ~ A + B, data = two_class_dat) |>
recipes::step_mutate(a_b = A * B) |>
recipes::update_role(A, new_role = 'id')
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = 0.1,
verbose = 1)
workflows::workflow(class_rec, reg_fit) |>
fit(data = two_class_dat)
#> [1] validation-mae:0.439349
#> [2] validation-mae:0.401310
#> [3] validation-mae:0.370106
#> [4] validation-mae:0.344822
#> [5] validation-mae:0.321370
#> [6] validation-mae:0.312411
#> [7] validation-mae:0.305393
#> [8] validation-mae:0.299720
#> [9] validation-mae:0.290905
#> [10] validation-mae:0.285693
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: boost_tree()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_mutate()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> ##### xgb.Booster
#> raw: 31.7 Kb
#> call:
#> xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0,
#> colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1,
#> subsample = 1), data = x$data, nrounds = 10, watchlist = x$watchlist,
#> verbose = 1, eval_metric = "mae", nthread = 1, objective = "binary:logistic")
#> params (as set within xgb.train):
#> eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", eval_metric = "mae", nthread = "1", objective = "binary:logistic", validate_parameters = "TRUE"
#> xgb.attributes:
#> niter
#> callbacks:
#> cb.print.evaluation(period = print_every_n)
#> cb.evaluation.log()
#> # of features: 2
#> niter: 10
#> nfeatures : 2
#> evaluation_log:
#> iter validation_mae
#> 1 0.4393490
#> 2 0.4013096
#> ---
#> 9 0.2909054
#> 10 0.2856933
# Recipe Interface
# Validation as baked data
class_rec <-
recipes::recipe(Class ~ A + B, data = two_class_dat) |>
recipes::step_mutate(a_b = A * B) |>
recipes::update_role(A, new_role = 'id')
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = recipes::prep(class_rec) |> recipes::bake(new_data = two_class_dat),
verbose = 1)
workflows::workflow(class_rec, reg_fit) |>
fit(data = two_class_dat)
#> [1] validation-mae:0.421717
#> [2] validation-mae:0.364818
#> [3] validation-mae:0.323719
#> [4] validation-mae:0.294142
#> [5] validation-mae:0.268857
#> [6] validation-mae:0.249749
#> [7] validation-mae:0.235352
#> [8] validation-mae:0.223877
#> [9] validation-mae:0.214782
#> [10] validation-mae:0.208277
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: boost_tree()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_mutate()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> ##### xgb.Booster
#> raw: 32.5 Kb
#> call:
#> xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0,
#> colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1,
#> subsample = 1), data = x$data, nrounds = 10, watchlist = x$watchlist,
#> verbose = 1, eval_metric = "mae", nthread = 1, objective = "binary:logistic")
#> params (as set within xgb.train):
#> eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", eval_metric = "mae", nthread = "1", objective = "binary:logistic", validate_parameters = "TRUE"
#> xgb.attributes:
#> niter
#> callbacks:
#> cb.print.evaluation(period = print_every_n)
#> cb.evaluation.log()
#> # of features: 2
#> niter: 10
#> nfeatures : 2
#> evaluation_log:
#> iter validation_mae
#> 1 0.4217165
#> 2 0.3648176
#> ---
#> 9 0.2147818
#> 10 0.2082772
########### Multiclass
# Fit Interface
# Validation as prop
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
verbose = 1,
validation = 0.1) |>
fit(species ~ bill_length_mm + bill_depth_mm, data = penguins)
#> [1] validation-mlogloss:0.756842
#> [2] validation-mlogloss:0.551547
#> [3] validation-mlogloss:0.414491
#> [4] validation-mlogloss:0.315939
#> [5] validation-mlogloss:0.248686
#> [6] validation-mlogloss:0.205213
#> [7] validation-mlogloss:0.172328
#> [8] validation-mlogloss:0.146289
#> [9] validation-mlogloss:0.119266
#> [10] validation-mlogloss:0.099206
# Fit Interface
# Validation as dataframe
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
verbose = 1,
validation = penguins) |>
fit(species ~ bill_length_mm + bill_depth_mm, data = penguins)
#> [1] validation-mlogloss:0.741214
#> [2] validation-mlogloss:0.532345
#> [3] validation-mlogloss:0.395555
#> [4] validation-mlogloss:0.301706
#> [5] validation-mlogloss:0.232278
#> [6] validation-mlogloss:0.182016
#> [7] validation-mlogloss:0.144608
#> [8] validation-mlogloss:0.116754
#> [9] validation-mlogloss:0.095399
#> [10] validation-mlogloss:0.079533
# XY interface
# Validation as prop
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
verbose = 1,
validation = 0.1) |>
fit_xy(y = penguins$species, x = penguins[,c(3,4)])
#> [1] validation-mlogloss:0.777307
#> [2] validation-mlogloss:0.583527
#> [3] validation-mlogloss:0.462445
#> [4] validation-mlogloss:0.364660
#> [5] validation-mlogloss:0.291666
#> [6] validation-mlogloss:0.238673
#> [7] validation-mlogloss:0.201141
#> [8] validation-mlogloss:0.173881
#> [9] validation-mlogloss:0.150524
#> [10] validation-mlogloss:0.130961
# XY interface
# Validation as dataframe
# Correctly errors
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
verbose = 1,
validation = penguins) |>
fit_xy(y = penguins$species, x = penguins[,c(3,4)])
#> Error in `parsnip::xgb_train()`:
#> ! `y` must be named when `validation` is a dataframe
# Recipe Interface
# Validation as dataframe missing vars
class_rec <-
recipes::recipe(species ~ bill_length_mm + bill_depth_mm, data = penguins) |>
recipes::step_mutate(b_2 = bill_length_mm * 2)
class_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
eval_metric = "mlogloss",
validation = penguins,
verbose = 1)
workflows::workflow(class_rec, class_fit) |>
fit(data = penguins)
#> Error in `parsnip::xgb_train()`:
#> ! `validation` is missing column(s): `b_2`
# Recipe Interface
# Validation as baked dataframe
class_rec <-
recipes::recipe(species ~ bill_length_mm + bill_depth_mm, data = penguins) |>
recipes::step_mutate(b_2 = bill_length_mm * 2)
class_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
eval_metric = "mlogloss",
validation = recipes::prep(class_rec) |> recipes::bake(new_data = penguins),
verbose = 1)
workflows::workflow(class_rec, class_fit) |>
fit(data = penguins)
#> [1] validation-mlogloss:0.741382
#> [2] validation-mlogloss:0.531399
#> [3] validation-mlogloss:0.394591
#> [4] validation-mlogloss:0.299946
#> [5] validation-mlogloss:0.231754
#> [6] validation-mlogloss:0.181127
#> [7] validation-mlogloss:0.143169
#> [8] validation-mlogloss:0.115443
#> [9] validation-mlogloss:0.094732
#> [10] validation-mlogloss:0.078962
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: boost_tree()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_mutate()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> ##### xgb.Booster
#> raw: 42.4 Kb
#> call:
#> xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0,
#> colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1,
#> subsample = 1), data = x$data, nrounds = 10, watchlist = x$watchlist,
#> verbose = 1, eval_metric = "mlogloss", nthread = 1, objective = "multi:softprob",
#> num_class = 3L)
#> params (as set within xgb.train):
#> eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", eval_metric = "mlogloss", nthread = "1", objective = "multi:softprob", num_class = "3", validate_parameters = "TRUE"
#> xgb.attributes:
#> niter
#> callbacks:
#> cb.print.evaluation(period = print_every_n)
#> cb.evaluation.log()
#> # of features: 3
#> niter: 10
#> nfeatures : 3
#> evaluation_log:
#> iter validation_mlogloss
#> 1 0.74138198
#> 2 0.53139859
#> ---
#> 9 0.09473197
#> 10 0.07896238
# Recipe Interface
# Validation as prop
class_rec <-
recipes::recipe(species ~ bill_length_mm + bill_depth_mm, data = penguins) |>
recipes::step_mutate(b_2 = bill_length_mm * 2)
class_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
eval_metric = "mlogloss",
validation = 0.2,
verbose = 1)
workflows::workflow(class_rec, class_fit) |>
fit(data = penguins)
#> [1] validation-mlogloss:0.749688
#> [2] validation-mlogloss:0.554280
#> [3] validation-mlogloss:0.428803
#> [4] validation-mlogloss:0.338456
#> [5] validation-mlogloss:0.280671
#> [6] validation-mlogloss:0.236317
#> [7] validation-mlogloss:0.204086
#> [8] validation-mlogloss:0.183910
#> [9] validation-mlogloss:0.168322
#> [10] validation-mlogloss:0.156616
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: boost_tree()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_mutate()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> ##### xgb.Booster
#> raw: 40.5 Kb
#> call:
#> xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0,
#> colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1,
#> subsample = 1), data = x$data, nrounds = 10, watchlist = x$watchlist,
#> verbose = 1, eval_metric = "mlogloss", nthread = 1, objective = "multi:softprob",
#> num_class = 3L)
#> params (as set within xgb.train):
#> eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", eval_metric = "mlogloss", nthread = "1", objective = "multi:softprob", num_class = "3", validate_parameters = "TRUE"
#> xgb.attributes:
#> niter
#> callbacks:
#> cb.print.evaluation(period = print_every_n)
#> cb.evaluation.log()
#> # of features: 3
#> niter: 10
#> nfeatures : 3
#> evaluation_log:
#> iter validation_mlogloss
#> 1 0.7496881
#> 2 0.5542800
#> ---
#> 9 0.1683216
#> 10 0.1566161 Created on 2022-08-22 by the reprex package (v2.0.1) |
@simonpcouch any chance you could review this PR and provide some suggestions if you think it needs more work?
@joeycouse, yes, sure thing! Apologies for the lack of response here. I'll slot out some time sooner rather than later to spend with this. I've gone ahead and approved running GHA workflows so that you can make use of our CI.
Thanks again for all of your effort here, @joeycouse! I appreciate you documenting this proposal so thoroughly. We've chatted about these changes and decided that we'll close this PR. There are some changes to machinery that we'd like to keep stable, some changes that we'd like to instead lean more heavily on existing machinery for, and some differences in interface compared to where we'd like to land with this feature. You note two changes that we'd definitely be interested in incorporating:
If you're up for introducing smaller PRs that integrate just these changes, I'd be happy to review. Regardless, we'll definitely be acknowledging your work here in the release notes when we introduce functionality like this in the future. Thanks again for the PR and for raising the points of consideration that you have.
This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
Just a draft of allowing the user to pass a dataframe as the `validation` arg for xgboost. From what I can tell, the main issue that prevents passing a df with extra columns is that when `y` is passed to `as_xgb_data()` (parsnip/R/boost_tree.R, line 380 in f8505bd), the `y` vector is unnamed, so you can't select the columns from `x` and determine which column corresponds to `y` to pass to `xgb.DMatrix()`.
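As a rough illustration of the idea (a minimal sketch, not the PR's actual code; `subset_validation()` and its arguments are hypothetical names), once the outcome's name is known, the validation dataframe can be checked for the needed columns and subset down before building the DMatrix:

subset_validation <- function(x, y_name, validation) {
  # columns the validation set must contain: all predictors plus the outcome
  needed <- c(colnames(x), y_name)
  missing_cols <- setdiff(needed, colnames(validation))
  if (length(missing_cols) > 0) {
    rlang::abort(
      paste0("`validation` is missing column(s): ",
             paste0("`", missing_cols, "`", collapse = ", "))
    )
  }
  # keep only the predictor columns, in the same order as `x`
  val_x <- as.matrix(validation[, colnames(x), drop = FALSE])
  # assumes a numeric label here; factor outcomes would need encoding first
  xgboost::xgb.DMatrix(data = val_x, label = validation[[y_name]])
}

Extra columns in `validation` are simply ignored, which is what makes the workflow examples above (where the validation dataframe carries id-role columns) work.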
Also, case weights weren't being passed to the internal validation set in the current implementation.
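The weight fix amounts to attaching the held-out rows' weights to the validation `xgb.DMatrix` rather than dropping them. A minimal sketch, assuming `x`/`y` are the training data, `wts` the case weights, and `val_idx` the internally held-out rows (all hypothetical names):

# split off the internal validation rows, keeping their labels
val_dmat <- xgboost::xgb.DMatrix(
  data  = as.matrix(x[val_idx, , drop = FALSE]),
  label = y[val_idx]
)
# carry the matching case weights over to the validation set
xgboost::setinfo(val_dmat, "weight", wts[val_idx])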
Would be happy to further develop this PR with some advice from your team. Examples are shown earlier in this thread.