PR to allow dataframe as `validation` arg in xgboost #771
Commits:
- Preserve weights for internal validation set.
- Handle case where validation has additional columns; add error for when `y` is a vector and validation is a dataframe (fit_xy() issue).
- Merge … into xgb_validation
This PR will allow a dataframe to be passed as the `validation` arg to the xgboost engine.
Please let me know if your team is interested in merging. If so, I can add tests. Thanks! Examples below:

library(tidyverse)
library(tidymodels)
# formula interface
# Validation as dataframe
reg_fit <-
boost_tree(trees = 10, mode = "regression") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = mtcars[1:3,],
verbose = 1) %>%
fit(mpg ~ hp + disp + cyl, data = mtcars)
#> [1] validation-mae:14.428947
#> [2] validation-mae:11.075748
#> [3] validation-mae:8.496362
#> [4] validation-mae:5.707752
#> [5] validation-mae:4.324680
#> [6] validation-mae:3.336770
#> [7] validation-mae:1.835582
#> [8] validation-mae:1.294387
#> [9] validation-mae:0.872375
#> [10] validation-mae:0.673549
# formula interface
# Validation as prop
reg_fit <-
boost_tree(trees = 10, mode = "regression") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = 0.2,
verbose = 1) %>%
fit(mpg ~ ., data = mtcars)
#> [1] validation-mae:12.133571
#> [2] validation-mae:8.886966
#> [3] validation-mae:6.678021
#> [4] validation-mae:4.898591
#> [5] validation-mae:3.576521
#> [6] validation-mae:2.651404
#> [7] validation-mae:2.103294
#> [8] validation-mae:1.825942
#> [9] validation-mae:1.769561
#> [10] validation-mae:1.770711
# xy interface
# Validation as dataframe
# Correctly errors
res <- boost_tree(trees = 10, mode = "regression") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = mtcars,
verbose = 1) %>%
fit_xy(x = mtcars[,-1], y = mtcars$mpg)
#> Error in `parsnip::xgb_train()`:
#> ! `y` must be named when `validation` is a dataframe
# xy interface
# Validation as prop
reg_fit <-
boost_tree(trees = 10, mode = "regression") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = 0.2,
verbose = 1) %>%
fit_xy(x = mtcars[,-1], y = mtcars$mpg)
#> [1] validation-mae:14.356102
#> [2] validation-mae:10.397969
#> [3] validation-mae:7.662204
#> [4] validation-mae:5.727999
#> [5] validation-mae:4.343752
#> [6] validation-mae:3.546708
#> [7] validation-mae:3.135106
#> [8] validation-mae:2.820952
#> [9] validation-mae:2.541687
#> [10] validation-mae:2.407671
# workflow interface
# Validation as dataframe
car_rec <-
recipes::recipe(mpg ~ disp + hp + cyl, data = mtcars) |>
recipes::step_mutate(hp_time_disp = hp * disp) |>
recipes::update_role(cyl, new_role = 'id')
reg_fit <-
boost_tree(trees = 10, mode = "regression") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = recipes::prep(car_rec) |> recipes::bake(new_data = mtcars),
verbose = 1)
workflows::workflow(car_rec, reg_fit) |>
fit(data = mtcars)
#> [1] validation-mae:14.049658
#> [2] validation-mae:10.143387
#> [3] validation-mae:7.338691
#> [4] validation-mae:5.341355
#> [5] validation-mae:3.889570
#> [6] validation-mae:2.836564
#> [7] validation-mae:2.107599
#> [8] validation-mae:1.659913
#> [9] validation-mae:1.303094
#> [10] validation-mae:1.047100
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: boost_tree()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_mutate()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> ##### xgb.Booster
#> raw: 13.5 Kb
#> call:
#> xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0,
#> colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1,
#> subsample = 1), data = x$data, nrounds = 10, watchlist = x$watchlist,
#> verbose = 1, eval_metric = "mae", nthread = 1, objective = "reg:squarederror")
#> params (as set within xgb.train):
#> eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", eval_metric = "mae", nthread = "1", objective = "reg:squarederror", validate_parameters = "TRUE"
#> xgb.attributes:
#> niter
#> callbacks:
#> cb.print.evaluation(period = print_every_n)
#> cb.evaluation.log()
#> # of features: 3
#> niter: 10
#> nfeatures : 3
#> evaluation_log:
#> iter validation_mae
#> 1 14.049658
#> 2 10.143387
#> ---
#> 9 1.303094
#> 10 1.047100
# workflow interface
# Validation is missing predictors
# Errors correctly and shows missing predictors
car_rec <-
recipes::recipe(mpg ~ disp + hp + cyl, data = mtcars) |>
recipes::step_mutate(hp_time_disp = hp * disp) |>
recipes::update_role(cyl, new_role = 'id')
reg_fit <-
boost_tree(trees = 10, mode = "regression") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = mtcars,
verbose = 1)
workflows::workflow(car_rec, reg_fit) |>
fit(data = mtcars)
#> Error in `parsnip::xgb_train()`:
#> ! `validation` is missing column(s): `hp_time_disp`
##################### Classification
# Fit Interface
# Validation as Prop
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
verbose = 1,
validation = 0.1) |>
fit(Class ~ A + B, data = two_class_dat)
#> [1] validation-logloss:0.578083
#> [2] validation-logloss:0.510125
#> [3] validation-logloss:0.473173
#> [4] validation-logloss:0.458129
#> [5] validation-logloss:0.456636
#> [6] validation-logloss:0.453387
#> [7] validation-logloss:0.459217
#> [8] validation-logloss:0.460738
#> [9] validation-logloss:0.471640
#> [10] validation-logloss:0.478731
# Fit Interface
# Validation as dataframe
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
verbose = 1,
validation = two_class_dat[1:3,]) |>
fit(Class ~ A + B, data = two_class_dat)
#> [1] validation-logloss:0.635771
#> [2] validation-logloss:0.624781
#> [3] validation-logloss:0.627140
#> [4] validation-logloss:0.591926
#> [5] validation-logloss:0.589791
#> [6] validation-logloss:0.535597
#> [7] validation-logloss:0.526764
#> [8] validation-logloss:0.527316
#> [9] validation-logloss:0.521822
#> [10] validation-logloss:0.545214
# XY Interface
# Validation as Prop
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
verbose = 1,
validation = 0.1) |>
fit_xy(y = two_class_dat$Class, x = two_class_dat[,-3])
#> [1] validation-logloss:0.576877
#> [2] validation-logloss:0.515689
#> [3] validation-logloss:0.490827
#> [4] validation-logloss:0.478492
#> [5] validation-logloss:0.474726
#> [6] validation-logloss:0.472732
#> [7] validation-logloss:0.480608
#> [8] validation-logloss:0.487469
#> [9] validation-logloss:0.484555
#> [10] validation-logloss:0.483709
# XY Interface
# Validation as dataframe
# Correctly errors
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
verbose = 1,
validation = two_class_dat) |>
fit_xy(y = two_class_dat$Class, x = two_class_dat[,-3])
#> Error in `parsnip::xgb_train()`:
#> ! `y` must be named when `validation` is a dataframe
# Recipe Interface
# Validation as dataframe missing vars
class_rec <-
recipes::recipe(Class ~ A + B, data = two_class_dat) |>
recipes::step_mutate(a_b = A * B) |>
recipes::update_role(A, new_role = 'id')
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = two_class_dat,
verbose = 1)
workflows::workflow(class_rec, reg_fit) |>
fit(data = two_class_dat)
#> Error in `parsnip::xgb_train()`:
#> ! `validation` is missing column(s): `a_b`
# Recipe Interface
# Validation as prop
class_rec <-
recipes::recipe(Class ~ A + B, data = two_class_dat) |>
recipes::step_mutate(a_b = A * B) |>
recipes::update_role(A, new_role = 'id')
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = 0.1,
verbose = 1)
workflows::workflow(class_rec, reg_fit) |>
fit(data = two_class_dat)
#> [1] validation-mae:0.439349
#> [2] validation-mae:0.401310
#> [3] validation-mae:0.370106
#> [4] validation-mae:0.344822
#> [5] validation-mae:0.321370
#> [6] validation-mae:0.312411
#> [7] validation-mae:0.305393
#> [8] validation-mae:0.299720
#> [9] validation-mae:0.290905
#> [10] validation-mae:0.285693
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: boost_tree()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_mutate()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> ##### xgb.Booster
#> raw: 31.7 Kb
#> call:
#> xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0,
#> colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1,
#> subsample = 1), data = x$data, nrounds = 10, watchlist = x$watchlist,
#> verbose = 1, eval_metric = "mae", nthread = 1, objective = "binary:logistic")
#> params (as set within xgb.train):
#> eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", eval_metric = "mae", nthread = "1", objective = "binary:logistic", validate_parameters = "TRUE"
#> xgb.attributes:
#> niter
#> callbacks:
#> cb.print.evaluation(period = print_every_n)
#> cb.evaluation.log()
#> # of features: 2
#> niter: 10
#> nfeatures : 2
#> evaluation_log:
#> iter validation_mae
#> 1 0.4393490
#> 2 0.4013096
#> ---
#> 9 0.2909054
#> 10 0.2856933
# Recipe Interface
# Validation as baked data
class_rec <-
recipes::recipe(Class ~ A + B, data = two_class_dat) |>
recipes::step_mutate(a_b = A * B) |>
recipes::update_role(A, new_role = 'id')
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
eval_metric = "mae",
validation = recipes::prep(class_rec) |> recipes::bake(new_data = two_class_dat),
verbose = 1)
workflows::workflow(class_rec, reg_fit) |>
fit(data = two_class_dat)
#> [1] validation-mae:0.421717
#> [2] validation-mae:0.364818
#> [3] validation-mae:0.323719
#> [4] validation-mae:0.294142
#> [5] validation-mae:0.268857
#> [6] validation-mae:0.249749
#> [7] validation-mae:0.235352
#> [8] validation-mae:0.223877
#> [9] validation-mae:0.214782
#> [10] validation-mae:0.208277
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: boost_tree()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_mutate()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> ##### xgb.Booster
#> raw: 32.5 Kb
#> call:
#> xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0,
#> colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1,
#> subsample = 1), data = x$data, nrounds = 10, watchlist = x$watchlist,
#> verbose = 1, eval_metric = "mae", nthread = 1, objective = "binary:logistic")
#> params (as set within xgb.train):
#> eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", eval_metric = "mae", nthread = "1", objective = "binary:logistic", validate_parameters = "TRUE"
#> xgb.attributes:
#> niter
#> callbacks:
#> cb.print.evaluation(period = print_every_n)
#> cb.evaluation.log()
#> # of features: 2
#> niter: 10
#> nfeatures : 2
#> evaluation_log:
#> iter validation_mae
#> 1 0.4217165
#> 2 0.3648176
#> ---
#> 9 0.2147818
#> 10 0.2082772
########### Multiclass
# Fit Interface
# Validation as prop
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
verbose = 1,
validation = 0.1) |>
fit(species ~ bill_length_mm + bill_depth_mm, data = penguins)
#> [1] validation-mlogloss:0.756842
#> [2] validation-mlogloss:0.551547
#> [3] validation-mlogloss:0.414491
#> [4] validation-mlogloss:0.315939
#> [5] validation-mlogloss:0.248686
#> [6] validation-mlogloss:0.205213
#> [7] validation-mlogloss:0.172328
#> [8] validation-mlogloss:0.146289
#> [9] validation-mlogloss:0.119266
#> [10] validation-mlogloss:0.099206
# Fit Interface
# Validation as dataframe
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
verbose = 1,
validation = penguins) |>
fit(species ~ bill_length_mm + bill_depth_mm, data = penguins)
#> [1] validation-mlogloss:0.741214
#> [2] validation-mlogloss:0.532345
#> [3] validation-mlogloss:0.395555
#> [4] validation-mlogloss:0.301706
#> [5] validation-mlogloss:0.232278
#> [6] validation-mlogloss:0.182016
#> [7] validation-mlogloss:0.144608
#> [8] validation-mlogloss:0.116754
#> [9] validation-mlogloss:0.095399
#> [10] validation-mlogloss:0.079533
# XY interface
# Validation as prop
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
verbose = 1,
validation = 0.1) |>
fit_xy(y = penguins$species, x = penguins[,c(3,4)])
#> [1] validation-mlogloss:0.777307
#> [2] validation-mlogloss:0.583527
#> [3] validation-mlogloss:0.462445
#> [4] validation-mlogloss:0.364660
#> [5] validation-mlogloss:0.291666
#> [6] validation-mlogloss:0.238673
#> [7] validation-mlogloss:0.201141
#> [8] validation-mlogloss:0.173881
#> [9] validation-mlogloss:0.150524
#> [10] validation-mlogloss:0.130961
# XY interface
# Validation as dataframe
# Correctly errors
reg_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
verbose = 1,
validation = penguins) |>
fit_xy(y = penguins$species, x = penguins[,c(3,4)])
#> Error in `parsnip::xgb_train()`:
#> ! `y` must be named when `validation` is a dataframe
# Recipe Interface
# Validation as dataframe missing vars
class_rec <-
recipes::recipe(species ~ bill_length_mm + bill_depth_mm, data = penguins) |>
recipes::step_mutate(b_2 = bill_length_mm * 2)
class_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
eval_metric = "mlogloss",
validation = penguins,
verbose = 1)
workflows::workflow(class_rec, class_fit) |>
fit(data = penguins)
#> Error in `parsnip::xgb_train()`:
#> ! `validation` is missing column(s): `b_2`
# Recipe Interface
# Validation as baked dataframe
class_rec <-
recipes::recipe(species ~ bill_length_mm + bill_depth_mm, data = penguins) |>
recipes::step_mutate(b_2 = bill_length_mm * 2)
class_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
eval_metric = "mlogloss",
validation = recipes::prep(class_rec) |> recipes::bake(new_data = penguins),
verbose = 1)
workflows::workflow(class_rec, class_fit) |>
fit(data = penguins)
#> [1] validation-mlogloss:0.741382
#> [2] validation-mlogloss:0.531399
#> [3] validation-mlogloss:0.394591
#> [4] validation-mlogloss:0.299946
#> [5] validation-mlogloss:0.231754
#> [6] validation-mlogloss:0.181127
#> [7] validation-mlogloss:0.143169
#> [8] validation-mlogloss:0.115443
#> [9] validation-mlogloss:0.094732
#> [10] validation-mlogloss:0.078962
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: boost_tree()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_mutate()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> ##### xgb.Booster
#> raw: 42.4 Kb
#> call:
#> xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0,
#> colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1,
#> subsample = 1), data = x$data, nrounds = 10, watchlist = x$watchlist,
#> verbose = 1, eval_metric = "mlogloss", nthread = 1, objective = "multi:softprob",
#> num_class = 3L)
#> params (as set within xgb.train):
#> eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", eval_metric = "mlogloss", nthread = "1", objective = "multi:softprob", num_class = "3", validate_parameters = "TRUE"
#> xgb.attributes:
#> niter
#> callbacks:
#> cb.print.evaluation(period = print_every_n)
#> cb.evaluation.log()
#> # of features: 3
#> niter: 10
#> nfeatures : 3
#> evaluation_log:
#> iter validation_mlogloss
#> 1 0.74138198
#> 2 0.53139859
#> ---
#> 9 0.09473197
#> 10 0.07896238
# Recipe Interface
# Validation as prop
class_rec <-
recipes::recipe(species ~ bill_length_mm + bill_depth_mm, data = penguins) |>
recipes::step_mutate(b_2 = bill_length_mm * 2)
class_fit <-
boost_tree(trees = 10, mode = "classification") %>%
set_engine("xgboost",
eval_metric = "mlogloss",
validation = 0.2,
verbose = 1)
workflows::workflow(class_rec, class_fit) |>
fit(data = penguins)
#> [1] validation-mlogloss:0.749688
#> [2] validation-mlogloss:0.554280
#> [3] validation-mlogloss:0.428803
#> [4] validation-mlogloss:0.338456
#> [5] validation-mlogloss:0.280671
#> [6] validation-mlogloss:0.236317
#> [7] validation-mlogloss:0.204086
#> [8] validation-mlogloss:0.183910
#> [9] validation-mlogloss:0.168322
#> [10] validation-mlogloss:0.156616
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: boost_tree()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_mutate()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> ##### xgb.Booster
#> raw: 40.5 Kb
#> call:
#> xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0,
#> colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1,
#> subsample = 1), data = x$data, nrounds = 10, watchlist = x$watchlist,
#> verbose = 1, eval_metric = "mlogloss", nthread = 1, objective = "multi:softprob",
#> num_class = 3L)
#> params (as set within xgb.train):
#> eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", eval_metric = "mlogloss", nthread = "1", objective = "multi:softprob", num_class = "3", validate_parameters = "TRUE"
#> xgb.attributes:
#> niter
#> callbacks:
#> cb.print.evaluation(period = print_every_n)
#> cb.evaluation.log()
#> # of features: 3
#> niter: 10
#> nfeatures : 3
#> evaluation_log:
#> iter validation_mlogloss
#> 1 0.7496881
#> 2 0.5542800
#> ---
#> 9 0.1683216
#> 10 0.1566161 Created on 2022-08-22 by the reprex package (v2.0.1) |
@simonpcouch any chance you could review this PR and provide some suggestions if you think it needs more work?
@joeycouse, yes, sure thing! Apologies for the lack of response here. I'll slot out some time sooner rather than later to spend with this. I've gone ahead and approved running GHA workflows so that you can make use of our CI.
Thanks again for all of your effort here, @joeycouse! I appreciate you documenting this proposal so thoroughly. We've chatted about these changes and decided that we'll close this PR. There are some changes to machinery that we'd like to keep stable, some changes that we'd like to instead lean more heavily on existing machinery for, and some differences in interface compared to where we'd like to land with this feature. You note two changes that we'd definitely be interested in incorporating:
If you're up for introducing smaller PRs that integrate just these changes, I'd be happy to review. Regardless, we'll definitely be acknowledging your work here in the release notes when we introduce functionality like this in the future. Thanks again for the PR and for raising the points of consideration that you have.
This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
Just a draft of allowing the user to pass a dataframe as the `validation` arg for xgboost. From what I can tell, the main issue that prevents passing a df with extra columns is that when `y` is passed to `as_xgb_data()` (parsnip/R/boost_tree.R, line 380 in f8505bd), the `y` vector is unnamed, so you can't select the columns from `x` and determine which column corresponds to `y` to pass to `xgb.DMatrix()`.
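As a rough illustration of the idea (a minimal sketch, not the PR's actual code; `subset_validation()` and its arguments are hypothetical names), once the outcome's name is known, the validation dataframe can be checked for the needed columns and subset down before building the DMatrix:

subset_validation <- function(x, y_name, validation) {
  # columns the validation set must contain: all predictors plus the outcome
  needed <- c(colnames(x), y_name)
  missing_cols <- setdiff(needed, colnames(validation))
  if (length(missing_cols) > 0) {
    rlang::abort(
      paste0("`validation` is missing column(s): ",
             paste0("`", missing_cols, "`", collapse = ", "))
    )
  }
  # keep only the predictor columns, in the same order as `x`
  val_x <- as.matrix(validation[, colnames(x), drop = FALSE])
  # assumes a numeric label here; factor outcomes would need encoding first
  xgboost::xgb.DMatrix(data = val_x, label = validation[[y_name]])
}

Extra columns in `validation` are simply ignored, which is what makes the workflow examples above (where the validation dataframe carries id-role columns) work.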
Also, case weights weren't being passed to the internal validation set in the current implementation.
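The weight fix amounts to attaching the held-out rows' weights to the validation `xgb.DMatrix` rather than dropping them. A minimal sketch, assuming `x`/`y` are the training data, `wts` the case weights, and `val_idx` the internally held-out rows (all hypothetical names):

# split off the internal validation rows, keeping their labels
val_dmat <- xgboost::xgb.DMatrix(
  data  = as.matrix(x[val_idx, , drop = FALSE]),
  label = y[val_idx]
)
# carry the matching case weights over to the validation set
xgboost::setinfo(val_dmat, "weight", wts[val_idx])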
Would be happy to further develop this PR with some advice from your team. Examples are shown earlier in this thread.