PR to allow dataframe as validation arg in xgboost #771

Closed
wants to merge 13 commits

Conversation

joeycouse

Just a draft allowing the user to pass a dataframe as the validation arg for xgboost. From what I can tell, the main issue preventing a df with extra columns from being passed is that when y is passed to as_xgb_data():

as_xgb_data <- function(x, y, validation = 0, weights = NULL, event_level = "first", ...) {

the y vector is unnamed, so you can't just select the corresponding columns from x, or identify which column is y, to pass to
xgb.DMatrix()

Also, case weights weren't being passed to the internal validation set in the current implementation.
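
A minimal sketch of that fix, with assumed names rather than parsnip's actual internals: the held-out rows' weights need to travel with them into the validation xgb.DMatrix().

# Minimal sketch with assumed names (not parsnip's internals): attach
# the held-out rows' case weights to the validation DMatrix instead of
# dropping them.
n       <- nrow(mtcars)
val_idx <- sample(n, floor(0.2 * n))   # rows held out for validation
w       <- runif(n)                    # stand-in case weights
val_dmat <- xgboost::xgb.DMatrix(
  data   = as.matrix(mtcars[val_idx, -1]),
  label  = mtcars$mpg[val_idx],
  weight = w[val_idx]
)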

I'd be happy to develop this PR further with some advice from your team.

Examples

> # Works 
> reg_fit <-
+   boost_tree(trees = 10, mode = "regression") %>%
+   set_engine("xgboost", 
+              eval_metric = "mae", 
+              validation = mtcars[1:3,],
+              verbose = 1) %>%
+   fit(mpg ~ ., data = mtcars)
[1]	validation-mae:14.428947 
[2]	validation-mae:10.406661 
[3]	validation-mae:7.885722 
[4]	validation-mae:5.953925 
[5]	validation-mae:4.558322 
[6]	validation-mae:3.400241 
[7]	validation-mae:2.560631 
[8]	validation-mae:1.757373 
[9]	validation-mae:1.187967 
[10]	validation-mae:0.787739 
> 
> reg_fit <-
+   boost_tree(trees = 10, mode = "regression") %>%
+   set_engine("xgboost", 
+              eval_metric = "mae", 
+              validation = 0.2,
+              verbose = 1) %>%
+   fit(mpg ~ ., data = mtcars)
[1]	validation-mae:14.203231 
[2]	validation-mae:9.799767 
[3]	validation-mae:7.313254 
[4]	validation-mae:5.302311 
[5]	validation-mae:3.932890 
[6]	validation-mae:2.969718 
[7]	validation-mae:2.525703 
[8]	validation-mae:2.261265 
[9]	validation-mae:2.169489 
[10]	validation-mae:2.301616 
> 
> 
> # Errors
> car_rec <- 
+   recipe(mpg ~ disp + hp + cyl, data = mtcars) |> 
+   update_role(cyl, new_role = 'id')
> 
> 
> reg_fit <-
+     boost_tree(trees = 10, mode = "regression") %>%
+     set_engine("xgboost", 
+                eval_metric = "mae", 
+                validation = mtcars[,c(1,2,3,4)],
+                verbose = 1) 
> 
> result <- 
+   workflows::workflow(car_rec, reg_fit) |> 
+   fit(data = mtcars)
Error in `parsnip::xgb_train()`:
! `validation` should contain 3 columns
Run `rlang::last_error()` to see where the error occurred.
> 
> mtcars_random <-
+   mtcars |> 
+   mutate(random = runif(nrow(mtcars), 0, 10))
> 
> reg_fit <-
+   boost_tree(trees = 10, mode = "regression") %>%
+   set_engine("xgboost", 
+              eval_metric = "mae", 
+              validation = mtcars_random,
+              verbose = 1) %>%
+   fit(mpg ~ ., data = mtcars)
Error in `parsnip::xgb_train()`:
! `validation` should contain 11 columns
Run `rlang::last_error()` to see where the error occurred.
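
A name-based check along these lines (hypothetical, not this draft's actual code) would both catch missing predictors and report them by name, which is where I'd like to take the errors above:

# Hypothetical sketch, not this PR's actual code: verify that the
# validation set carries every predictor column used in training and
# name the ones that are missing.
check_validation_cols <- function(validation, train_cols) {
  missing_cols <- setdiff(train_cols, names(validation))
  if (length(missing_cols) > 0) {
    rlang::abort(
      paste0(
        "`validation` is missing column(s): ",
        paste0("`", missing_cols, "`", collapse = ", ")
      )
    )
  }
  invisible(validation)
}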

@joeycouse joeycouse changed the title from "Draft PR for to allow dataframe as validation arg in xgboost" to "Draft PR to allow dataframe as validation arg in xgboost" on Jul 20, 2022
@joeycouse
Author

This PR allows a dataframe to be passed as the validation arg for xgboost. Some changes of note:

  • Modified the internal function will_make_matrix() to return TRUE if the input is a numeric vector or a matrix (see #786: "Internals bug? will_make_matrix() returns FALSE when given a matrix"); a sketch follows this list
  • Updated the corresponding test to expect a named numeric matrix
  • Added edge-case handling for when the mode is classification and the engine is xgboost
  • Saved the factor column name as a col_name attribute when the mode is classification and engine = "xgboost", so it is available when passed to xgb_train(); also sketched below
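
Sketches of the first and last changes above, with assumed bodies since only their behavior is described here:

# Assumed sketch of the modified predicate: return TRUE for inputs
# that will be converted to a numeric matrix. The previous version
# returned FALSE when given a matrix (#786).
will_make_matrix <- function(x) {
  if (is.matrix(x)) return(TRUE)
  if (is.vector(x) && is.numeric(x)) return(TRUE)
  # assumed fallback: an all-numeric data frame also makes a matrix
  is.data.frame(x) && all(vapply(x, is.numeric, logical(1)))
}

# Assumed sketch of saving the factor outcome's column name as an
# attribute so it survives into xgb_train() for classification:
y <- factor(c("Class1", "Class2", "Class1"))
attr(y, "col_name") <- "Class"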

Please let me know if your team is interested in merging. If so, I can add tests. Thanks!

Examples below:

library(tidyverse)
library(tidymodels)

# formula interface
# Validation as dataframe
reg_fit <-
  boost_tree(trees = 10, mode = "regression") %>%
  set_engine("xgboost",
             eval_metric = "mae",
             validation = mtcars[1:3,],
             verbose = 1) %>%
  fit(mpg ~ hp + disp + cyl, data = mtcars)
#> [1]  validation-mae:14.428947 
#> [2]  validation-mae:11.075748 
#> [3]  validation-mae:8.496362 
#> [4]  validation-mae:5.707752 
#> [5]  validation-mae:4.324680 
#> [6]  validation-mae:3.336770 
#> [7]  validation-mae:1.835582 
#> [8]  validation-mae:1.294387 
#> [9]  validation-mae:0.872375 
#> [10] validation-mae:0.673549


# formula interface
# Validation as prop
reg_fit <-
  boost_tree(trees = 10, mode = "regression") %>%
  set_engine("xgboost",
             eval_metric = "mae",
             validation = 0.2,
             verbose = 1) %>%
  fit(mpg ~ ., data = mtcars)
#> [1]  validation-mae:12.133571 
#> [2]  validation-mae:8.886966 
#> [3]  validation-mae:6.678021 
#> [4]  validation-mae:4.898591 
#> [5]  validation-mae:3.576521 
#> [6]  validation-mae:2.651404 
#> [7]  validation-mae:2.103294 
#> [8]  validation-mae:1.825942 
#> [9]  validation-mae:1.769561 
#> [10] validation-mae:1.770711


# xy interface
# Validation as dataframe
# Correctly errors

res <- boost_tree(trees = 10, mode = "regression") %>%
  set_engine("xgboost",
             eval_metric = "mae",
             validation = mtcars,
             verbose = 1) %>%
  fit_xy(x = mtcars[,-1],  y = mtcars$mpg)
#> Error in `parsnip::xgb_train()`:
#> ! `y` must be named when `validation` is a dataframe


# xy interface
# Validation as prop
reg_fit <-
  boost_tree(trees = 10, mode = "regression") %>%
  set_engine("xgboost",
             eval_metric = "mae",
             validation = 0.2,
             verbose = 1) %>%
  fit_xy(x = mtcars[,-1],  y = mtcars$mpg)
#> [1]  validation-mae:14.356102 
#> [2]  validation-mae:10.397969 
#> [3]  validation-mae:7.662204 
#> [4]  validation-mae:5.727999 
#> [5]  validation-mae:4.343752 
#> [6]  validation-mae:3.546708 
#> [7]  validation-mae:3.135106 
#> [8]  validation-mae:2.820952 
#> [9]  validation-mae:2.541687 
#> [10] validation-mae:2.407671


# workflow interface
# Validation as dataframe
car_rec <-
  recipes::recipe(mpg ~ disp + hp + cyl, data = mtcars) |>
  recipes::step_mutate(hp_time_disp = hp * disp) |>
  recipes::update_role(cyl, new_role = 'id')

reg_fit <-
    boost_tree(trees = 10, mode = "regression") %>%
    set_engine("xgboost",
               eval_metric = "mae",
               validation = recipes::prep(car_rec) |> recipes::bake(new_data = mtcars),
               verbose = 1)

workflows::workflow(car_rec, reg_fit) |>
  fit(data = mtcars)
#> [1]  validation-mae:14.049658 
#> [2]  validation-mae:10.143387 
#> [3]  validation-mae:7.338691 
#> [4]  validation-mae:5.341355 
#> [5]  validation-mae:3.889570 
#> [6]  validation-mae:2.836564 
#> [7]  validation-mae:2.107599 
#> [8]  validation-mae:1.659913 
#> [9]  validation-mae:1.303094 
#> [10] validation-mae:1.047100
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: boost_tree()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#> 
#> • step_mutate()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> ##### xgb.Booster
#> raw: 13.5 Kb 
#> call:
#>   xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
#>     colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1, 
#>     subsample = 1), data = x$data, nrounds = 10, watchlist = x$watchlist, 
#>     verbose = 1, eval_metric = "mae", nthread = 1, objective = "reg:squarederror")
#> params (as set within xgb.train):
#>   eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", eval_metric = "mae", nthread = "1", objective = "reg:squarederror", validate_parameters = "TRUE"
#> xgb.attributes:
#>   niter
#> callbacks:
#>   cb.print.evaluation(period = print_every_n)
#>   cb.evaluation.log()
#> # of features: 3 
#> niter: 10
#> nfeatures : 3 
#> evaluation_log:
#>     iter validation_mae
#>        1      14.049658
#>        2      10.143387
#> ---                    
#>        9       1.303094
#>       10       1.047100


# workflow interface
# Validation is missing predictors
# Errors correctly and shows missing predictors
car_rec <-
  recipes::recipe(mpg ~ disp + hp + cyl, data = mtcars) |>
  recipes::step_mutate(hp_time_disp = hp * disp) |>
  recipes::update_role(cyl, new_role = 'id')

reg_fit <-
  boost_tree(trees = 10, mode = "regression") %>%
  set_engine("xgboost",
             eval_metric = "mae",
             validation = mtcars,
             verbose = 1)

workflows::workflow(car_rec, reg_fit) |>
  fit(data = mtcars)
#> Error in `parsnip::xgb_train()`:
#> ! `validation` is missing column(s): `hp_time_disp`


##################### Classification

# Fit Interface
# Validation as Prop
reg_fit <-
  boost_tree(trees = 10, mode = "classification") %>%
  set_engine("xgboost",
             verbose = 1,
             validation = 0.1) |>
  fit(Class ~ A + B, data = two_class_dat)
#> [1]  validation-logloss:0.578083 
#> [2]  validation-logloss:0.510125 
#> [3]  validation-logloss:0.473173 
#> [4]  validation-logloss:0.458129 
#> [5]  validation-logloss:0.456636 
#> [6]  validation-logloss:0.453387 
#> [7]  validation-logloss:0.459217 
#> [8]  validation-logloss:0.460738 
#> [9]  validation-logloss:0.471640 
#> [10] validation-logloss:0.478731


# Fit Interface
# Validation as dataframe
reg_fit <-
  boost_tree(trees = 10, mode = "classification") %>%
  set_engine("xgboost",
             verbose = 1,
             validation = two_class_dat[1:3,]) |>
  fit(Class ~ A + B, data = two_class_dat)
#> [1]  validation-logloss:0.635771 
#> [2]  validation-logloss:0.624781 
#> [3]  validation-logloss:0.627140 
#> [4]  validation-logloss:0.591926 
#> [5]  validation-logloss:0.589791 
#> [6]  validation-logloss:0.535597 
#> [7]  validation-logloss:0.526764 
#> [8]  validation-logloss:0.527316 
#> [9]  validation-logloss:0.521822 
#> [10] validation-logloss:0.545214


# XY Interface
# Validation as Prop
reg_fit <-
  boost_tree(trees = 10, mode = "classification") %>%
  set_engine("xgboost",
             verbose = 1,
             validation = 0.1) |>
  fit_xy(y = two_class_dat$Class, x = two_class_dat[,-3])
#> [1]  validation-logloss:0.576877 
#> [2]  validation-logloss:0.515689 
#> [3]  validation-logloss:0.490827 
#> [4]  validation-logloss:0.478492 
#> [5]  validation-logloss:0.474726 
#> [6]  validation-logloss:0.472732 
#> [7]  validation-logloss:0.480608 
#> [8]  validation-logloss:0.487469 
#> [9]  validation-logloss:0.484555 
#> [10] validation-logloss:0.483709


# XY Interface
# Validation as dataframe
# Correctly errors
reg_fit <-
  boost_tree(trees = 10, mode = "classification") %>%
  set_engine("xgboost",
             verbose = 1,
             validation = two_class_dat) |>
  fit_xy(y = two_class_dat$Class, x = two_class_dat[,-3])
#> Error in `parsnip::xgb_train()`:
#> ! `y` must be named when `validation` is a dataframe


# Recipe Interface
# Validation as dataframe missing vars
class_rec <-
  recipes::recipe(Class ~ A + B, data = two_class_dat) |>
  recipes::step_mutate(a_b = A * B) |>
  recipes::update_role(A, new_role = 'id')

reg_fit <-
  boost_tree(trees = 10, mode = "classification") %>%
  set_engine("xgboost",
             eval_metric = "mae",
             validation = two_class_dat,
             verbose = 1)

workflows::workflow(class_rec, reg_fit) |>
  fit(data = two_class_dat)
#> Error in `parsnip::xgb_train()`:
#> ! `validation` is missing column(s): `a_b`


# Recipe Interface
# Validation as prop
class_rec <-
  recipes::recipe(Class ~ A + B, data = two_class_dat) |>
  recipes::step_mutate(a_b = A * B) |>
  recipes::update_role(A, new_role = 'id')

reg_fit <-
  boost_tree(trees = 10, mode = "classification") %>%
  set_engine("xgboost",
             eval_metric = "mae",
             validation = 0.1,
             verbose = 1)

workflows::workflow(class_rec, reg_fit) |>
  fit(data = two_class_dat)
#> [1]  validation-mae:0.439349 
#> [2]  validation-mae:0.401310 
#> [3]  validation-mae:0.370106 
#> [4]  validation-mae:0.344822 
#> [5]  validation-mae:0.321370 
#> [6]  validation-mae:0.312411 
#> [7]  validation-mae:0.305393 
#> [8]  validation-mae:0.299720 
#> [9]  validation-mae:0.290905 
#> [10] validation-mae:0.285693
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: boost_tree()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#> 
#> • step_mutate()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> ##### xgb.Booster
#> raw: 31.7 Kb 
#> call:
#>   xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
#>     colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1, 
#>     subsample = 1), data = x$data, nrounds = 10, watchlist = x$watchlist, 
#>     verbose = 1, eval_metric = "mae", nthread = 1, objective = "binary:logistic")
#> params (as set within xgb.train):
#>   eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", eval_metric = "mae", nthread = "1", objective = "binary:logistic", validate_parameters = "TRUE"
#> xgb.attributes:
#>   niter
#> callbacks:
#>   cb.print.evaluation(period = print_every_n)
#>   cb.evaluation.log()
#> # of features: 2 
#> niter: 10
#> nfeatures : 2 
#> evaluation_log:
#>     iter validation_mae
#>        1      0.4393490
#>        2      0.4013096
#> ---                    
#>        9      0.2909054
#>       10      0.2856933


# Recipe Interface
# Validation as baked data
class_rec <-
  recipes::recipe(Class ~ A + B, data = two_class_dat) |>
  recipes::step_mutate(a_b = A * B) |>
  recipes::update_role(A, new_role = 'id')

reg_fit <-
  boost_tree(trees = 10, mode = "classification") %>%
  set_engine("xgboost",
             eval_metric = "mae",
             validation = recipes::prep(class_rec) |> recipes::bake(new_data = two_class_dat),
             verbose = 1)

workflows::workflow(class_rec, reg_fit) |>
  fit(data = two_class_dat)
#> [1]  validation-mae:0.421717 
#> [2]  validation-mae:0.364818 
#> [3]  validation-mae:0.323719 
#> [4]  validation-mae:0.294142 
#> [5]  validation-mae:0.268857 
#> [6]  validation-mae:0.249749 
#> [7]  validation-mae:0.235352 
#> [8]  validation-mae:0.223877 
#> [9]  validation-mae:0.214782 
#> [10] validation-mae:0.208277
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: boost_tree()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#> 
#> • step_mutate()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> ##### xgb.Booster
#> raw: 32.5 Kb 
#> call:
#>   xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
#>     colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1, 
#>     subsample = 1), data = x$data, nrounds = 10, watchlist = x$watchlist, 
#>     verbose = 1, eval_metric = "mae", nthread = 1, objective = "binary:logistic")
#> params (as set within xgb.train):
#>   eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", eval_metric = "mae", nthread = "1", objective = "binary:logistic", validate_parameters = "TRUE"
#> xgb.attributes:
#>   niter
#> callbacks:
#>   cb.print.evaluation(period = print_every_n)
#>   cb.evaluation.log()
#> # of features: 2 
#> niter: 10
#> nfeatures : 2 
#> evaluation_log:
#>     iter validation_mae
#>        1      0.4217165
#>        2      0.3648176
#> ---                    
#>        9      0.2147818
#>       10      0.2082772


########### Multiclass


# Fit Interface
# Validation as prop
reg_fit <-
  boost_tree(trees = 10, mode = "classification") %>%
  set_engine("xgboost",
             verbose = 1,
             validation = 0.1) |>
  fit(species ~ bill_length_mm + bill_depth_mm, data = penguins)
#> [1]  validation-mlogloss:0.756842 
#> [2]  validation-mlogloss:0.551547 
#> [3]  validation-mlogloss:0.414491 
#> [4]  validation-mlogloss:0.315939 
#> [5]  validation-mlogloss:0.248686 
#> [6]  validation-mlogloss:0.205213 
#> [7]  validation-mlogloss:0.172328 
#> [8]  validation-mlogloss:0.146289 
#> [9]  validation-mlogloss:0.119266 
#> [10] validation-mlogloss:0.099206


# Fit Interface
# Validation as dataframe
reg_fit <-
  boost_tree(trees = 10, mode = "classification") %>%
  set_engine("xgboost",
             verbose = 1,
             validation = penguins) |>
  fit(species ~ bill_length_mm + bill_depth_mm, data = penguins)
#> [1]  validation-mlogloss:0.741214 
#> [2]  validation-mlogloss:0.532345 
#> [3]  validation-mlogloss:0.395555 
#> [4]  validation-mlogloss:0.301706 
#> [5]  validation-mlogloss:0.232278 
#> [6]  validation-mlogloss:0.182016 
#> [7]  validation-mlogloss:0.144608 
#> [8]  validation-mlogloss:0.116754 
#> [9]  validation-mlogloss:0.095399 
#> [10] validation-mlogloss:0.079533

# XY interface
# Validation as prop
reg_fit <-
  boost_tree(trees = 10, mode = "classification") %>%
  set_engine("xgboost",
             verbose = 1,
             validation = 0.1) |>
  fit_xy(y = penguins$species, x = penguins[,c(3,4)])
#> [1]  validation-mlogloss:0.777307 
#> [2]  validation-mlogloss:0.583527 
#> [3]  validation-mlogloss:0.462445 
#> [4]  validation-mlogloss:0.364660 
#> [5]  validation-mlogloss:0.291666 
#> [6]  validation-mlogloss:0.238673 
#> [7]  validation-mlogloss:0.201141 
#> [8]  validation-mlogloss:0.173881 
#> [9]  validation-mlogloss:0.150524 
#> [10] validation-mlogloss:0.130961

# XY interface
# Validation as dataframe
# Correctly errors
reg_fit <-
  boost_tree(trees = 10, mode = "classification") %>%
  set_engine("xgboost",
             verbose = 1,
             validation = penguins) |>
  fit_xy(y = penguins$species, x = penguins[,c(3,4)])
#> Error in `parsnip::xgb_train()`:
#> ! `y` must be named when `validation` is a dataframe


# Recipe Interface
# Validation as dataframe missing vars
class_rec <-
  recipes::recipe(species ~ bill_length_mm + bill_depth_mm, data = penguins) |>
  recipes::step_mutate(b_2 = bill_length_mm * 2)

class_fit <-
  boost_tree(trees = 10, mode = "classification") %>%
  set_engine("xgboost",
             eval_metric = "mlogloss",
             validation = penguins,
             verbose = 1)

workflows::workflow(class_rec, class_fit) |>
  fit(data = penguins)
#> Error in `parsnip::xgb_train()`:
#> ! `validation` is missing column(s): `b_2`


# Recipe Interface
# Validation as baked dataframe
class_rec <-
  recipes::recipe(species ~ bill_length_mm + bill_depth_mm, data = penguins) |>
  recipes::step_mutate(b_2 = bill_length_mm * 2)

class_fit <-
  boost_tree(trees = 10, mode = "classification") %>%
  set_engine("xgboost",
             eval_metric = "mlogloss",
             validation = recipes::prep(class_rec) |> recipes::bake(new_data = penguins),
             verbose = 1)

workflows::workflow(class_rec, class_fit) |>
  fit(data = penguins)
#> [1]  validation-mlogloss:0.741382 
#> [2]  validation-mlogloss:0.531399 
#> [3]  validation-mlogloss:0.394591 
#> [4]  validation-mlogloss:0.299946 
#> [5]  validation-mlogloss:0.231754 
#> [6]  validation-mlogloss:0.181127 
#> [7]  validation-mlogloss:0.143169 
#> [8]  validation-mlogloss:0.115443 
#> [9]  validation-mlogloss:0.094732 
#> [10] validation-mlogloss:0.078962
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: boost_tree()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#> 
#> • step_mutate()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> ##### xgb.Booster
#> raw: 42.4 Kb 
#> call:
#>   xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
#>     colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1, 
#>     subsample = 1), data = x$data, nrounds = 10, watchlist = x$watchlist, 
#>     verbose = 1, eval_metric = "mlogloss", nthread = 1, objective = "multi:softprob", 
#>     num_class = 3L)
#> params (as set within xgb.train):
#>   eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", eval_metric = "mlogloss", nthread = "1", objective = "multi:softprob", num_class = "3", validate_parameters = "TRUE"
#> xgb.attributes:
#>   niter
#> callbacks:
#>   cb.print.evaluation(period = print_every_n)
#>   cb.evaluation.log()
#> # of features: 3 
#> niter: 10
#> nfeatures : 3 
#> evaluation_log:
#>     iter validation_mlogloss
#>        1          0.74138198
#>        2          0.53139859
#> ---                         
#>        9          0.09473197
#>       10          0.07896238


# Recipe Interface
# Validation as prop
class_rec <-
  recipes::recipe(species ~ bill_length_mm + bill_depth_mm, data = penguins) |>
  recipes::step_mutate(b_2 = bill_length_mm * 2)

class_fit <-
  boost_tree(trees = 10, mode = "classification") %>%
  set_engine("xgboost",
             eval_metric = "mlogloss",
             validation = 0.2,
             verbose = 1)

workflows::workflow(class_rec, class_fit) |>
  fit(data = penguins)
#> [1]  validation-mlogloss:0.749688 
#> [2]  validation-mlogloss:0.554280 
#> [3]  validation-mlogloss:0.428803 
#> [4]  validation-mlogloss:0.338456 
#> [5]  validation-mlogloss:0.280671 
#> [6]  validation-mlogloss:0.236317 
#> [7]  validation-mlogloss:0.204086 
#> [8]  validation-mlogloss:0.183910 
#> [9]  validation-mlogloss:0.168322 
#> [10] validation-mlogloss:0.156616
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: boost_tree()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#> 
#> • step_mutate()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> ##### xgb.Booster
#> raw: 40.5 Kb 
#> call:
#>   xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
#>     colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1, 
#>     subsample = 1), data = x$data, nrounds = 10, watchlist = x$watchlist, 
#>     verbose = 1, eval_metric = "mlogloss", nthread = 1, objective = "multi:softprob", 
#>     num_class = 3L)
#> params (as set within xgb.train):
#>   eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", eval_metric = "mlogloss", nthread = "1", objective = "multi:softprob", num_class = "3", validate_parameters = "TRUE"
#> xgb.attributes:
#>   niter
#> callbacks:
#>   cb.print.evaluation(period = print_every_n)
#>   cb.evaluation.log()
#> # of features: 3 
#> niter: 10
#> nfeatures : 3 
#> evaluation_log:
#>     iter validation_mlogloss
#>        1           0.7496881
#>        2           0.5542800
#> ---                         
#>        9           0.1683216
#>       10           0.1566161

Created on 2022-08-22 by the reprex package (v2.0.1)

@joeycouse joeycouse marked this pull request as ready for review August 22, 2022 18:11
@joeycouse joeycouse changed the title from "Draft PR to allow dataframe as validation arg in xgboost" to "PR to allow dataframe as validation arg in xgboost" on Aug 22, 2022
@joeycouse
Author

@simonpcouch any chance you could review this PR and provide some suggestions if you think it needs more work?

@simonpcouch simonpcouch self-requested a review August 29, 2022 12:30
@simonpcouch
Contributor

@joeycouse, yes, sure thing!

Apologies for the lack of response here. I'll slot out some time sooner rather than later to spend with this. I've gone ahead and approved running GHA workflows so that you can make use of our CI.

@simonpcouch
Contributor

Thanks again for all of your effort here, @joeycouse! I appreciate you documenting this proposal so thoroughly.

We've chatted about these changes and decided that we'll close this PR. There are some changes to machinery that we'd like to keep stable, some changes that we'd like to instead lean more heavily on existing machinery for, and some differences in interface compared to where we'd like to land with this feature.

You note two changes that we'd definitely be interested in incorporating:

...case weights weren't being passed to the internal validation set in the current implementation.

Edge case handling for when mode is classification and engine is xgboost

If you're up for introducing smaller PRs that integrate just these changes, I'd be happy to review.

Regardless, we'll definitely be acknowledging your work here in the release notes when we introduce functionality like this in the future. Thanks again for the PR and raising the points of consideration that you have.

@github-actions

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Sep 14, 2022