The problem

I believe there is a bug in select_by_pct_loss, such that it returns the model with the greatest loss within the limit, not necessarily the simplest model whose loss is within the limit. The problem seems to be that in the last few lines of the function, the models whose loss is within the limit are ranked by loss in descending order, and then the first row of the data frame (i.e. the model with the greatest loss) is returned.
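To make the difference concrete, here is a minimal, self-contained sketch of the two selection strategies. The toy results table and its values are made up purely for illustration; this is not the actual tune source code.

library(dplyr)

# Toy results table, already arranged by the user-supplied complexity ordering
# (desc(K), so the least complex candidate comes first); values are illustrative only.
res_toy <- tibble::tibble(
  K     = c(40, 35, 33),
  .loss = c(3.2, 7.8, 0)
)

# Behaviour described above: re-sort by .loss (descending) and take the first row,
# which returns the model with the greatest loss within the limit (K = 35).
res_toy %>%
  filter(.loss < 10) %>%
  arrange(desc(.loss)) %>%
  slice(1)

# Expected behaviour: keep the complexity ordering and take the first row,
# which returns the least complex model within the limit (K = 40).
res_toy %>%
  filter(.loss < 10) %>%
  slice(1)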
Reproducible example
library(tidymodels)
#> Warning: package 'purrr' was built under R version 4.0.5
data("example_ames_knn")
my_metric<-"rmse"my_limit<-10# the following should return the least complex model (i.e. highest value of K) that has no more# than a 10% loss of RMSE:
select_by_pct_loss(
  ames_grid_search,
  metric = my_metric,
  limit = my_limit,
  desc(K)
)
#> # A tibble: 1 x 13
#>       K weigh~1 dist_~2   lon   lat .metric .esti~3   mean     n std_err .config
#>   <int> <chr>     <dbl> <int> <int> <chr>   <chr>    <dbl> <int>   <dbl> <chr>  
#> 1    35 optimal    1.32     8     1 rmse    standa~ 0.0785    10 0.00347 Prepro~
#> # ... with 2 more variables: .best <dbl>, .loss <dbl>, and abbreviated variable
#> #   names 1: weight_func, 2: dist_power, 3: .estimator

# However, doing this manually gives a different result:
res <-
  collect_metrics(ames_grid_search) %>%
  dplyr::filter(.metric == my_metric & !is.na(mean))

best_metric <- min(res$mean, na.rm = TRUE)

res <- res %>%
  dplyr::mutate(
    .best = best_metric,
    .loss = (mean - best_metric) / best_metric * 100
  ) %>%
  dplyr::arrange(desc(K))

best_index <- which(res$.loss == 0)

res %>%
  dplyr::slice(1:best_index) %>%
  dplyr::filter(.loss < my_limit)
#> # A tibble: 3 x 13
#>       K weigh~1 dist_~2   lon   lat .metric .esti~3   mean     n std_err .config
#>   <int> <chr>     <dbl> <int> <int> <chr>   <chr>    <dbl> <int>   <dbl> <chr>  
#> 1    40 triang~   0.167    11     7 rmse    standa~ 0.0778    10 0.00332 Prepro~
#> 2    35 optimal   1.32      8     1 rmse    standa~ 0.0785    10 0.00347 Prepro~
#> 3    33 triwei~   0.511    10     3 rmse    standa~ 0.0728    10 0.00337 Prepro~
#> # ... with 2 more variables: .best <dbl>, .loss <dbl>, and abbreviated variable
#> #   names 1: weight_func, 2: dist_power, 3: .estimator

# Based on the above output, the least complex model within 10% loss of RMSE is actually K = 40,
# not K = 35. However, in the current code, the results are additionally ranked by loss (in
# descending order), and then the first row of the data frame (i.e. the model with the greatest
# loss) is returned.
res %>%
  dplyr::slice(1:best_index) %>%
  dplyr::filter(.loss < my_limit) %>%
  dplyr::arrange(desc(.loss)) %>%
  dplyr::slice(1)
#> # A tibble: 1 x 13
#>       K weigh~1 dist_~2   lon   lat .metric .esti~3   mean     n std_err .config
#>   <int> <chr>     <dbl> <int> <int> <chr>   <chr>    <dbl> <int>   <dbl> <chr>  
#> 1    35 optimal    1.32     8     1 rmse    standa~ 0.0785    10 0.00347 Prepro~
#> # ... with 2 more variables: .best <dbl>, .loss <dbl>, and abbreviated variable
#> #   names 1: weight_func, 2: dist_power, 3: .estimator
Created on 2022-09-21 with reprex v2.0.2
Session info
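As a possible direction for a fix (a sketch only, reusing the objects from the reprex above, not the actual tune internals), the final selection step could keep the user-supplied complexity ordering instead of re-sorting by .loss:

# Possible final step: after restricting to candidates at least as simple as the
# numerically best model and within the loss limit, return the first row of the
# complexity-ordered results rather than re-sorting by .loss.
res %>%
  dplyr::slice(1:best_index) %>%
  dplyr::filter(.loss < my_limit) %>%
  dplyr::slice(1)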