The problem

I believe there is a bug in select_by_pct_loss, such that it returns the model with the greatest loss within the limit, not necessarily the simplest model whose loss is within the limit. The problem seems to be that in the last few lines of the function, the models whose loss is within the limit are ranked by loss in descending order, and then the first row of the data frame (i.e. the model with the greatest loss) is returned.
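To make the difference concrete, here is a minimal, self-contained sketch of the two selection strategies. The toy results table and its values are made up purely for illustration; this is not the actual tune source code.

library(dplyr)

# Toy results table, already arranged by the user-supplied complexity ordering
# (desc(K), so the least complex candidate comes first); values are illustrative only.
res_toy <- tibble::tibble(
  K     = c(40, 35, 33),
  .loss = c(3.2, 7.8, 0)
)

# Behaviour described above: re-sort by .loss (descending) and take the first row,
# which returns the model with the greatest loss within the limit (K = 35).
res_toy %>%
  filter(.loss < 10) %>%
  arrange(desc(.loss)) %>%
  slice(1)

# Expected behaviour: keep the complexity ordering and take the first row,
# which returns the least complex model within the limit (K = 40).
res_toy %>%
  filter(.loss < 10) %>%
  slice(1)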
Reproducible example
library(tidymodels)
#> Warning: package 'purrr' was built under R version 4.0.5
data("example_ames_knn")
my_metric<-"rmse"my_limit<-10# the following should return the least complex model (i.e. highest value of K) that has no more# than a 10% loss of RMSE:
select_by_pct_loss(
  ames_grid_search,
  metric = my_metric,
  limit = my_limit,
  desc(K)
)
#> # A tibble: 1 x 13
#>       K weigh~1 dist_~2   lon   lat .metric .esti~3   mean     n std_err .config
#>   <int> <chr>     <dbl> <int> <int> <chr>   <chr>    <dbl> <int>   <dbl> <chr>  
#> 1    35 optimal    1.32     8     1 rmse    standa~ 0.0785    10 0.00347 Prepro~
#> # ... with 2 more variables: .best <dbl>, .loss <dbl>, and abbreviated variable
#> #   names 1: weight_func, 2: dist_power, 3: .estimator

# However, doing this manually gives a different result:
res <-
  collect_metrics(ames_grid_search) %>%
  dplyr::filter(.metric == my_metric & !is.na(mean))

best_metric <- min(res$mean, na.rm = TRUE)

res <- res %>%
  dplyr::mutate(
    .best = best_metric,
    .loss = (mean - best_metric) / best_metric * 100
  ) %>%
  dplyr::arrange(desc(K))

best_index <- which(res$.loss == 0)

res %>%
  dplyr::slice(1:best_index) %>%
  dplyr::filter(.loss < my_limit)
#> # A tibble: 3 x 13
#>       K weigh~1 dist_~2   lon   lat .metric .esti~3   mean     n std_err .config
#>   <int> <chr>     <dbl> <int> <int> <chr>   <chr>    <dbl> <int>   <dbl> <chr>  
#> 1    40 triang~   0.167    11     7 rmse    standa~ 0.0778    10 0.00332 Prepro~
#> 2    35 optimal   1.32      8     1 rmse    standa~ 0.0785    10 0.00347 Prepro~
#> 3    33 triwei~   0.511    10     3 rmse    standa~ 0.0728    10 0.00337 Prepro~
#> # ... with 2 more variables: .best <dbl>, .loss <dbl>, and abbreviated variable
#> #   names 1: weight_func, 2: dist_power, 3: .estimator

# Based on the above output, the least complex model within 10% loss of RMSE is actually K = 40,
# not K = 35. However, in the current code, the results are additionally ranked by loss (in
# descending order), and then the first row of the data frame (i.e. the model with the greatest
# loss) is returned.
res %>%
  dplyr::slice(1:best_index) %>%
  dplyr::filter(.loss < my_limit) %>%
  dplyr::arrange(desc(.loss)) %>%
  dplyr::slice(1)
#> # A tibble: 1 x 13
#>       K weigh~1 dist_~2   lon   lat .metric .esti~3   mean     n std_err .config
#>   <int> <chr>     <dbl> <int> <int> <chr>   <chr>    <dbl> <int>   <dbl> <chr>  
#> 1    35 optimal    1.32     8     1 rmse    standa~ 0.0785    10 0.00347 Prepro~
#> # ... with 2 more variables: .best <dbl>, .loss <dbl>, and abbreviated variable
#> #   names 1: weight_func, 2: dist_power, 3: .estimator
Created on 2022-09-21 with reprex v2.0.2
Session info
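As a possible direction for a fix (a sketch only, reusing the objects from the reprex above, not the actual tune internals), the final selection step could keep the user-supplied complexity ordering instead of re-sorting by .loss:

# Possible final step: after restricting to candidates at least as simple as the
# numerically best model and within the loss limit, return the first row of the
# complexity-ordered results rather than re-sorting by .loss.
res %>%
  dplyr::slice(1:best_index) %>%
  dplyr::filter(.loss < my_limit) %>%
  dplyr::slice(1)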