issue with select_by_pct_loss #543

Closed
frankhezemans opened this issue Sep 21, 2022 · 2 comments · Fixed by #568
Labels
bug an unexpected problem or unintended behavior

Comments

@frankhezemans

The problem

I believe there is a bug in select_by_pct_loss: it returns the model with the greatest loss within the limit, not necessarily the simplest model whose loss is within the limit. The cause seems to be in the last few lines of the function, where the models whose loss is within the limit are sorted by loss in descending order and the first row of the data frame (i.e. the model with the greatest loss) is returned.

Reproducible example

library(tidymodels)
#> Warning: package 'purrr' was built under R version 4.0.5
data("example_ames_knn")

my_metric <- "rmse"
my_limit <- 10

# the following should return the least complex model (i.e. highest value of K) that has no more
# than a 10% loss of RMSE:
select_by_pct_loss(
  ames_grid_search,
  metric = my_metric,
  limit = my_limit,
  desc(K)
)
#> # A tibble: 1 x 13
#>       K weigh~1 dist_~2   lon   lat .metric .esti~3   mean     n std_err .config
#>   <int> <chr>     <dbl> <int> <int> <chr>   <chr>    <dbl> <int>   <dbl> <chr>  
#> 1    35 optimal    1.32     8     1 rmse    standa~ 0.0785    10 0.00347 Prepro~
#> # ... with 2 more variables: .best <dbl>, .loss <dbl>, and abbreviated variable
#> #   names 1: weight_func, 2: dist_power, 3: .estimator

# However, doing this manually gives a different result:

res <-
  collect_metrics(ames_grid_search) %>%
  dplyr::filter(.metric == my_metric & !is.na(mean))

best_metric <- min(res$mean, na.rm = TRUE)

res <-
  res %>%
  dplyr::mutate(
    .best = best_metric,
    .loss = (mean - best_metric) / best_metric * 100
  ) %>%
  dplyr::arrange(desc(K))

best_index <- which(res$.loss == 0)

res %>%
  dplyr::slice(1:best_index) %>%
  dplyr::filter(.loss < my_limit)
#> # A tibble: 3 x 13
#>       K weigh~1 dist_~2   lon   lat .metric .esti~3   mean     n std_err .config
#>   <int> <chr>     <dbl> <int> <int> <chr>   <chr>    <dbl> <int>   <dbl> <chr>  
#> 1    40 triang~   0.167    11     7 rmse    standa~ 0.0778    10 0.00332 Prepro~
#> 2    35 optimal   1.32      8     1 rmse    standa~ 0.0785    10 0.00347 Prepro~
#> 3    33 triwei~   0.511    10     3 rmse    standa~ 0.0728    10 0.00337 Prepro~
#> # ... with 2 more variables: .best <dbl>, .loss <dbl>, and abbreviated variable
#> #   names 1: weight_func, 2: dist_power, 3: .estimator

# Based on the above output, the least complex model within 10% loss of RMSE is actually K=40,
# not K=35. However, in the current code, the results are additionally ranked by loss (in
# descending order), and then the first row of the data frame (i.e. model with greatest loss) is
# returned.

res %>%
  dplyr::slice(1:best_index) %>%
  dplyr::filter(.loss < my_limit) %>%
  dplyr::arrange(desc(.loss)) %>%
  dplyr::slice(1)
#> # A tibble: 1 x 13
#>       K weigh~1 dist_~2   lon   lat .metric .esti~3   mean     n std_err .config
#>   <int> <chr>     <dbl> <int> <int> <chr>   <chr>    <dbl> <int>   <dbl> <chr>  
#> 1    35 optimal    1.32     8     1 rmse    standa~ 0.0785    10 0.00347 Prepro~
#> # ... with 2 more variables: .best <dbl>, .loss <dbl>, and abbreviated variable
#> #   names 1: weight_func, 2: dist_power, 3: .estimator

Created on 2022-09-21 with reprex v2.0.2
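
For illustration only, here is a minimal sketch of the behavior I would expect: keep the simplicity ordering and take the first row within the limit, without re-sorting by loss. This is not the code in tune and not necessarily the change made in #568. The helper name `pick_simplest_within_limit` is hypothetical, it assumes a metric where lower is better (such as RMSE) and a data frame already sorted from simplest to most complex, and it uses base R rather than dplyr to stay dependency-free. The toy values are the three rows printed above.

```r
# Hypothetical helper (not part of tune): return the simplest candidate whose
# percent loss relative to the best model is below `limit`.
# Assumptions: `res` has a numeric `mean` column where lower is better, and its
# rows are already sorted from simplest to most complex.
pick_simplest_within_limit <- function(res, limit) {
  best <- min(res$mean, na.rm = TRUE)
  res$.best <- best
  res$.loss <- (res$mean - best) / best * 100

  # Only consider models that are at least as simple as the best one.
  best_index <- which(res$.loss == 0)[1]
  candidates <- res[seq_len(best_index), , drop = FALSE]

  # Keep those within the limit, preserving the simplicity ordering.
  candidates <- candidates[candidates$.loss < limit, , drop = FALSE]

  # The first remaining row is the simplest acceptable model.
  candidates[1, , drop = FALSE]
}

# Toy data: K and mean RMSE copied from the output above, sorted by desc(K).
toy <- data.frame(
  K    = c(40L, 35L, 33L),
  mean = c(0.0778, 0.0785, 0.0728)
)

picked <- pick_simplest_within_limit(toy, limit = 10)
print(picked$K)
```

With these three rows the helper picks K = 40, whereas sorting the candidates by descending loss first would pick K = 35.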

Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.0.2 (2020-06-22)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       Europe/Berlin
#>  date     2022-09-21
#>  pandoc   2.11.4 @ C:/Program Files/RStudio/bin/pandoc/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  ! package      * version    date (UTC) lib source
#>  P assertthat     0.2.1      2019-03-21 [?] CRAN (R 4.0.5)
#>  P backports      1.4.1      2021-12-13 [?] CRAN (R 4.0.5)
#>  P broom        * 1.0.1      2022-08-29 [?] CRAN (R 4.0.2)
#>    class          7.3-17     2020-04-26 [2] CRAN (R 4.0.2)
#>  P cli            3.4.0      2022-09-08 [?] CRAN (R 4.0.2)
#>    codetools      0.2-16     2018-12-24 [2] CRAN (R 4.0.2)
#>  P colorspace     2.0-3      2022-02-21 [?] CRAN (R 4.0.5)
#>  P DBI            1.1.3      2022-06-18 [?] CRAN (R 4.0.2)
#>  P dials        * 1.0.0      2022-06-14 [?] CRAN (R 4.0.2)
#>  P DiceDesign     1.9        2021-02-13 [?] CRAN (R 4.0.5)
#>  P digest         0.6.29     2021-12-01 [?] CRAN (R 4.0.5)
#>  P dplyr        * 1.0.10     2022-09-01 [?] CRAN (R 4.0.2)
#>  P ellipsis       0.3.2      2021-04-29 [?] CRAN (R 4.0.5)
#>  P evaluate       0.16       2022-08-09 [?] CRAN (R 4.0.2)
#>  P fansi          1.0.3      2022-03-24 [?] CRAN (R 4.0.5)
#>  P fastmap        1.1.0      2021-01-25 [?] CRAN (R 4.0.5)
#>  P foreach        1.5.2      2022-02-02 [?] CRAN (R 4.0.5)
#>  P fs             1.5.2      2021-12-08 [?] CRAN (R 4.0.5)
#>  P furrr          0.3.1      2022-08-15 [?] CRAN (R 4.0.2)
#>  P future         1.28.0     2022-09-02 [?] CRAN (R 4.0.2)
#>  P future.apply   1.9.1      2022-09-07 [?] CRAN (R 4.0.2)
#>  P generics       0.1.3      2022-07-05 [?] CRAN (R 4.0.2)
#>  P ggplot2      * 3.3.6      2022-05-03 [?] CRAN (R 4.0.2)
#>  P globals        0.16.1     2022-08-28 [?] CRAN (R 4.0.2)
#>  P glue           1.6.2      2022-02-24 [?] CRAN (R 4.0.5)
#>  P gower          1.0.0      2022-02-03 [?] CRAN (R 4.0.5)
#>  P GPfit          1.0-8      2019-02-08 [?] CRAN (R 4.0.5)
#>  P gtable         0.3.1      2022-09-01 [?] CRAN (R 4.0.2)
#>  P hardhat        1.2.0      2022-06-30 [?] CRAN (R 4.0.2)
#>  P highr          0.9        2021-04-16 [?] CRAN (R 4.0.5)
#>  P htmltools      0.5.3      2022-07-18 [?] CRAN (R 4.0.2)
#>  P infer        * 1.0.3      2022-08-22 [?] CRAN (R 4.0.2)
#>  P ipred          0.9-13     2022-06-02 [?] CRAN (R 4.0.2)
#>  P iterators      1.0.14     2022-02-05 [?] CRAN (R 4.0.5)
#>  P knitr          1.40       2022-08-24 [?] CRAN (R 4.0.2)
#>    lattice        0.20-41    2020-04-02 [2] CRAN (R 4.0.2)
#>  P lava           1.6.10     2021-09-02 [?] CRAN (R 4.0.5)
#>  P lhs            1.1.5      2022-03-22 [?] CRAN (R 4.0.5)
#>  P lifecycle      1.0.2      2022-09-09 [?] CRAN (R 4.0.2)
#>  P listenv        0.8.0      2019-12-05 [?] CRAN (R 4.0.5)
#>  P lubridate      1.8.0      2021-10-07 [?] CRAN (R 4.0.5)
#>  P magrittr       2.0.3      2022-03-30 [?] CRAN (R 4.0.5)
#>    MASS           7.3-51.6   2020-04-26 [2] CRAN (R 4.0.2)
#>    Matrix         1.2-18     2019-11-27 [2] CRAN (R 4.0.2)
#>  P modeldata    * 1.0.1      2022-09-06 [?] CRAN (R 4.0.2)
#>  P munsell        0.5.0      2018-06-12 [?] CRAN (R 4.0.5)
#>    nnet           7.3-14     2020-04-26 [2] CRAN (R 4.0.2)
#>  P parallelly     1.32.1     2022-07-21 [?] CRAN (R 4.0.2)
#>  P parsnip      * 1.0.1      2022-08-18 [?] CRAN (R 4.0.2)
#>  P pillar         1.8.1      2022-08-19 [?] CRAN (R 4.0.2)
#>  P pkgconfig      2.0.3      2019-09-22 [?] CRAN (R 4.0.5)
#>  P prodlim        2019.11.13 2019-11-17 [?] CRAN (R 4.0.5)
#>  P purrr        * 0.3.4      2020-04-17 [?] CRAN (R 4.0.5)
#>  P R6             2.5.1      2021-08-19 [?] CRAN (R 4.0.5)
#>  P Rcpp           1.0.9      2022-07-08 [?] CRAN (R 4.0.2)
#>  P recipes      * 1.0.1      2022-07-07 [?] CRAN (R 4.0.2)
#>  P reprex         2.0.2      2022-08-17 [?] CRAN (R 4.0.2)
#>  P rlang          1.0.5      2022-08-31 [?] CRAN (R 4.0.2)
#>  P rmarkdown      2.16       2022-08-24 [?] CRAN (R 4.0.2)
#>    rpart          4.1-15     2019-04-12 [2] CRAN (R 4.0.2)
#>  P rsample      * 1.1.0      2022-08-08 [?] CRAN (R 4.0.2)
#>  P rstudioapi     0.14       2022-08-22 [?] CRAN (R 4.0.2)
#>  P scales       * 1.2.1      2022-08-20 [?] CRAN (R 4.0.2)
#>  P sessioninfo    1.2.2      2021-12-06 [?] CRAN (R 4.0.5)
#>  P stringi        1.7.8      2022-07-11 [?] CRAN (R 4.0.2)
#>  P stringr        1.4.1      2022-08-20 [?] CRAN (R 4.0.2)
#>    survival       3.1-12     2020-04-10 [2] CRAN (R 4.0.2)
#>  P tibble       * 3.1.8      2022-07-22 [?] CRAN (R 4.0.2)
#>  P tidymodels   * 1.0.0      2022-07-13 [?] CRAN (R 4.0.2)
#>  P tidyr        * 1.2.1      2022-09-08 [?] CRAN (R 4.0.2)
#>  P tidyselect     1.1.2      2022-02-21 [?] CRAN (R 4.0.5)
#>  P timeDate       4021.104   2022-07-19 [?] CRAN (R 4.0.2)
#>  P tune         * 1.0.0      2022-07-07 [?] CRAN (R 4.0.2)
#>  P utf8           1.2.2      2021-07-24 [?] CRAN (R 4.0.5)
#>  P vctrs          0.4.1      2022-04-13 [?] CRAN (R 4.0.5)
#>  P withr          2.5.0      2022-03-03 [?] CRAN (R 4.0.5)
#>  P workflows    * 1.0.0      2022-07-05 [?] CRAN (R 4.0.2)
#>  P workflowsets * 1.0.0      2022-07-12 [?] CRAN (R 4.0.2)
#>  P xfun           0.33       2022-09-12 [?] CRAN (R 4.0.2)
#>  P yaml           2.3.5      2022-02-21 [?] CRAN (R 4.0.5)
#>  P yardstick    * 1.1.0      2022-09-07 [?] CRAN (R 4.0.2)
#> 
#>  [1] C:/Users/frahez/Documents/predict_ki/renv/library/R-4.0/x86_64-w64-mingw32
#>  [2] C:/Program Files/R/R-4.0.2/library
#> 
#>  P -- Loaded and on-disk path mismatch.
#> 
#> ------------------------------------------------------------------------------
@EmilHvitfeldt added the bug label Sep 21, 2022
@simonpcouch
Contributor

Thank you for the reprex and thorough description! Just put in a fix for this.☃️

@github-actions

github-actions bot commented Nov 4, 2022

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Nov 4, 2022