more flexibility with stratification / grouped sampling #211
Thanks so much for this feedback @ColinConwell. As we look back at how this feature has been used by folks with various constraints, we are now considering how to expose that argument to users so it could be changed in some situations. We'll want to generate warnings based on our opinionated take of what is "too low" and include documentation to indicate that lowering the argument (currently
Totally understandable! I think a strong warning and a conservative default in this case would be sufficient to buttress the opinion, but the flexibility, I think, is definitely key as well. If the user persists past what is reasonable or pragmatic, it's not then a fault of the software. The ability to set `pool` should also, I reckon, cover the vast majority of use cases I was considering. Thanks again for the reply.
Thanks for your patience @ColinConwell! 🙌 You can now get this feature by installing from GitHub:
devtools::install_github("tidymodels/rsample")
It is now implemented for the main user-facing resampling functions; for example:
library(tidyverse)
library(rsample)
df <- tibble(x = rnorm(60), label = rep(letters[1:12], each = 5))
mc_cv(df, strata = label)
#> Warning: Too little data to stratify. Unstratified resampling
#> # Monte Carlo cross-validation (0.75/0.25) with 25 resamples using stratification
#> # A tibble: 25 x 2
#> splits id
#> <list> <chr>
#> 1 <split [45/15]> Resample01
#> 2 <split [45/15]> Resample02
#> 3 <split [45/15]> Resample03
#> 4 <split [45/15]> Resample04
#> 5 <split [45/15]> Resample05
#> 6 <split [45/15]> Resample06
#> 7 <split [45/15]> Resample07
#> 8 <split [45/15]> Resample08
#> 9 <split [45/15]> Resample09
#> 10 <split [45/15]> Resample10
#> # … with 15 more rows
mc_cv(df, strata = label, pool = 0.05)
#> Warning: Stratifying groups that make up 5% of the data may be statistically risky.
#> Consider increasing `pool` to at least 0.1
#> # Monte Carlo cross-validation (0.75/0.25) with 25 resamples using stratification
#> # A tibble: 25 x 2
#> splits id
#> <list> <chr>
#> 1 <split [36/24]> Resample01
#> 2 <split [36/24]> Resample02
#> 3 <split [36/24]> Resample03
#> 4 <split [36/24]> Resample04
#> 5 <split [36/24]> Resample05
#> 6 <split [36/24]> Resample06
#> 7 <split [36/24]> Resample07
#> 8 <split [36/24]> Resample08
#> 9 <split [36/24]> Resample09
#> 10 <split [36/24]> Resample10
#> # … with 15 more rows
Created on 2021-03-18 by the reprex package (v1.0.0)
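The difference between the two calls above comes down to how large each stratum is relative to `pool`. The following is a rough base R sketch of that comparison, not rsample's internal code: the helper `all_below_pool()` is hypothetical, and the 0.1 threshold is taken from the warning text above. Each of the 12 labels holds 5 of 60 rows, roughly 8.3% of the data, which sits under 0.1 but above 0.05.

```r
# Same label layout as the reprex: 12 labels with 5 rows each.
label <- rep(letters[1:12], each = 5)

# Share of the data held by each stratum (5 / 60 for every label here).
stratum_share <- as.numeric(table(label)) / length(label)

# Hypothetical helper, not part of rsample: TRUE when every stratum
# makes up less than `pool` of the data.
all_below_pool <- function(share, pool) {
  all(share < pool)
}

# With a threshold of 0.1 every stratum is too small ...
all_below_pool(stratum_share, pool = 0.1)

# ... while a threshold of 0.05 leaves every stratum large enough.
all_below_pool(stratum_share, pool = 0.05)
```

Under this reading, the first call falls back to unstratified resampling because no label reaches the threshold, and the second call stratifies but warns that the groups are small.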
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
Thank you for all your hard work on rsample. I know this has been a topic of some debate in the past (and I do see a number of closed and open issues pertaining to this), but I've been finding the combination of single stratification and pooling ceiling in rsample to be noxiously limiting. Working in an empirical domain where we often have condition-rich designs, but small samples (e.g. neuroimaging), it's imperative we be able to perform stratified resampling with a bit more flexibility across multiple groups.
I'm consistently running into problems that result in "Warning message: Too little data to stratify", despite knowing exactly how much data I expect to be in each stratum and being willing to accept the limitations thereof.
Since the deprecation of broom's bootstrap (which allowed resampling on a grouped tibble), rsample is increasingly the main package that facilitates these operations. I definitely empathize with many of the issues that result from giving the user more freedom to specify their stratification strategy, but the opposite means I'm having to effectively reinvent the wheel to get the flexibility I need, which seems counterproductive.
Perhaps just a series of very robust warnings will be sufficient to wash your hands of the issues that result from users abusing this flexibility? My thanks in advance for your consideration. I appreciate it!
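For readers wondering what the hand-rolled workaround described above might look like, here is a minimal base R sketch, assuming a small condition-rich design. The column names `condition`, `session`, and `stratum` are made up for illustration, and this is not the approach rsample itself takes. It collapses two grouping columns into one stratum and draws a bootstrap sample within each stratum.

```r
set.seed(1)

# Hypothetical design: 3 conditions x 2 sessions, 5 observations per cell.
design <- expand.grid(
  condition = c("a", "b", "c"),
  session = c("s1", "s2"),
  rep = 1:5
)
design$y <- rnorm(nrow(design))

# Collapse the two grouping columns into a single stratum label.
design$stratum <- interaction(design$condition, design$session, drop = TRUE)

# Draw row indices with replacement inside each stratum. Indexing through
# sample.int() avoids the sample() surprise when a stratum has one row.
rows_by_stratum <- split(seq_len(nrow(design)), design$stratum)
boot_rows <- unlist(lapply(rows_by_stratum, function(rows) {
  rows[sample.int(length(rows), replace = TRUE)]
}))

boot <- design[boot_rows, ]

# Every stratum keeps its original size in the resample.
table(boot$stratum)
```

A combined column like `stratum` could also be passed as the single `strata` argument to an rsample function, although small cells would still be subject to the `pool` check discussed in this thread.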