more flexibility with stratification / grouped sampling #211

Closed
ColinConwell opened this issue Jan 26, 2021 · 4 comments · Fixed by #229
Labels
feature a feature request or enhancement

Comments

@ColinConwell

Thank you for all your hard work on rsample. I know this has been a topic of some debate in the past (and I do see a number of closed and open issues pertaining to this), but I've been finding the combination of single stratification and pooling ceiling in rsample to be noxiously limiting. Working in an empirical domain where we often have condition-rich designs, but small samples (e.g. neuroimaging), it's imperative we be able to perform stratified resampling with a bit more flexibility across multiple groups.

I'm consistently running into problems that result in "Warning message: Too little data to stratify", despite knowing exactly how much data I expect to be in each stratum and being willing to accept the limitations thereof.

Since the deprecation of broom's bootstrap (which allowed resampling on a grouped tibble), rsample is increasingly the main package that facilitates these operations. I definitely empathize with many of the issues that result from giving the user more freedom to specify their stratification strategy, but the alternative means I'm effectively having to reinvent the wheel to get the flexibility I need, which seems counterproductive.

Perhaps just a series of very robust warnings would be sufficient to wash your hands of the issues that result from users abusing this flexibility? My thanks in advance for your consideration. I appreciate it!

@juliasilge
Member

juliasilge commented Jan 28, 2021

Thanks so much for this feedback @ColinConwell. As we look back at how this feature has been used by folks with various constraints, we are now considering how to expose that argument to users so it could be changed in some situations.

We'll want to generate warnings based on our opinionated take of what is "too low" and include documentation to indicate that lowering the argument (currently pool in make_strata()) may result in... 💣 ☢️ ☠️

@juliasilge juliasilge added feature a feature request or enhancement and removed discussion labels Jan 28, 2021
@ColinConwell
Author

ColinConwell commented Jan 29, 2021

Totally understandable! I think a strong warning and a conservative default in this case would be sufficient to buttress the opinion, but the flexibility, I think, is definitely key as well. If the user persists past what is reasonable or pragmatic, it's not then a fault of the software. The ability to set pool should also, I reckon, cover the vast majority of use cases I was considering. Thanks again for the reply.

@juliasilge
Member

Thanks for your patience @ColinConwell! 🙌 You can now get this feature by installing from GitHub:

devtools::install_github("tidymodels/rsample")

It is now implemented for the main user-facing resampling functions such as vfold_cv(), mc_cv(), and friends:

library(tidyverse)
library(rsample)

df <- tibble(x = rnorm(60), label = rep(letters[1:12], each = 5))

mc_cv(df, strata = label)
#> Warning: Too little data to stratify. Unstratified resampling
#> # Monte Carlo cross-validation (0.75/0.25) with 25 resamples  using stratification 
#> # A tibble: 25 x 2
#>    splits          id        
#>    <list>          <chr>     
#>  1 <split [45/15]> Resample01
#>  2 <split [45/15]> Resample02
#>  3 <split [45/15]> Resample03
#>  4 <split [45/15]> Resample04
#>  5 <split [45/15]> Resample05
#>  6 <split [45/15]> Resample06
#>  7 <split [45/15]> Resample07
#>  8 <split [45/15]> Resample08
#>  9 <split [45/15]> Resample09
#> 10 <split [45/15]> Resample10
#> # … with 15 more rows
mc_cv(df, strata = label, pool = 0.05)
#> Warning: Stratifying groups that make up 5% of the data may be statistically risky.
#> Consider increasing `pool` to at least 0.1
#> # Monte Carlo cross-validation (0.75/0.25) with 25 resamples  using stratification 
#> # A tibble: 25 x 2
#>    splits          id        
#>    <list>          <chr>     
#>  1 <split [36/24]> Resample01
#>  2 <split [36/24]> Resample02
#>  3 <split [36/24]> Resample03
#>  4 <split [36/24]> Resample04
#>  5 <split [36/24]> Resample05
#>  6 <split [36/24]> Resample06
#>  7 <split [36/24]> Resample07
#>  8 <split [36/24]> Resample08
#>  9 <split [36/24]> Resample09
#> 10 <split [36/24]> Resample10
#> # … with 15 more rows

Created on 2021-03-18 by the reprex package (v1.0.0)
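
The difference between the two calls above can be checked by hand. The following is a minimal base R sketch, assuming `pool` is compared against each stratum's share of the rows (which is what the warning text suggests; the actual logic inside rsample may differ in detail). `below_pool()` is a hypothetical helper written for illustration and is not part of rsample.

```r
# Same labels as in the reprex: 12 levels with 5 rows each (60 rows total).
label <- rep(letters[1:12], each = 5)

# Share of the data that falls in each stratum (5/60 for every level here).
props <- prop.table(table(label))

# Hypothetical helper (not an rsample function): names of the strata whose
# share of the data is smaller than the `pool` threshold.
below_pool <- function(props, pool) {
  names(props)[as.vector(props < pool)]
}

# With the default threshold of 0.1, every stratum is too small.
n_default <- length(below_pool(props, pool = 0.1))

# With the threshold lowered to 0.05, no stratum is too small.
n_lowered <- length(below_pool(props, pool = 0.05))

print(c(default = n_default, lowered = n_lowered))
```

Under that assumption, each level holds 5/60 (about 8.3%) of the rows, so all 12 fall below the default threshold and none fall below 0.05, which matches the unstratified result in the first call and the stratified 36/24 splits in the second.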

@github-actions

github-actions bot commented Apr 2, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 2, 2021