# `group_loo_cv()` and `group_vfold_cv()` (#324)
I personally would be concerned about [...]. @juliasilge can say more, but I think a big part of why [...]. Leave-one-group-out doesn't necessarily have the same concerns, and so doesn't need to subclass [...]. I'm going to defer to @juliasilge entirely on [...]. I like the idea of adding repeats. Thanks for the really thoughtful & clear issue!
Thanks for the detailed response.

Could you elaborate on this? I'm not clear on which type of resampling you're referring to.
So [...]

Now when you're leaving a group out instead, you're performing leave-one-group-out CV, which has more separation between the training sets in the different folds and so produces more stable assessments for tuning. While it's similar to LOO (if every observation has a unique group, it's identical), in practice, assuming that your groups are each a reasonable percentage of the data, I'm not aware of the same concerns about statistical properties. We wouldn't need to block it from [...]
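To make the mechanics concrete, here's a minimal base-R sketch of what leave-one-group-out CV does: every unique group becomes the assessment set for exactly one fold. The `leave_one_group_out()` helper is made up for illustration and is not part of rsample:

```r
# Illustrative sketch of leave-one-group-out CV in base R
# (rsample's group_vfold_cv() handles this far more robustly).
leave_one_group_out <- function(n, groups) {
  stopifnot(length(groups) == n)
  lapply(unique(groups), function(g) {
    assessment <- which(groups == g)  # all rows of one group held out together
    list(analysis   = setdiff(seq_len(n), assessment),
         assessment = assessment)
  })
}

# Example: 6 rows in 3 groups -> 3 folds, one per group
g <- c("a", "a", "b", "b", "c", "c")
folds <- leave_one_group_out(6, g)
length(folds)            # 3
folds[[1]]$assessment    # 1 2  (all of group "a")
```

Note that if every row has its own group, this degenerates to ordinary LOO, which is the "identical in that case" point made above.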
This distinction is (I believe; I don't have insider knowledge or anything) why scikit-learn has both a k-fold and a group k-fold, a shuffle split and a group shuffle split, and a leave-one-out and a leave-one-group-out.
I completely agree on the statistical issues with standard LOO. Following your scikit-learn sampling methods, I would say [...]. I would also say we do not currently have a default "group kfold" because our defaults for [...].

Sorry, the terminology is getting a bit confusing.
Also, FWIW, I think it would be reasonable for the default to be [...]
Yeah, completely agree there.
This is the one where I have a concern. I think a [...].

To be very clear: I'm not against a function with a more obvious name for doing leave-one-group-out, I just think it shouldn't be called [...].
Agreed 😆
We actually do this in spatialsample, but decided not to here in #293. The reasoning is that in spatialsample you often don't know what the maximum number of groups is going to be, whereas here it's always in your data -- it's either [...].

As with the rest of the conversation around [...]
I like the name [...]
Believe we're on the same page there. I'm not making the call about changing [...]
I agree with @mikemahoney218 that we don't want to use the class [...]. I don't want to jump in to changing that default for [...].

In the meantime, @mattwarkentin, would you like to do a PR to the docs for [...]?

```r
library(rsample)
data(ames, package = "modeldata")

## also called "leave-group-out":
group_vfold_cv(ames, group = Neighborhood)
#> # Group 28-fold cross-validation
#> # A tibble: 28 × 2
#>    splits             id
#>    <list>             <chr>
#>  1 <split [2882/48]>  Resample01
#>  2 <split [2765/165]> Resample02
#>  3 <split [2886/44]>  Resample03
#>  4 <split [2748/182]> Resample04
#>  5 <split [2900/30]>  Resample05
#>  6 <split [2805/125]> Resample06
#>  7 <split [2764/166]> Resample07
#>  8 <split [2663/267]> Resample08
#>  9 <split [2799/131]> Resample09
#> 10 <split [2691/239]> Resample10
#> # … with 18 more rows
```

Created on 2022-07-01 by the reprex package (v2.0.1)
Okay, just catching up on this issue and the related issues/PRs. Everything sounds good and it sounds like we are all on the same page. Are you still interested in a PR to update the documentation in the ways described above?
If you have the time & interest, definitely!
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
Hi @mikemahoney218 and @juliasilge,

Super excited for all of the great work @mikemahoney218 has been doing to boost {rsample} group sampling support.

I was thinking about this last night... it seems to me that the current default implementation of `group_vfold_cv()` is really the `group_*` counterpart to `loo_cv()`. With `v = NULL` (the default for `group_vfold_cv()`), you get group LOO-CV sampling. Over time this may be confusing, as users may grow to expect the `group_*` version of a function to return a sampling pattern similar to the default for its non-grouped sibling.

I am wondering if it makes sense to make the following changes to achieve a logically consistent API for grouped and non-grouped pairs:
Add `group_loo_cv()`, based on the current default implementation of `group_vfold_cv(v = NULL)`. With no changes to `group_vfold_cv()`, this could be as simple as: [...] That way users will find a `group_*` implementation of all relevant standard sampling methods, which I think helps the mental model for the grouped versions.
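The snippet the author had in mind didn't survive extraction; as a sketch of the kind of thin wrapper being proposed (not rsample's actual code, and how the `group` argument should be forwarded through rsample's non-standard evaluation is an open detail), it might look like:

```r
# Hypothetical wrapper: group LOO-CV as a thin layer over group_vfold_cv().
# Assumes rsample is installed; `group` is taken as a character column name
# here purely to sidestep NSE details in the sketch.
group_loo_cv <- function(data, group, ...) {
  # v = NULL makes group_vfold_cv() produce one fold per group,
  # i.e. leave-one-group-out CV.
  rsample::group_vfold_cv(data, group = group, v = NULL, ...)
}
```

Treat this as a shape sketch only; whether a string column name is resolved correctly depends on how rsample evaluates its `group` argument.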
Change the default `v` for `group_vfold_cv()` to be equivalent to that of `vfold_cv()` (i.e. `v = 10`), and also add a `repeats = 1` argument. I think over time people will come to expect that the `group_*` implementation is otherwise equivalent to the non-grouped version, except for the grouping. For `group_vfold_cv()` this currently isn't the case, where the default implementation is actually returning grouped `loo_cv()`. I realize LOO is really a special case of standard v-fold CV and users can opt in to `vfold_cv()`-like sampling, but it's not currently the default, which may cause mental friction.

I just think for consistency it makes sense for
`group_vfold_cv()` to have a signature like: [...]

This would be a behavioural change that could throw a warning/message for some time, and then eventually the warning/message could be removed. I think I just like the idea of `group_*` functions being otherwise equivalent sampling schemes to their non-group counterparts.

Thoughts?
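The proposed signature also didn't survive extraction; from the surrounding discussion, the proposal seems to amount to something like the stub below. Argument names beyond `v` and `repeats` are my guesses, and this is not rsample's actual API:

```r
# Hypothetical signature for the proposal -- defaults mirror vfold_cv():
# v = 10 folds and repeats = 1, with grouping layered on top.
group_vfold_cv <- function(data, group = NULL, v = 10, repeats = 1, ...) {
  # Body intentionally omitted: only the defaults are the point here.
  stop("signature sketch only")
}
```

Under this sketch, `group_vfold_cv(data, group)` would behave like a grouped `vfold_cv()`, while the current leave-one-group-out behaviour would live behind `v = NULL` (or the proposed `group_loo_cv()`).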