
group_loo_cv() and group_vfold_cv() #324

Closed
mattwarkentin opened this issue Jun 30, 2022 · 13 comments · Fixed by #339
@mattwarkentin (Contributor) commented Jun 30, 2022

Hi @mikemahoney218 and @juliasilge,

Super excited for all of the great work @mikemahoney218 has been doing to boost {rsample} group sampling support.

I was thinking about this last night... it seems to me that the current default implementation of group_vfold_cv() is really the group_* counterpart to loo_cv(). With v = NULL (the default for group_vfold_cv()), you get grouped LOO-CV sampling. Over time this may be confusing, as users will grow to expect the group_* version of a function to return a sampling pattern similar to the default for its non-grouped sibling.

I am wondering if it makes sense to make the following changes to achieve a logically consistent API for grouped and non-grouped pairs:

  • Create group_loo_cv() based on the current default implementation of group_vfold_cv(v = NULL)

With no changes to group_vfold_cv(), this could be as simple as:

group_loo_cv <- function(data, group, ...) {
  # leave-one-group-out is the current default behaviour of group_vfold_cv()
  group_vfold_cv(data, group, v = NULL, ...)
}

That way users will find a group_* implementation of all relevant standard sampling methods, which I think helps the mental model for the grouped versions.

  • However, I wonder if it's worth changing the default value of v for group_vfold_cv() to match that of vfold_cv() (i.e. v = 10), and also adding a repeats = 1 argument. I think over time people will come to expect that the group_* implementation is otherwise equivalent to the non-grouped version, except for the grouping. For group_vfold_cv() this currently isn't the case: the default implementation actually returns grouped loo_cv(). I realize LOO is really a special case of standard v-fold CV and users can opt in to vfold_cv()-like sampling, but it's not currently the default, which may cause mental friction.

I just think for consistency it makes sense for group_vfold_cv() to have a signature like:

group_vfold_cv <- function(data, group, v = 10, repeats = 1, ...) {
  ...
}

This would be a behavioural change that could emit a warning/message for a deprecation period before the message is eventually removed. I just like the idea of group_* functions being otherwise equivalent sampling schemes to their non-grouped counterparts.
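A minimal sketch of what such a deprecation period could look like, assuming a missing()-based check (the function name and body are illustrative stand-ins, not rsample's actual code):

```r
# Hypothetical migration shim, not rsample's implementation: warn when the
# caller relies on the old default, then proceed with the new v = 10 default.
group_vfold_cv_proposed <- function(data, group, v = 10, repeats = 1, ...) {
  if (missing(v)) {
    message(
      "The default `v` is now 10 (was NULL, i.e. one fold per group). ",
      "Pass `v = NULL` explicitly to keep leave-one-group-out sampling."
    )
  }
  list(v = v, repeats = repeats)  # stand-in for the real resampling logic
}
```

After the deprecation window, the missing() branch would simply be deleted.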

Thoughts?

@mikemahoney218 (Member) commented Jun 30, 2022

I personally would be concerned about group_loo_cv() performing a different type of resampling than loo_cv(), and wouldn't want those functions to have similar names.

@juliasilge can say more, but I think a big part of why loo_cv() exists is to make it hard to use leave-one-out in things like tuning (see for instance the expected error in tune). That's because of continuing debate on the statistical properties of LOO.

Leave-one-group-out doesn't necessarily have the same concerns, and so doesn't need to subclass loo_cv like the other methods subclass their namesakes (#318). This method also has a lot of different names depending on what "group" is (for instance, leave-location-out, leave-time-out, leave-location-and-time-out). We call it spatial_leave_location_out_cv() in spatialsample because that's a domain-focused package, but maybe in rsample something along the lines of leave_one_group_out() would make sense? That breaks the naming scheme of group_*(), which I don't love, but distances the function from loo_cv(), which I think is necessary.

I'm going to defer to @juliasilge entirely on v -- I agree that the inconsistency is bad, but I don't know that I'd personally want to break anyone's code in order to fix it. The other caveat here is that v > length(unique(data$group)) throws an error, so if you're trying to perform resampling with fewer than 10 groups in your data the default arguments wouldn't work.

I like the idea of adding repeats.

Thanks for the really thoughtful & clear issue!

@mattwarkentin (Contributor, Author)

Thanks for the detailed response.

> I personally would be concerned about group_loo_cv() performing a different type of resampling than loo_cv(), and wouldn't want those functions to have similar names.

Could you elaborate on this? I am unclear on the different type of resampling you are referring to.

@mikemahoney218 (Member)

So loo_cv() is performing leave-one-observation-out CV, which has some challenges because all of your training sets are going to be nearly 100% correlated with each other. This means it's not great to use LOO for tuning -- quoting from this paper (which is quoted here):

> However, if model selection is involved, the performance of LOO worsens in variability as the model selection uncertainty gets higher due to large model space, small penalty coefficients and/or the use of data-driven penalty coefficients

Now when you're leaving a group out instead, you're performing leave-one-group-out CV, which has more separation between the training sets in the different folds and so produces more stable assessments for tuning. While it's similar to LOO (if every observation has a unique group, it's identical), in practice, assuming each group makes up a reasonable share of the data, I'm not aware of the same concerns about statistical properties. We wouldn't need to block it from tune in the same way, as I understand it.
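The distinction can be sketched in base R with toy data (all names here are illustrative, not rsample internals):

```r
# Toy data: 6 rows in 3 groups
df <- data.frame(y = 1:6, g = c("a", "a", "b", "b", "c", "c"))

# Leave-one-observation-out: one fold per row; any two analysis sets share
# all but two rows, so the folds are nearly identical to each other.
loo_folds <- lapply(seq_len(nrow(df)), function(i) {
  list(analysis = df[-i, ], assessment = df[i, , drop = FALSE])
})

# Leave-one-group-out: one fold per group; analysis sets overlap far less
# when each group holds a meaningful share of the data.
logo_folds <- lapply(unique(df$g), function(grp) {
  list(analysis = df[df$g != grp, ], assessment = df[df$g == grp, ])
})

length(loo_folds)   # 6 folds (one per observation)
length(logo_folds)  # 3 folds (one per group)
```

If every observation had a unique value of g, the two lists would contain the same folds, which is the "identical" special case mentioned above.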

@mikemahoney218 (Member)

This distinction is (I believe, I don't have insider knowledge or anything) why scikit-learn has a KFold and a GroupKFold, a ShuffleSplit and a GroupShuffleSplit, but a LeaveOneOut and a LeaveOneGroupOut.

@mattwarkentin (Contributor, Author) commented Jun 30, 2022

I completely agree on the statistical issues with standard LOO.

Following from your SciKit sampling methods, I would say rsample's current default implementation of group_vfold_cv() is actually equivalent to their "leave one group out". And in my eyes, group_loo_cv() should actually be the implementation of leave one group out.

I would also say we do not currently have a default "group kfold" because our defaults for group_vfold_cv() are actually leave one group out.

Sorry, the terminology is getting a bit confusing.

@mattwarkentin (Contributor, Author) commented Jun 30, 2022

Also, FWIW, I think it would be reasonable for the default to be v = 10 for group_vfold_cv() and still throw an error if there are not 10 unique groups, with a suggestion to choose a smaller number. The error could even report how many distinct groups were detected, to make the message informative, if we go that route. The other option is to default to v = 10 but "fall back" to v = length(unique(data$group)) with a warning/message to avoid breaking changes.
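The two options above could be sketched with a small helper (the name resolve_v and the fall_back flag are hypothetical, purely for illustration):

```r
# Hypothetical helper contrasting the two proposals: either error with the
# detected group count, or fall back to it with a warning.
resolve_v <- function(data, group, v = 10, fall_back = FALSE) {
  n_groups <- length(unique(data[[group]]))
  if (v > n_groups) {
    if (fall_back) {
      warning("`v` reduced from ", v, " to the ", n_groups,
              " distinct groups found in `", group, "`.")
      return(n_groups)
    }
    stop("`v` = ", v, " but only ", n_groups, " distinct groups were found; ",
         "choose `v` <= ", n_groups, ".")
  }
  v
}
```

Either branch surfaces the detected group count, so the user can see why the default didn't apply.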

@mikemahoney218 (Member) commented Jun 30, 2022

> Following from your SciKit sampling methods, I would say rsample's current default implementation of group_vfold_cv() is actually equivalent to their "leave one group out".

Yeah, completely agree there.

> And in my eyes, group_loo_cv() should actually be the implementation of leave one group out.

This is the one where I have a concern. I think a leave_one_group_out_cv() (or a function with a similar name) would be a good improvement. I don't think we need to block the outputs from that function from tune, which means we don't want them to pass inherits(x, "loo_cv") checks, so I think it'd be best to steer away from the name group_loo_cv(). I also think there are some issues with conceptually linking the two, since group_loo_cv() wouldn't be doing the "same thing" as loo_cv(), at least not in the way that group_mc_cv() etc. are. With every other "pair" of functions, your assessment set should be roughly the same size across the grouped and non-grouped functions, whereas that should almost never be the case with leave-one-group-out and leave-one-observation-out.

To be very clear: I'm not against a function with a more obvious name for doing leave one group out, I just think it shouldn't be called group_loo_cv().

> Sorry, the terminology is getting a bit confusing.

Agreed 😆

> The other option is to default to v = 10 but "fall back" to v = length(unique(data$group)) with a warning/message to avoid breaking changes

We actually do this in spatialsample but decided not to do it here in #293. The reasoning is that in spatialsample you often don't know what the maximum number of groups is going to be, whereas here it's always in your data -- it's either nrow(data) or length(unique(data$group)).

As with the rest of the conversation around v, I'm going to defer to Julia.

@mattwarkentin (Contributor, Author)

I like the name leave_one_group_out_cv() and would be totally on board with this, but it seems like it's only really necessary if we follow through with changing the default behaviour of group_vfold_cv() to align with vfold_cv(). Do you agree?

@mikemahoney218 (Member)

I believe we're on the same page there. I'm not making the call about changing group_vfold_cv(), though -- I think @juliasilge is better situated to decide (1) how much we want to fix the inconsistency and (2) whether that justifies (what I think needs to be) a breaking change.

@juliasilge (Member) commented Jul 1, 2022

I agree with @mikemahoney218 that we don't want to use the class loo_cv or the words/name "leave one out" for this way of doing grouped CV. 👍

I don't want to jump in to changing that default for group_vfold_cv(), although in an ideal world the two functions would be more consistent. I'll open a new issue to gather feedback and thoughts on that.

In the meantime, @mattwarkentin would you like to do a PR to the docs for group_vfold_cv() to clarify that v = NULL is also called "leave-group-out CV"? We could add something like this to the examples as well:

library(rsample)
data(ames, package = "modeldata")

## also called "leave-group-out":
group_vfold_cv(ames, group = Neighborhood)
#> # Group 28-fold cross-validation 
#> # A tibble: 28 × 2
#>    splits             id        
#>    <list>             <chr>     
#>  1 <split [2882/48]>  Resample01
#>  2 <split [2765/165]> Resample02
#>  3 <split [2886/44]>  Resample03
#>  4 <split [2748/182]> Resample04
#>  5 <split [2900/30]>  Resample05
#>  6 <split [2805/125]> Resample06
#>  7 <split [2764/166]> Resample07
#>  8 <split [2663/267]> Resample08
#>  9 <split [2799/131]> Resample09
#> 10 <split [2691/239]> Resample10
#> # … with 18 more rows

Created on 2022-07-01 by the reprex package (v2.0.1)

@mattwarkentin (Contributor, Author)

Okay, just catching up on this issue and the related issues/PRs. Everything sounds good, and it seems like we are all on the same page.

Are you still interested in a PR to update the documents in the ways described above?

@mikemahoney218 (Member)

> Are you still interested in a PR to update the documents in the ways described above?

If you have the time & interest, definitely!

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions bot locked and limited conversation to collaborators on Jul 20, 2022