
group_loo_cv() and group_vfold_cv() #324

Closed
mattwarkentin opened this issue Jun 30, 2022 · 13 comments · Fixed by #339
@mattwarkentin (Contributor) commented Jun 30, 2022

Hi @mikemahoney218 and @juliasilge,

Super excited for all of the great work @mikemahoney218 has been doing to boost {rsample} group sampling support.

I was thinking about this last night... it seems to me that the current default implementation of group_vfold_cv() is really the group_* counterpart to loo_cv(). With v = NULL (the default for group_vfold_cv()), you get grouped LOO-CV sampling. Over time this may be confusing, as users will grow to expect the group_* version of a function to return a sampling pattern similar to the default for its non-grouped sibling.

I am wondering if it makes sense to make the following changes to achieve a logically consistent API for grouped and non-grouped pairs:

  • Create group_loo_cv() based on the current default implementation of group_vfold_cv(v = NULL)

With no changes to group_vfold_cv(), this could be as simple as:

group_loo_cv <- function(data, group, ...) {
  # leave-one-group-out is the current default behaviour of group_vfold_cv()
  group_vfold_cv(data, group, v = NULL, ...)
}

That way users will find a group_* implementation of all relevant standard sampling methods, which I think helps the mental model for the grouped versions.

  • However, I wonder if it's worth changing the default value of v for group_vfold_cv() to match that of vfold_cv() (i.e. v = 10), and also adding a repeats = 1 argument. I think over time people will come to expect that the group_* implementation is otherwise equivalent to the non-grouped version, except for the grouping. For group_vfold_cv() this currently isn't the case: the default implementation actually returns grouped loo_cv(). I realize LOO is really a special case of standard v-fold CV and users can opt in to vfold_cv()-like sampling, but it's not currently the default, which may cause mental friction.

I just think for consistency it makes sense for group_vfold_cv() to have a signature like:

group_vfold_cv <- function(data, group, v = 10, repeats = 1, ...) {
  ...
}

This would be a behavioural change that could emit a warning/message for a deprecation period before the message is eventually removed. I just like the idea of group_* functions being otherwise equivalent sampling schemes to their non-grouped counterparts.
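A minimal sketch of what such a deprecation period could look like, assuming a missing()-based check (the function name and body are illustrative stand-ins, not rsample's actual code):

```r
# Hypothetical migration shim, not rsample's implementation: warn when the
# caller relies on the old default, then proceed with the new v = 10 default.
group_vfold_cv_proposed <- function(data, group, v = 10, repeats = 1, ...) {
  if (missing(v)) {
    message(
      "The default `v` is now 10 (was NULL, i.e. one fold per group). ",
      "Pass `v = NULL` explicitly to keep leave-one-group-out sampling."
    )
  }
  list(v = v, repeats = repeats)  # stand-in for the real resampling logic
}
```

After the deprecation window, the missing() branch would simply be deleted.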

Thoughts?

@mikemahoney218 (Member) commented Jun 30, 2022

I personally would be concerned about group_loo_cv() performing a different type of resampling than loo_cv(), and wouldn't want those functions to have similar names.

@juliasilge can say more, but I think a big part of why loo_cv() exists is to make it hard to use leave-one-out in things like tuning (see for instance the expected error in tune). That's because of continuing debate on the statistical properties of LOO.

Leave-one-group-out doesn't necessarily have the same concerns, and so doesn't need to subclass loo_cv like the other methods subclass their namesakes (#318). This method also has a lot of different names depending on what "group" is (for instance, leave-location-out, leave-time-out, leave-location-and-time-out). We call it spatial_leave_location_out_cv() in spatialsample because that's a domain-focused package, but maybe in rsample something along the lines of leave_one_group_out() would make sense? That breaks the naming scheme of group_*(), which I don't love, but distances the function from loo_cv(), which I think is necessary.

I'm going to defer to @juliasilge entirely on v -- I agree that the inconsistency is bad, but I don't know that I'd personally want to break anyone's code in order to fix it. The other caveat here is that v > length(unique(data$group)) throws an error, so if you're trying to perform resampling with fewer than 10 groups in your data the default arguments wouldn't work.

I like the idea of adding repeats.

Thanks for the really thoughtful & clear issue!

@mattwarkentin (Contributor, Author)

Thanks for the detailed response.

> I personally would be concerned about group_loo_cv() performing a different type of resampling than loo_cv(), and wouldn't want those functions to have similar names.

Could you elaborate on this? I am unclear on the different type of resampling you are referring to.

@mikemahoney218 (Member)

So loo_cv() is performing leave-one-observation-out CV, which has some challenges because all of your training sets are going to be nearly 100% correlated with each other. This means it's not great to use LOO for tuning -- quoting from this paper (which is quoted here):

> However, if model selection is involved, the performance of LOO worsens in variability as the model selection uncertainty gets higher due to large model space, small penalty coefficients and/or the use of data-driven penalty coefficients

Now when you're leaving a group out instead, you're performing leave-one-group-out CV, which has more separation between the training sets in the different folds and so produces more stable assessments for tuning. While it's similar to LOO (if every observation has a unique group, it's identical), in practice, assuming each group makes up a reasonable share of the data, I'm not aware of the same concerns about statistical properties. We wouldn't need to block it from tune in the same way, as I understand it.
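The distinction can be sketched in base R with toy data (all names here are illustrative, not rsample internals):

```r
# Toy data: 6 rows in 3 groups
df <- data.frame(y = 1:6, g = c("a", "a", "b", "b", "c", "c"))

# Leave-one-observation-out: one fold per row; any two analysis sets share
# all but two rows, so the folds are nearly identical to each other.
loo_folds <- lapply(seq_len(nrow(df)), function(i) {
  list(analysis = df[-i, ], assessment = df[i, , drop = FALSE])
})

# Leave-one-group-out: one fold per group; analysis sets overlap far less
# when each group holds a meaningful share of the data.
logo_folds <- lapply(unique(df$g), function(grp) {
  list(analysis = df[df$g != grp, ], assessment = df[df$g == grp, ])
})

length(loo_folds)   # 6 folds (one per observation)
length(logo_folds)  # 3 folds (one per group)
```

If every observation had a unique value of g, the two lists would contain the same folds, which is the "identical" special case mentioned above.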

@mikemahoney218 (Member)

This distinction is (I believe, I don't have insider knowledge or anything) why scikit-learn has a KFold and a GroupKFold, a ShuffleSplit and a GroupShuffleSplit, but a LeaveOneOut and a LeaveOneGroupOut.

@mattwarkentin (Contributor, Author) commented Jun 30, 2022

I completely agree on the statistical issues with standard LOO.

Following from your SciKit sampling methods, I would say rsample's current default implementation of group_vfold_cv() is actually equivalent to their "leave one group out". And in my eyes, group_loo_cv() should actually be the implementation of leave one group out.

I would also say we do not currently have a default "group kfold" because our defaults for group_vfold_cv() are actually leave one group out.

Sorry, the terminology is getting a bit confusing.

@mattwarkentin (Contributor, Author) commented Jun 30, 2022

Also, FWIW, I think it would be reasonable for the default to be v = 10 for group_vfold_cv() and still throw an error if there are not 10 unique groups, with a suggestion to choose a smaller number. The error could even report how many distinct groups were detected, to make the message informative, if we go that route. The other option is to default to v = 10 but "fall back" to v = length(unique(data$group)) with a warning/message to avoid breaking changes.
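The two options above could be sketched with a small helper (the name resolve_v and the fall_back flag are hypothetical, purely for illustration):

```r
# Hypothetical helper contrasting the two proposals: either error with the
# detected group count, or fall back to it with a warning.
resolve_v <- function(data, group, v = 10, fall_back = FALSE) {
  n_groups <- length(unique(data[[group]]))
  if (v > n_groups) {
    if (fall_back) {
      warning("`v` reduced from ", v, " to the ", n_groups,
              " distinct groups found in `", group, "`.")
      return(n_groups)
    }
    stop("`v` = ", v, " but only ", n_groups, " distinct groups were found; ",
         "choose `v` <= ", n_groups, ".")
  }
  v
}
```

Either branch surfaces the detected group count, so the user can see why the default didn't apply.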

@mikemahoney218 (Member) commented Jun 30, 2022

> Following from your SciKit sampling methods, I would say rsample's current default implementation of group_vfold_cv() is actually equivalent to their "leave one group out".

Yeah, completely agree there.

> And in my eyes, group_loo_cv() should actually be the implementation of leave one group out.

This is the one where I have a concern. I think a leave_one_group_out_cv() (or a function with a similar name) would be a good improvement. I don't think we need to block the outputs from that function from tune, which means we don't want them to pass inherits(x, "loo_cv") checks, so I think it'd be best to steer away from the name group_loo_cv(). I also think there are some issues with conceptually linking the two, since group_loo_cv() wouldn't be doing the "same thing" as loo_cv(), at least not in the way that group_mc_cv() etc. are. With every other "pair" of functions, your assessment set should be roughly the same size across the grouped and non-grouped functions, whereas that should almost never be the case with leave-one-group-out and leave-one-observation-out.

To be very clear: I'm not against a function with a more obvious name for doing leave one group out, I just think it shouldn't be called group_loo_cv().

> Sorry, the terminology is getting a bit confusing.

Agreed 😆

> The other option is to default to v = 10 but "fall back" to v = length(unique(data$group)) with a warning/message to avoid breaking changes

We actually do this in spatialsample but decided not to do it here in #293. The reasoning is that in spatialsample you often don't know what the maximum number of groups is going to be, whereas here it's always in your data -- it's either nrow(data) or length(unique(data$group)).

As with the rest of the conversation around v, I'm going to defer to Julia.

@mattwarkentin (Contributor, Author)

I like the name leave_one_group_out_cv() and would be totally on board with this, but it seems like it's only really necessary if we follow through with changing the default behaviour of group_vfold_cv() to align with vfold_cv(). Do you agree?

@mikemahoney218 (Member)

I believe we're on the same page there. I'm not making the call about changing group_vfold_cv(), though -- I think @juliasilge is better situated to decide (1) how much we want to fix the inconsistency and (2) whether that justifies (what I think needs to be) a breaking change.

@juliasilge (Member) commented Jul 1, 2022

I agree with @mikemahoney218 that we don't want to use the class loo_cv or the words/name "leave one out" for this way of doing grouped CV. 👍

I don't want to jump in to changing that default for group_vfold_cv(), although in an ideal world the two functions would be more consistent. I'll open a new issue to gather feedback and thoughts on that.

In the meantime, @mattwarkentin would you like to do a PR to the docs for group_vfold_cv() to clarify that v = NULL is also called "leave-group-out CV"? We could add something like this to the examples as well:

library(rsample)
data(ames, package = "modeldata")

## also called "leave-group-out":
group_vfold_cv(ames, group = Neighborhood)
#> # Group 28-fold cross-validation 
#> # A tibble: 28 × 2
#>    splits             id        
#>    <list>             <chr>     
#>  1 <split [2882/48]>  Resample01
#>  2 <split [2765/165]> Resample02
#>  3 <split [2886/44]>  Resample03
#>  4 <split [2748/182]> Resample04
#>  5 <split [2900/30]>  Resample05
#>  6 <split [2805/125]> Resample06
#>  7 <split [2764/166]> Resample07
#>  8 <split [2663/267]> Resample08
#>  9 <split [2799/131]> Resample09
#> 10 <split [2691/239]> Resample10
#> # … with 18 more rows

Created on 2022-07-01 by the reprex package (v2.0.1)

@mattwarkentin (Contributor, Author)

Okay, just catching up on this issue and the related issues/PRs. Everything sounds good, and it seems like we are all on the same page.

Are you still interested in a PR to update the documents in the ways described above?

@mikemahoney218 (Member)

> Are you still interested in a PR to update the documents in the ways described above?

If you have the time & interest, definitely!

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions bot locked and limited conversation to collaborators on Jul 20, 2022