Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add group_initial_split() and group_validation_split() #315

Merged
merged 4 commits into from
Jun 29, 2022

Conversation

mikemahoney218
Copy link
Member

This PR adds group_initial_split() and group_validation_split(). These are documented as part of initial_split() and validation_split(), and don't have their own classes, in order to match the time-based methods. We might consider doing the same for group_vfold_cv() and group_mc_cv() (or, possibly changing these to use their own classes).

It also fixes a bug where group_mc_cv(..., times = 1) would never produce an assessment set, because all data would be assigned to the same fold.

library(rsample)

set.seed(2)
dat1 <- data.frame(a = 1:20, b = letters[1:20], c = rep(1:4, 5))

group_initial_split(dat1, c) |> 
  testing()
#>     a b c
#> 1   1 a 1
#> 5   5 e 1
#> 9   9 i 1
#> 13 13 m 1
#> 17 17 q 1

group_initial_split(dat1, c) |> 
  testing()
#>     a b c
#> 2   2 b 2
#> 6   6 f 2
#> 10 10 j 2
#> 14 14 n 2
#> 18 18 r 2

group_validation_split(dat1, c)$splits[[1]] |> 
  assessment()
#>     a b c
#> 4   4 d 4
#> 8   8 h 4
#> 12 12 l 4
#> 16 16 p 4
#> 20 20 t 4

group_validation_split(dat1, c)$splits[[1]] |> 
  assessment()
#>     a b c
#> 3   3 c 3
#> 7   7 g 3
#> 11 11 k 3
#> 15 15 o 3
#> 19 19 s 3

Created on 2022-06-28 by the reprex package (v2.0.1)

@mikemahoney218 mikemahoney218 marked this pull request as ready for review June 28, 2022 18:54
@juliasilge
Copy link
Member

juliasilge commented Jun 29, 2022

Can you open another issue to outline the inconsistency in the classes? And walk through which ones could be addressed?

The way names have ended up is a bit of a bummer. What is the best option?

  • initial_split() + initial_time_split() + group_initial_split()? This one, not creating an rset but a split, I could see making consistent and all starting with initial_*().
  • validation_split() + validation_time_split() + group_validation_split()? For this one, it definitely makes more sense to have the new function start with group_*() because it belongs with all the other group_*() rset functions. Should we change the name of validation_time_split()? Or just leave it?

@mikemahoney218
Copy link
Member Author

Opened #318 for classes.

As for names: I personally don't hate that the names are inconsistent for these two. If we had more time-based functions, I think I'd rather prefix with time_ (so change validation_time_split() so it matches sliding_*, group_*, int_*), but I don't know that it's necessary here (especially since these aren't really time-based, so much as "number of rows" based). Not the strongest opinion I've ever had, but I think I would rather the initial_split/validation_split derivatives be inconsistent, instead of the group_* derivatives.

@juliasilge
Copy link
Member

OK, sounds good on the naming! 👍

Copy link
Member

@juliasilge juliasilge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So great! 🚀

@github-actions
Copy link

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 14, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants