
get_dupes fails when called on grouping variables #329

Closed
sfirke opened this issue Feb 2, 2020 · 16 comments

@sfirke (Owner) commented Feb 2, 2020

A rare case, but yes, it happened to me in actual analysis this week and was slightly annoying.

> mtcars %>% group_by(cyl) %>% get_dupes(cyl)
Error: Column `cyl` can't be modified because it's a grouping variable

I think get_dupes() should check whether the data.frame is grouped and, if so, ungroup it before proceeding. I don't see a meaningful way in which a user would expect grouping to have a practical effect here, and in any case it doesn't, so they'd only be misled.

mtcars %>% group_by(cyl) %>% get_dupes(mpg) is okay; I think the grouping there is meaningless. In short, get_dupes() is an interactive function and it shouldn't choke for this unrelated reason.
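
In the meantime, a minimal workaround (just a sketch, assuming you don't need the grouping afterwards) is to drop the groups yourself before calling get_dupes():

library(dplyr)
library(janitor)

# dropping the grouping first avoids the error; regroup afterwards if needed
mtcars %>%
  group_by(cyl) %>%
  ungroup() %>%
  get_dupes(cyl)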

@jzadra (Contributor) commented Feb 20, 2020

I'll take this one.

@billdenney (Collaborator)

I think that an ungrouping warning should be given just in case the user was expecting something else (like get_dupes() to be run per group level).

@jzadra (Contributor) commented Feb 20, 2020

> I think that an ungrouping warning should be given just in case the user was expecting something else (like get_dupes() to be run per group level).

To me this seems like a slightly separate issue: when get_dupes() is called on variables that are NOT the grouping variables, the user might expect it to only look for duplicates within each group. Sam's issue, however, was about calling it on a grouping variable itself, in which case the grouping is irrelevant. Of course, a combination of the two requires some consideration.
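
For concreteness, here is a rough sketch of the two interpretations being distinguished (plain dplyr, not janitor internals):

library(dplyr)

# duplicates counted over the whole data frame (what get_dupes() does today)
mtcars %>%
  add_count(mpg, name = "dupe_count") %>%
  filter(dupe_count > 1)

# duplicates counted within each cyl group (what a group-aware version might do)
mtcars %>%
  group_by(cyl) %>%
  add_count(mpg, name = "dupe_count") %>%
  filter(dupe_count > 1)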

@jzadra (Contributor) commented Feb 20, 2020

But to fix the failure, something like this would do the trick:

library(dplyr)

x <- mtcars %>% group_by(cyl, gear)

# remember any grouping, then drop it so the duplicate check can run
xgroups <- character(0)
if (is_grouped_df(x)) {
  xgroups <- group_vars(x)
  x <- ungroup(x)
}

## standard get_dupes() logic runs here on the ungrouped data ##

# restore the original grouping on the result
if (length(xgroups) > 0) {
  x <- x %>% group_by(!!!rlang::syms(xgroups))
}
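
With that in place, the grouping should round-trip; hypothetically (this is the expected post-fix behavior, not current output):

library(dplyr)
library(janitor)

mtcars %>%
  group_by(cyl) %>%
  get_dupes(cyl) %>%
  group_vars()
#> [1] "cyl"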

@billdenney (Collaborator)

Ah, got it. I agree that what I mentioned would be a different issue.

@sfirke (Owner, Author) commented Feb 20, 2020 via email

@sfirke (Owner, Author) commented Feb 20, 2020

Hm, you are re-grouping it after get_dupes() runs? What purpose do you see in having the result be grouped? I figured it could be even simpler: if the input is grouped, ungroup it and run as-is. I can't see a scenario where you'd expect the result of get_dupes() to be grouped, and it feels more consistent to me for it not to be grouped regardless of the input.

@sfirke (Owner, Author) commented Feb 20, 2020

Similar to #125 and #97, which were fixed with a simple ungroup().

@jzadra (Contributor) commented Feb 20, 2020

In my opinion, with tabyl functions it makes sense to drop the grouping because the data is getting collapsed and there are no further operations the user would do on the tabyl that involve the grouping structure (same as how the summarize() family drops a level of grouping due to collapsing).

For get_dupes(), however, it seems to me that it would be better to honor the user's desire to have their data grouped: you put grouped in, you get grouped out. Otherwise, having a grouped data frame suddenly become ungrouped, when that isn't an explicit request and when it can be avoided, would be unexpected behavior. They may want to get the dupes and then continue with operations that they expect to operate within the grouping structure.

Is there a drawback to returning the data in the same grouping structure that I'm not considering?

@sfirke (Owner, Author) commented Mar 7, 2020

Hm, I see: it would ungroup for the duplicate counting (the grouping is meaningless in that context) and then reassign the groups in case the user's process is to continue an analysis pipeline? That's fine with me. I don't see one approach as preferable, so let's go with what you suggest; I just want it to not fail on a grouped df 😀

@sfirke sfirke added this to the v2.0 milestone Mar 7, 2020
@jzadra (Contributor) commented Mar 10, 2020

Yes, that's all there is to it in my suggestion above. That fixes the error for now.

I guess the outstanding question is whether the user might expect get_dupes() to work on grouped data in an explicit way, i.e., if there is a duplicate but it's in another group, does it get counted as a duplicate? How do we want to handle within-group duplicates vs. across-group duplicates?

Do we want to deal with this all at once, or just fix the error for now and handle the other part as a separate issue? Sorry for the delay; I'm still trying to figure out how my fork / PR got so discombobulated.
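
(For what it's worth, a user who wants the within-group behavior today can ask for it explicitly; this is just a sketch with group_modify(), not a proposal for get_dupes() itself:)

library(dplyr)
library(janitor)

# look for duplicate mpg values separately within each cyl group
mtcars %>%
  group_by(cyl) %>%
  group_modify(~ get_dupes(.x, mpg))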

@sfirke (Owner, Author) commented Mar 12, 2020

Let's just fix the error for now and get it merged into janitor 2.0.0. I can't think of other expectations for use on grouped data, but if folks have ideas we can keep discussing in a new issue.

And if your Git situation is messed up, I would suggest the "burn it down" move, as Jenny Bryan calls it: copy the files somewhere safe, delete the repo, re-fork, and proceed anew, copying the desired files back in.

Can you send that this week with a test and an update to NEWS? Then we can get it merged before this goes to CRAN. No worries if not; I just wanted to be transparent about the best-case timeline.

@jzadra (Contributor) commented Mar 12, 2020 via email

@jzadra (Contributor) commented Mar 12, 2020

Should we include a message that appears when grouped data is provided, along the lines of:

"The data is grouped. Note that get_dupes() is not group-aware and does not limit duplicate detection to within groups; it checks the entire data frame. However, the grouping structure is preserved."

Feel free to suggest edits if we do want to include a message.

@jzadra (Contributor) commented Mar 12, 2020

I went ahead and added a message that gets displayed only once per session. Happy to remove or edit if you all deem it unnecessary or clunky.
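
(For anyone curious, a once-per-session note can lean on rlang's message frequency controls; this is just a sketch of the general approach, not necessarily the exact code in the PR:)

# shown at most once per session; the wording and id here are placeholders
rlang::inform(
  paste(
    "get_dupes() is not group-aware: duplicates are detected across the",
    "entire data frame, but the grouping structure is preserved."
  ),
  .frequency = "once",
  .frequency_id = "janitor_get_dupes_grouped"
)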

@sfirke (Owner, Author) commented Mar 16, 2020

Closed by #345.

@sfirke sfirke closed this as completed Mar 16, 2020