
get_dupes fails when called on grouping variables #329

Closed
sfirke opened this issue Feb 2, 2020 · 16 comments

@sfirke (Owner) commented Feb 2, 2020

A rare case, but yes, it happened to me in actual analysis this week and was slightly annoying.

> mtcars %>% group_by(cyl) %>% get_dupes(cyl)
Error: Column `cyl` can't be modified because it's a grouping variable

I think get_dupes() should check whether the data.frame is grouped and, if so, ungroup it before proceeding. I don't see a meaningful way in which a user would expect grouping to have a practical effect here, and in any case it doesn't, so they'd only be misled.

mtcars %>% group_by(cyl) %>% get_dupes(mpg) is okay; I think the grouping there is meaningless. In short, get_dupes() is an interactive function and it shouldn't choke for this unrelated reason.
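
In the meantime, a minimal workaround (just a sketch, assuming you don't need the grouping afterwards) is to drop the groups yourself before calling get_dupes():

library(dplyr)
library(janitor)

# dropping the grouping first avoids the error; regroup afterwards if needed
mtcars %>%
  group_by(cyl) %>%
  ungroup() %>%
  get_dupes(cyl)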

@jzadra (Contributor) commented Feb 20, 2020

I'll take this one.

@billdenney (Collaborator)

I think that an ungrouping warning should be given just in case the user was expecting something else (like get_dupes() to be run per group level).

@jzadra (Contributor) commented Feb 20, 2020

> I think that an ungrouping warning should be given just in case the user was expecting something else (like get_dupes() to be run per group level).

To me this seems like a slightly separate issue: when get_dupes() is called on variables that are NOT the grouping variables, the user might expect it to only look for duplicates within each group. Sam's issue, however, was about calling it on a grouping variable itself, in which case the grouping is irrelevant. Of course, a combination of the two requires some consideration.
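
For concreteness, here is a rough sketch of the two interpretations being distinguished (plain dplyr, not janitor internals):

library(dplyr)

# duplicates counted over the whole data frame (what get_dupes() does today)
mtcars %>%
  add_count(mpg, name = "dupe_count") %>%
  filter(dupe_count > 1)

# duplicates counted within each cyl group (what a group-aware version might do)
mtcars %>%
  group_by(cyl) %>%
  add_count(mpg, name = "dupe_count") %>%
  filter(dupe_count > 1)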

@jzadra (Contributor) commented Feb 20, 2020

But to fix the failure, something like this would do the trick:

library(dplyr)

x <- mtcars %>% group_by(cyl, gear)

# remember any grouping, then drop it so the duplicate check can run
xgroups <- character(0)
if (is_grouped_df(x)) {
  xgroups <- group_vars(x)
  x <- ungroup(x)
}

## standard get_dupes() logic runs here on the ungrouped data ##

# restore the original grouping on the result
if (length(xgroups) > 0) {
  x <- x %>% group_by(!!!rlang::syms(xgroups))
}
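
With that in place, the grouping should round-trip; hypothetically (this is the expected post-fix behavior, not current output):

library(dplyr)
library(janitor)

mtcars %>%
  group_by(cyl) %>%
  get_dupes(cyl) %>%
  group_vars()
#> [1] "cyl"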

@billdenney (Collaborator)

Ah, got it. I agree that what I mentioned would be a different issue.

@sfirke (Owner, Author) commented Feb 20, 2020 via email

@sfirke (Owner, Author) commented Feb 20, 2020

Hm, you are re-grouping it after get_dupes() runs? What purpose do you see in having the result be grouped? I figured it could be even simpler: if the input is grouped, ungroup it and run as-is. I can't see a scenario where you'd expect the result of get_dupes() to be grouped, and it feels more consistent to me for it not to be grouped regardless of the input.

@sfirke (Owner, Author) commented Feb 20, 2020

Similar to #125 and #97, which were fixed with a simple ungroup().

@jzadra (Contributor) commented Feb 20, 2020

In my opinion, with tabyl functions it makes sense to drop the grouping because the data is getting collapsed and there are no further operations the user would do on the tabyl that involve the grouping structure (same as how the summarize() family drops a level of grouping due to collapsing).

For get_dupes(), however, it seems to me that it would be better to honor the user's desire to have their data grouped: you put grouped in, you get grouped out. Otherwise, having a grouped data frame suddenly become ungrouped, when that isn't an explicit request and when it can be avoided, would be unexpected behavior. They may want to get the dupes and then continue with operations that they expect to operate within the grouping structure.

Is there a drawback to returning the data in the same grouping structure that I'm not considering?

@sfirke (Owner, Author) commented Mar 7, 2020

Hm, I see: it would ungroup for the duplicate counting (the grouping is meaningless in that context) and then reassign the groups in case the user's process is to continue an analysis pipeline? That's fine with me. I don't see one approach as preferable, so let's go with what you suggest; I just want it to not fail on a grouped df 😀

@sfirke sfirke added this to the v2.0 milestone Mar 7, 2020
@jzadra (Contributor) commented Mar 10, 2020

Yes, that's all there is to it in my suggestion above. That fixes the error for now.

I guess the outstanding question is whether the user might expect get_dupes() to work on grouped data in an explicit way, i.e., if there is a duplicate but it's in another group, does it get counted as a duplicate? How do we want to handle within-group duplicates vs. across-group duplicates?

Do we want to deal with this all at once, or just fix the error for now and handle the other part as a separate issue? Sorry for the delay; I'm still trying to figure out how my fork / PR got so discombobulated.
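
(For what it's worth, a user who wants the within-group behavior today can ask for it explicitly; this is just a sketch with group_modify(), not a proposal for get_dupes() itself:)

library(dplyr)
library(janitor)

# look for duplicate mpg values separately within each cyl group
mtcars %>%
  group_by(cyl) %>%
  group_modify(~ get_dupes(.x, mpg))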

@sfirke (Owner, Author) commented Mar 12, 2020

Let's just fix the error for now and get it merged into janitor 2.0.0. I can't think of other expectations for use on grouped data, but if folks have ideas we can keep discussing in a new issue.

And if your Git situation is messed up, I would suggest the "burn it down" move, as Jenny Bryan calls it: copy the files somewhere safe, delete the repo, re-fork, and proceed anew, copying the desired files back in.

Can you send that this week with a test and an update to NEWS? Then we can get it merged before this goes to CRAN. No worries if not; I just wanted to be transparent about the best-case timeline.

@jzadra (Contributor) commented Mar 12, 2020 via email

@jzadra (Contributor) commented Mar 12, 2020

Should we include a message that appears when grouped data is provided, along the lines of:

"The data is grouped. Note that get_dupes() is not group-aware and does not limit duplicate detection to within groups; it checks the entire data frame. However, the grouping structure is preserved."

Feel free to suggest edits if we do want to include a message.

@jzadra (Contributor) commented Mar 12, 2020

I went ahead and added a message that gets displayed only once per session. Happy to remove or edit if you all deem it unnecessary or clunky.
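
(For anyone curious, a once-per-session note can lean on rlang's message frequency controls; this is just a sketch of the general approach, not necessarily the exact code in the PR:)

# shown at most once per session; the wording and id here are placeholders
rlang::inform(
  paste(
    "get_dupes() is not group-aware: duplicates are detected across the",
    "entire data frame, but the grouping structure is preserved."
  ),
  .frequency = "once",
  .frequency_id = "janitor_get_dupes_grouped"
)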

@sfirke (Owner, Author) commented Mar 16, 2020

Closed by #345.

@sfirke sfirke closed this as completed Mar 16, 2020