get_dupes fails when called on grouping variables #329
Comments
I'll take this one.
I think that an ungrouping warning should be given just in case the user was expecting something else (like
To me this seems like a slightly separate issue: when get_dupes is run on variables that are NOT the grouping variables, the user might expect it to only consider duplicates within each group. However, Sam's issue was more about calling it on a grouping variable itself, in which case there is no need to consider the grouping at all. Of course, handling a combination of the two requires some consideration.
But to fix the failure, something like this would do the trick:
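The snippet itself was not captured in this thread; as a rough sketch of the idea being discussed (not the actual janitor code), assuming dplyr and a hypothetical helper name: ungroup before counting duplicates, then hand the original grouping back on the result.

```r
library(dplyr)

# Sketch only: drop grouping for the duplicate count, then restore it.
get_dupes_sketch <- function(dat, ...) {
  orig_groups <- groups(dat)                 # remember the incoming grouping
  dupes <- dat %>%
    ungroup() %>%                            # grouping is irrelevant to the count
    add_count(..., name = "dupe_count") %>%  # count rows sharing the given vars
    filter(dupe_count > 1) %>%
    arrange(...)
  group_by(dupes, !!!orig_groups)            # re-apply the original groups
}
```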
Ah, got it. I agree that what I mentioned would be a different issue.
Yep, I was just talking about the simple failure when it is called on a grouping variable. Let's go ahead with that. Anyhow, after thinking some, I don't know what other behavior a user might expect the grouping attribute to cause.
Hm, you are re-grouping it after
In my opinion, returning the data in the same grouping structure it arrived with makes sense. Is there a drawback to that which I'm not considering?
Hm I see, so it would ungroup for the duplicate counting -- the grouping would be meaningless in that context -- but then the groups get reassigned in case their process is to continue an analysis pipeline? That's fine with me. I don't see one approach as preferable, so let's go with what you suggest - I just want it to not fail on a grouped df 😀
Yes, that's all there is to it in my suggestion above. That fixes the error for now. I guess the outstanding question is whether the user might expect get_dupes to work on grouped data in an explicit way, i.e., if there is a duplicate but it's in another group, does it get counted as a duplicate? How do we want to handle within-group duplicates vs. across-group duplicates (see the small illustration below)? Do we want to deal with this all at once, or just fix the error for now and deal with the other part as a separate issue? Sorry for the delay, still trying to figure out how my fork / PR got so discombobulated.
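To make the within-group vs. across-group distinction concrete, a small hypothetical illustration (toy data, assuming dplyr and janitor are loaded):

```r
library(dplyr)
library(janitor)

# x = 1 appears three times overall, but only twice within group "a".
df <- tibble(
  g = c("a", "a", "b"),
  x = c(1, 1, 1)
)

# Across-group view: all three rows are duplicates on x.
df %>% get_dupes(x)

# A within-group view would only flag the two rows in group "a".
df %>%
  group_by(g) %>%
  group_modify(~ get_dupes(.x, x))
```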
Let's just fix the error for now and get it merged into janitor 2.0.0. I can't think of expectations for use on grouped data, but if folks have ideas we can keep discussing as a new issue. And if your Git situation is messed up, I would suggest the "burn it down" move, as Jenny Bryan calls it: copy the files somewhere safe, delete the repo, re-fork and proceed anew, copying the desired files back in. Can you send that this week with a test & an update to NEWS? Then we can get it merged in before this goes to CRAN. No worries if not, just wanted to be transparent about the (best-case) timeline.
Sounds good. Yes I will work on it tomorrow!
Should we include a message that appears when grouped data is provided, along the lines of: "Data is grouped. Note that get_dupes() is not group-aware and does not limit duplicate detection to within groups; it checks the entire data frame. However, the grouping structure is preserved." Feel free to suggest edits if we do want to include a message.
I went ahead and added a message that gets displayed only once per session. Happy to remove or edit if you all deem it unnecessary or clunky.
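One way to show a message only once per session (a sketch of the general pattern, not necessarily how the merged change implements it) is a flag stored in a package-local environment; the helper name below is hypothetical.

```r
# Flag environment created once at package load / script start.
.get_dupes_msg_env <- new.env(parent = emptyenv())

inform_grouped_once <- function(dat) {
  if (dplyr::is_grouped_df(dat) && !isTRUE(.get_dupes_msg_env$shown)) {
    message(
      "Data is grouped. get_dupes() checks the entire data frame for ",
      "duplicates, not within groups; the grouping structure is preserved."
    )
    .get_dupes_msg_env$shown <- TRUE  # suppress the message for the rest of the session
  }
  invisible(dat)
}
```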
Closed by #345.
A rare case, but yes, it happened to me in actual analysis this week and was slightly annoying.
I think `get_dupes()` should check whether a data.frame is grouped and, if so, ungroup it before proceeding. I don't see a meaningful way in which a user would expect grouping to have a practical effect, and anyway it doesn't, so they'd be misled. `mtcars %>% group_by(cyl) %>% get_dupes(mpg)` is okay; I think the grouping is meaningless there. In short, `get_dupes()` is an interactive function and it shouldn't choke for this unrelated reason.
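Until a fix like the one merged in #345 is available in your installed version, dropping the grouping before the call works around the failure described above (a usage sketch, not taken from the thread):

```r
library(dplyr)
library(janitor)

mtcars %>%
  group_by(cyl) %>%   # data often arrives grouped from earlier in a pipeline
  ungroup() %>%       # remove grouping so get_dupes() doesn't choke
  get_dupes(mpg) %>%  # duplicates on mpg across the whole data frame
  group_by(cyl)       # re-apply the grouping if the pipeline needs it
```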