-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-16038: [R] different behavior from dplyr when mutate's .keep
option is set
#12818
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for digging into this. The behavior + history on the dplyr side is a lot. I have a few relatively minor comments and suggestions
r/R/dplyr-mutate.R
Outdated
## keeping with: https://github.com/tidyverse/dplyr/issues/6086 | ||
cur_exprs <- vapply(dots, as_label, character(1)) | ||
new_col_names <- names(dots) | ||
transmute_order <- ifelse(nzchar(new_col_names), new_col_names, cur_exprs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having a bit of trouble following what this line is doing here — what behavior are we trying to catch here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very open to other approaches but the dplyr implementation of transmute
return the columns in the order they are given in the function call which is now distinct from mutate(..., .keep = "none")
. So this was what i came up with to preserve that order. But there possibly could be a better way to capture order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nods sorry I should have been a bit more specific — I didn't totally follow what might cause the names of new_col_names
to be empty (I imagine that's what happens with any current columns at this point, given this ifelse()
: so here cur_exprs
has names if it's a new column, but does not if it doesn't, yeah?), but that isn't super super obvious on first read. And actually, now that I'm looking at it, maybe the intention is to replace any thing in dots
that doesn't have a name with what's in cur_expers
?
Would it be possible to do something like:
cur_exprs <- map_chr(dots, as_label)
transmute_order <- names(dots)
transmute_order[nzchar(transmute_order)] <- cur_exprs[nzchar(transmute_order)]
or even something like purrr::modify
or purrr::modify_if
? https://purrr.tidyverse.org/reference/modify.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe the intention is to replace any thing in dots that doesn't have a name with what's in cur_expers
Yes you are exactly right. I just realized that because map_chr
returns a named vector I can equally get the names from names(cur_exprs)
. Also coalesce
might work here too. What about this?
cur_exprs <- map_chr(dots, as_label)
transmute_order <- dplyr::coalesce(na_if(names(cur_exprs), ""), cur_exprs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dplyr::coalesce
works here (and since this is a dplyr function, we can be pretty sure dplyr is already installed), but generally we try to shy away from using dplyr as a dependency.
So I would say let's get the names from cur_exprs
but then do the more base transmute_order[is.na(names(...))] <- ...
assignment style. IMO it also is slightly more explicit about what we're doing (replacing anything that doesn't have a name with a name-like representation of itself)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fantastic! Thanks for this
Benchmark runs are scheduled for baseline = f0b5c49 and contender = b829943. b829943 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
…option is set This PR does two things to match some dplyr behaviour around column order: 1) Mimics dplyr implementation of `mutate(..., .keep = "none")` to append new columns after the existing columns (if suggested) as [per](tidyverse/dplyr#6086) 2) As per this [discussion](tidyverse/dplyr#6086), this required a bespoke approach to `transmute` as it not simply a wrapper for `mutate(..., .keep = "none")`. This cascades into needing to catch a couple edge cases. I have also added some tests which will test for this behaviour. Closes apache#12818 from boshek/mutate-keep Authored-by: SAm Albers <[email protected]> Signed-off-by: Jonathan Keane <[email protected]>
This PR does two things to match some dplyr behaviour around column order:
Mimics dplyr implementation of
mutate(..., .keep = "none")
to append new columns after the existing columns (if suggested) as perAs per this discussion, this required a bespoke approach to
transmute
as it not simply a wrapper formutate(..., .keep = "none")
. This cascades into needing to catch a couple edge cases.I have also added some tests which will test for this behaviour.