ARROW-16038: [R] different behavior from dplyr when mutate's `.keep` option is set #12818

boshek · 2022-04-06T20:44:47Z

This PR does two things to match some dplyr behaviour around column order:

1) Mimics the dplyr implementation of mutate(..., .keep = "none"), which appends new columns after the existing columns (if suggested), as per tidyverse/dplyr#6086.
2) As per the discussion in tidyverse/dplyr#6086, this required a bespoke approach to transmute, as it is not simply a wrapper for mutate(..., .keep = "none"). This cascades into needing to catch a couple of edge cases.

I have also added some tests which will test for this behaviour.
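
For readers unfamiliar with the ordering difference described above, here is a minimal base-R sketch of the two rules. It models column names only and is not the arrow or dplyr implementation. The vectors `existing_cols` and `assigned_cols` are hypothetical, and grouping columns are ignored.

```r
# Hypothetical inputs: only the ordering rule is modelled, no data is touched.
existing_cols <- c("x", "y", "z")   # columns already in the data
assigned_cols <- c("new", "y")      # columns named in the call, in call order

# transmute(): output follows the order of the expressions in the call.
transmute_order <- assigned_cols

# mutate(.keep = "none"): retained existing columns stay in their original
# relative order, and genuinely new columns are appended after them.
keep_none_order <- c(
  intersect(existing_cols, assigned_cols),
  setdiff(assigned_cols, existing_cols)
)

print(transmute_order)
print(keep_none_order)
```

Under these assumptions the same call yields a different column order depending on which verb is used, which is why transmute cannot simply delegate to mutate(..., .keep = "none").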

github-actions · 2022-04-06T20:45:09Z

https://issues.apache.org/jira/browse/ARROW-16038

github-actions · 2022-04-06T20:45:10Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

jonkeane

Thanks for digging into this. The behavior + history on the dplyr side is a lot. I have a few relatively minor comments and suggestions.

jonkeane · 2022-04-06T21:37:37Z

r/R/dplyr-mutate.R

+  ## keeping with: https://github.com/tidyverse/dplyr/issues/6086
+  cur_exprs <- vapply(dots, as_label, character(1))
+  new_col_names <- names(dots)
+  transmute_order <- ifelse(nzchar(new_col_names), new_col_names, cur_exprs)


I'm having a bit of trouble following what this line is doing here — what behavior are we trying to catch here?

Very open to other approaches, but the dplyr implementation of transmute returns the columns in the order they are given in the function call, which is now distinct from mutate(..., .keep = "none"). So this was what I came up with to preserve that order. But there could possibly be a better way to capture the order?

(nods) Sorry, I should have been a bit more specific. I didn't totally follow what might cause the entries of new_col_names to be empty. I imagine that's what happens with any current columns at this point, given this ifelse(): an entry has a name if it's a new column, but does not otherwise, yeah? That isn't super obvious on first read, though. And actually, now that I'm looking at it, maybe the intention is to replace anything in dots that doesn't have a name with what's in cur_exprs?

Would it be possible to do something like:

cur_exprs <- map_chr(dots, as_label)
transmute_order <- names(dots)
transmute_order[!nzchar(transmute_order)] <- cur_exprs[!nzchar(transmute_order)]

or even something like purrr::modify or purrr::modify_if? https://purrr.tidyverse.org/reference/modify.html

> maybe the intention is to replace anything in dots that doesn't have a name with what's in cur_exprs

Yes you are exactly right. I just realized that because map_chr returns a named vector I can equally get the names from names(cur_exprs). Also coalesce might work here too. What about this?

cur_exprs <- map_chr(dots, as_label)
transmute_order <- dplyr::coalesce(na_if(names(cur_exprs), ""), cur_exprs)

dplyr::coalesce works here (and since this is a dplyr function, we can be pretty sure dplyr is already installed), but generally we try to shy away from using dplyr as a dependency.

So I would say let's get the names from cur_exprs but then do the more base transmute_order[is.na(names(...))] <- ... assignment style. IMO it also is slightly more explicit about what we're doing (replacing anything that doesn't have a name with a name-like representation of itself)
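
A minimal sketch of the base-style assignment suggested above, written without rlang or purrr so it runs in plain R. The `dots` list here is a hypothetical stand-in for the captured expressions, and `deparse()` is used as a rough approximation of as_label(); the code merged in the PR may differ.

```r
# Hypothetical stand-in for the expressions captured from `...`:
# two named entries and one unnamed entry.
dots <- list(a = quote(x + 1), quote(y), b = quote(z * 2))

# Rough base-R approximation of as_label(): turn each expression into a string.
# vapply() carries over the names of `dots`, so unnamed entries get "".
cur_exprs <- vapply(
  dots,
  function(e) paste(deparse(e), collapse = ""),
  character(1)
)

# Start from the names; guard against a list with no names at all.
transmute_order <- names(cur_exprs)
if (is.null(transmute_order)) {
  transmute_order <- rep("", length(cur_exprs))
}

# Replace anything without a name with a name-like representation of itself.
unnamed <- is.na(transmute_order) | !nzchar(transmute_order)
transmute_order[unnamed] <- cur_exprs[unnamed]

print(transmute_order)
```

Checking both `is.na()` and `!nzchar()` covers the case where missing names show up as empty strings rather than NA, which is what names() returns for a partially named list.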

r/tests/testthat/test-dplyr-mutate.R

jonkeane

This is fantastic! Thanks for this

ursabot · 2022-04-09T00:51:13Z

Benchmark runs are scheduled for baseline = f0b5c49 and contender = b829943. b829943 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.17% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.71% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.17% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/473| b8299436 ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/458| b8299436 test-mac-arm>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/459| b8299436 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/468| b8299436 ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/472| f0b5c49a ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/457| f0b5c49a test-mac-arm>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/458| f0b5c49a ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/467| f0b5c49a ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

…option is set

This PR does two things to match some dplyr behaviour around column order:

1) Mimics dplyr implementation of `mutate(..., .keep = "none")` to append new columns after the existing columns (if suggested) as [per](tidyverse/dplyr#6086)
2) As per this [discussion](tidyverse/dplyr#6086), this required a bespoke approach to `transmute` as it is not simply a wrapper for `mutate(..., .keep = "none")`. This cascades into needing to catch a couple of edge cases.

I have also added some tests which will test for this behaviour.

Closes apache#12818 from boshek/mutate-keep

Authored-by: SAm Albers <[email protected]>
Signed-off-by: Jonathan Keane <[email protected]>

boshek added 2 commits April 6, 2022 11:54

match dplyr behaviour for .keep='none'

29676b2

match transmute behaviour to dplyr

93f8aa1

github-actions bot added the Component: R label Apr 6, 2022

jonkeane requested changes Apr 6, 2022

boshek added 4 commits April 6, 2022 15:39

test transmute with unnamed arguments

d42b9d0

remove comment about equivalence to transmute

a485bae

align style

535c804

clear up intent of transmute order

0bab8e1

jonkeane approved these changes Apr 7, 2022

jonkeane closed this in b829943 Apr 7, 2022

boshek deleted the mutate-keep branch April 8, 2022 15:29
