Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16038: [R] different behavior from dplyr when mutate's .keep option is set #12818

Closed
wants to merge 6 commits into from
Closed

Conversation

boshek
Copy link
Contributor

@boshek boshek commented Apr 6, 2022

This PR does two things to match some dplyr behaviour around column order:

  1. Mimics dplyr implementation of mutate(..., .keep = "none") to append new columns after the existing columns (if suggested) as per

  2. As per this discussion, this required a bespoke approach to transmute as it not simply a wrapper for mutate(..., .keep = "none"). This cascades into needing to catch a couple edge cases.

I have also added some tests which will test for this behaviour.

@github-actions
Copy link

github-actions bot commented Apr 6, 2022

@github-actions
Copy link

github-actions bot commented Apr 6, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for digging into this. The behavior + history on the dplyr side is a lot. I have a few relatively minor comments and suggestions

r/R/dplyr-mutate.R Show resolved Hide resolved
r/R/dplyr-mutate.R Outdated Show resolved Hide resolved
r/tests/testthat/test-dplyr-mutate.R Show resolved Hide resolved
r/R/dplyr-mutate.R Outdated Show resolved Hide resolved
## keeping with: https://github.com/tidyverse/dplyr/issues/6086
cur_exprs <- vapply(dots, as_label, character(1))
new_col_names <- names(dots)
transmute_order <- ifelse(nzchar(new_col_names), new_col_names, cur_exprs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a bit of trouble following what this line is doing here — what behavior are we trying to catch here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very open to other approaches but the dplyr implementation of transmute return the columns in the order they are given in the function call which is now distinct from mutate(..., .keep = "none"). So this was what i came up with to preserve that order. But there possibly could be a better way to capture order?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nods sorry I should have been a bit more specific — I didn't totally follow what might cause the names of new_col_names to be empty (I imagine that's what happens with any current columns at this point, given this ifelse(): so here cur_exprs has names if it's a new column, but does not if it doesn't, yeah?), but that isn't super super obvious on first read. And actually, now that I'm looking at it, maybe the intention is to replace any thing in dots that doesn't have a name with what's in cur_expers?

Would it be possible to do something like:

cur_exprs <- map_chr(dots, as_label)
transmute_order <- names(dots)
transmute_order[nzchar(transmute_order)] <- cur_exprs[nzchar(transmute_order)]

or even something like purrr::modify or purrr::modify_if? https://purrr.tidyverse.org/reference/modify.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe the intention is to replace any thing in dots that doesn't have a name with what's in cur_expers

Yes you are exactly right. I just realized that because map_chr returns a named vector I can equally get the names from names(cur_exprs). Also coalesce might work here too. What about this?

  cur_exprs <- map_chr(dots, as_label)
  transmute_order <- dplyr::coalesce(na_if(names(cur_exprs), ""), cur_exprs)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dplyr::coalesce works here (and since this is a dplyr function, we can be pretty sure dplyr is already installed), but generally we try to shy away from using dplyr as a dependency.

So I would say let's get the names from cur_exprs but then do the more base transmute_order[is.na(names(...))] <- ... assignment style. IMO it also is slightly more explicit about what we're doing (replacing anything that doesn't have a name with a name-like representation of itself)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

r/tests/testthat/test-dplyr-mutate.R Outdated Show resolved Hide resolved
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic! Thanks for this

@jonkeane jonkeane closed this in b829943 Apr 7, 2022
@boshek boshek deleted the mutate-keep branch April 8, 2022 15:29
@ursabot
Copy link

ursabot commented Apr 9, 2022

Benchmark runs are scheduled for baseline = f0b5c49 and contender = b829943. b829943 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.17% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.71% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.17% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/473| b8299436 ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/458| b8299436 test-mac-arm>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/459| b8299436 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/468| b8299436 ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/472| f0b5c49a ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/457| f0b5c49a test-mac-arm>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/458| f0b5c49a ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/467| f0b5c49a ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

jcralmeida pushed a commit to rafael-telles/arrow that referenced this pull request Apr 19, 2022
…option is set

This PR does two things to match some dplyr behaviour around column order:

1) Mimics dplyr implementation of `mutate(..., .keep = "none")` to append new columns after the existing columns (if suggested) as [per](tidyverse/dplyr#6086)

2) As per this [discussion](tidyverse/dplyr#6086), this required a bespoke approach to `transmute` as it not simply a wrapper for `mutate(..., .keep = "none")`. This cascades into needing to catch a couple edge cases.

I have also added some tests which will test for this behaviour.

Closes apache#12818 from boshek/mutate-keep

Authored-by: SAm Albers <[email protected]>
Signed-off-by: Jonathan Keane <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants