-
Notifications
You must be signed in to change notification settings - Fork 985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Column Select Helper #4248
Column Select Helper #4248
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4248 +/- ##
=======================================
Coverage 99.61% 99.61%
=======================================
Files 72 72
Lines 13916 13937 +21
=======================================
+ Hits 13862 13883 +21
Misses 54 54 Continue to review full report at Codecov.
|
R/utils.R
Outdated
|
||
if (is.call(colsub)){ | ||
# fix for #1216, make sure the parentheses are peeled from expr of the form (((1:4))) | ||
while(colsub %iscall% "(") colsub = as.list(colsub)[[-1L]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should include {
here too maybe?
if (length(colsub) == 3L && colsub[[1L]] == ":") { | ||
if (is.name(colsub[[2L]])){ | ||
# cols is of the format a:c | ||
rnge = chmatch(c(as.character(colsub[[2L]]), as.character(colsub[[3L]])), names(x)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about !is.name(colsub[[3L]])
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exactly, some recent commit in my branch was addressing cases like var1:5
or 1:var1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now both need to be names. If either is a character, a new error is raised. Otherwise, it evaluates in the parent frame. I largely followed Jan's work so now V2 -V1
errors as well.
Looks great overall! A much needed refactor |
My general advice would be to split |
I would say this is unexpected. Unless of course is a character vector specifying columns, or a numeric, or logic of length ncol(DT), or a function. See my inline comment as well. |
IMO .SDcols when being logical should always be equal length to ncol, otherwise raise exception. See my inline comment as well. |
include gettextf change from origin_call to mode repeat logical vectors of length 1
R/utils.R
Outdated
if (is.logical(cols)) { | ||
if ((col_len <- length(cols)) == 1L) { | ||
cols = rep(cols, length.out = x_len) | ||
} else if (col_len != x_len) { | ||
## TODO change to error in 2022 | ||
warning(gettextf("When %s is a logical vector, each column should have a TRUE or FALSE entry. The current logical vector of length %d will be repeated to length of data.table. Warning will change to error in the next verion.", mode, col_len, domain = "R-data.table")) | ||
cols = rep(cols, length.out = x_len) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RE: recycling logical vectors. Base recycles vectors to the length of the data.frame. See iris[, TRUE]
or iris[, c(TRUE, FALSE)]
. I guess I'm leaning towards allowing vectors repeat.
But more broadly, I am interested in some way to select all the variables. colsub = TRUE
seemed like a quick and easy way. names(.SD)
could work but if this function was ever applied to duplicated
or melt
or any other functions that allow column selection, I am not sure names(.SD)
would make sense. Maybe a made up call all_cols()
that we could evaluate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your use of colsub=TRUE can be achieved by not passing SDcols, or passing names, or rep(TRUE, ncol(DT)) or seq_along. Most of which requires you to refer to DT, which is not chaining friendly. You can always use a function (...) TRUE to workaround this limitation.
In the issue related to logical vector in SDcols we discussed recycling and so far conclusion was to not recycle.
I think everything has been addressed. Rolled back the two semi-big changes where I did not include For additional scrutiny, please take a second look at the warnings / stops. Thanks for all the comments and time you two spent reviewing. |
Also, I am more than happy to change this to C. Based on my initial benchmarks, Jan's C method is generally faster - I have a Similarly, I could start work on the # remotes::install_github("Rdatatable/data.table", ref = "colselect")
library(data.table)
x = as.data.table(lapply(1:5, c))
e = quote(1:3)
microbenchmark::microbenchmark(
col_helper(x, e, ".SDcols"),
data.table:::exprCols(x, 1:3, ".SDcols", TRUE, environment())
)
Unit: microseconds
expr min lq mean median uq max neval
col_helper(x, e, ".SDcols") 17.1 18.10 20.966 19.80 20.40 104.9 100
data.table:::exprCols(x, 1:3, ".SDcols", TRUE, environment()) 16.5 17.45 19.264 19.05 19.85 56.6 100
e = quote(V1:V3)
microbenchmark::microbenchmark(
col_helper(x, e, ".SDcols"),
data.table:::exprCols(x, V1:V3, ".SDcols", TRUE, environment())
)
Unit: microseconds
expr min lq mean median uq max neval
col_helper(x, e, ".SDcols") 21.4 22.4 25.375 23.7 24.3 104.9 100
data.table:::exprCols(x, V1:V3, ".SDcols", TRUE, environment()) 17.2 18.1 20.082 20.4 20.9 56.2 100
cols = c("V1", "V2", "V3")
e = quote(cols)
microbenchmark::microbenchmark(
col_helper(x, e, ".SDcols"),
data.table:::exprCols(x, cols, ".SDcols", TRUE, environment())
)
Unit: microseconds
expr min lq mean median uq max neval
col_helper(x, e, ".SDcols") 11.1 12.0 13.740 13.20 13.9 44.7 100
data.table:::exprCols(x, cols, ".SDcols", TRUE, environment()) 17.5 18.7 21.313 19.35 20.2 92.0 100 |
If |
Just saw the edit. I will work on including |
Of course the option would be only for a while, to ensure users are not affected. |
The strangest part of While I don't mind having a slow deprecation of names being in the parent.frame, it would make this PR easier if we could break the use of names in list referring to variables out of frame. Otherwise, it is a lot of checks for a use case that should be discouraged - we should have never let out-of-frame variables be in the by!
|
The silence is deafening for breaking I assume changing errors / warnings is OK. I also assume new behavior is OK (e.g.,
|
New behaviour like providing function to 'by' is better to be avoided. If no one requested that then there is not really a need (yet) to have it, and maintain it. Changed messages are fine. Changed behaviour is fine as long as there is an issue for that, where there is an agreement on the change. Changed behaviour that is not really a fix, but change to API has to be optional, like list(parent_scope_var). Then affected users can migrate the code more easily. |
This PR is complete for |
Closes #4115
Closes #4231.
Towards #852.
33 errors until
by
is included...See also @jangorecki very detailed approach in C in #4174. I see this as a stepping stone to getting towards Jan's C solution as we solve some of the API issues.
Mostly internal replacing the
.SDcols
evaluation with a new helper function with slight refactoring.Functionality differences:
(-1L)
and the integer is within the column range of dt, the set difference is returned (e.g. someone did.SDcols = (-3L)
on a 5 column data.table which would returnc(1L, 2L, 4L, 5L)
.Internal Differences
1:3
orV1:V3
is evaluated was changed to be more performant.%iscall%
is used less frequently because the parsing is evaluated differently.patterns
does not actually includedo_patterns
. In the context of.SDcols
,j
, orby
, the argument ofpatterns(..., cols = SOMETHING)
does not make sense. Instead,grep
is used directly in the context ofnames(x)
.Random note
DT
in the global environment. Both 2036.1 and 2036.2 produce the new variable. The tests are below:To Do:
Long Term: