Column Select Helper #4248

ColeMiller1 · 2020-02-18T02:20:08Z

Closes #4115
Closes #4231.
Towards #852.

33 errors until by is included...

See also @jangorecki's very detailed approach in C in #4174. I see this as a stepping stone towards Jan's C solution as we solve some of the API issues.

Mostly internal: the .SDcols evaluation is replaced with a new helper function, with slight refactoring.

Functionality differences:

If .SDcols evaluates to a negative integer such as (-1L) and its absolute value is within the column range of dt, the set difference is returned (e.g. .SDcols = (-3L) on a 5 column data.table returns c(1L, 2L, 4L, 5L)).
If .SDcols is a logical vector with length greater than 1 but less than the number of columns in dt, a warning message is displayed. The long term solution will be for it to throw an error and later on silently fail.
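
The two behaviours listed above can be sketched in base R. This is only an illustration under stated assumptions, not the PR's col_helper: the function name resolve_cols and its arguments are hypothetical, and the warning text is a placeholder.

```r
# Hypothetical helper for illustration; not the PR's col_helper().
resolve_cols = function(cols, x_len) {
  if (is.logical(cols)) {
    # a logical vector longer than 1 but shorter than the table is recycled with a warning
    if (length(cols) > 1L && length(cols) < x_len) {
      warning("logical column selector is shorter than the number of columns; recycling")
    }
    return(which(rep(cols, length.out = x_len)))
  }
  cols = as.integer(cols)
  if (all(cols < 0L)) {
    # negative positions: return the set difference against all column positions
    return(setdiff(seq_len(x_len), -cols))
  }
  cols
}

resolve_cols(-3L, 5L)                               # 1 2 4 5
suppressWarnings(resolve_cols(c(TRUE, FALSE), 5L))  # 1 3 5
```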

Internal Differences

The way 1:3 or V1:V3 is evaluated was changed to be more performant.

x = data.table(V1 = 1, V2 = 2, V3 = 1, V4= 5, V5 = 5)
colsub = quote(V1:V3)
microbenchmark::microbenchmark(
  new_way = {
    rnge = data.table:::chmatch(c(as.character(colsub[[2L]]), as.character(colsub[[3L]])), names(x))
    cols = rnge[1L]:rnge[2L]
  },
  old_way =  eval(colsub, data.table:::setattr(as.list(seq_along(x)), 'names', names(x)), parent.frame())
)
##Unit: microseconds
##    expr  min    lq   mean median    uq   max neval
## new_way 20.8 22.05 31.324   23.0 28.85 145.4   100
## old_way 27.8 28.90 35.059   29.7 30.70 137.2   100

colsub = quote(1:3)
microbenchmark::microbenchmark(
  new_way =  eval(colsub),
  old_way =  eval(colsub, data.table:::setattr(as.list(seq_along(x)), 'names', names(x)), parent.frame())
)
##Unit: microseconds
##    expr  min   lq   mean median    uq  max neval
## new_way 10.6 11.4 12.689   11.8 12.20 67.4   100
## old_way 27.2 28.0 30.421   28.5 29.15 70.9   100

The cool new %iscall% is used less frequently because the parsing is evaluated differently.
The evaluation of patterns does not actually include do_patterns. In the context of .SDcols, j, or by, the argument of patterns(..., cols = SOMETHING) does not make sense. Instead, grep is used directly in the context of names(x).
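
As a rough sketch of the grep-on-names idea, assuming a single regular expression and a plain character vector standing in for names(x) (the name select_by_pattern is hypothetical and not part of the PR):

```r
# Stand-in for names(x); the column names are made up for illustration.
nms = c("V1", "V2", "W1", "W2")

# Hypothetical helper: resolve one pattern directly against the column names.
select_by_pattern = function(pattern, nms) {
  grep(pattern, nms)
}

select_by_pattern("^V", nms)  # 1 2
```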

Random note

At the end of the tests, there is a new variable DT in the global environment. Both 2036.1 and 2036.2 produce the new variable. The tests are below:

setup = c('DT = data.table(a = 1)')
writeLines(c(setup, 'DT[ , a := 1]'), tmp<-tempfile())
test(2036.1, !any(grepl("1:     1", capture.output(source(tmp, echo = TRUE)), fixed = TRUE)))
## test force-printing still works
writeLines(c(setup, 'DT[ , a := 1][]'), tmp)
test(2036.2, source(tmp, echo = TRUE), output = "1:\\s+1")

To Do:

See if any more tests are needed

Long Term:

Include by vars and j vars
Consolidate other column select which would likely include duplicated, unique, setcolorder, setnames, and probably more.
See what vignettes could be updated.

codecov · 2020-02-18T02:32:01Z

Codecov Report

Merging #4248 into master will increase coverage by 0.00%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #4248   +/-   ##
=======================================
  Coverage   99.61%   99.61%           
=======================================
  Files          72       72           
  Lines       13916    13937   +21     
=======================================
+ Hits        13862    13883   +21     
  Misses         54       54


MichaelChirico · 2020-02-18T03:11:27Z

R/utils.R

+
+  if (is.call(colsub)){
+    # fix for #1216, make sure the parentheses are peeled from expr of the form (((1:4)))
+    while(colsub %iscall% "(") colsub = as.list(colsub)[[-1L]]


should include { here too maybe?
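
For reference, peeling both wrappers could look roughly like the following. It is a minimal sketch using only base R; the name peel is hypothetical and the PR's loop relies on the internal %iscall% instead.

```r
# Hypothetical helper: strip redundant "(" and single-expression "{" wrappers.
peel = function(e) {
  while (is.call(e) && length(e) == 2L &&
         (identical(e[[1L]], as.name("(")) || identical(e[[1L]], as.name("{")))) {
    e = e[[2L]]  # keep only the wrapped expression
  }
  e
}

peel(quote((({(1:3)}))))  # 1:3
peel(quote(((V1:V3))))    # V1:V3
```

A brace block holding more than one expression has length greater than 2, so this sketch leaves it untouched.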

MichaelChirico · 2020-02-18T03:13:53Z

R/utils.R

+    if (length(colsub) == 3L && colsub[[1L]] == ":") {
+      if (is.name(colsub[[2L]])){
+        # cols is of the format a:c
+        rnge = chmatch(c(as.character(colsub[[2L]]), as.character(colsub[[3L]])), names(x))


what about !is.name(colsub[[3L]])?

exactly, some recent commit in my branch was addressing cases like var1:5 or 1:var1

Now both need to be names. If either is a character, a new error is raised. Otherwise, it evaluates in the parent frame. I largely followed Jan's work so now V2 -V1 errors as well.
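
The rule described here could be sketched as below. This is an assumption-laden illustration, not the PR code: range_cols is a hypothetical name, base match() stands in for the internal chmatch(), and falling back to evaluation when the names are not columns is a guess at the intended behaviour.

```r
# Hypothetical sketch of resolving a range such as V1:V3 or 1:3.
range_cols = function(colsub, nms, envir = parent.frame()) {
  force(envir)
  stopifnot(is.call(colsub), length(colsub) == 3L,
            identical(colsub[[1L]], as.name(":")))
  lhs = colsub[[2L]]
  rhs = colsub[[3L]]
  if (is.character(lhs) || is.character(rhs)) {
    stop("character endpoints are not supported in a range")
  }
  if (is.name(lhs) && is.name(rhs)) {
    # both endpoints are names: look them up among the column names
    rnge = match(c(as.character(lhs), as.character(rhs)), nms)
    if (!anyNA(rnge)) return(rnge[1L]:rnge[2L])
  }
  # anything else (1:3, var1:5, ...) is evaluated in the calling frame
  as.integer(eval(colsub, envir))
}

nms = paste0("V", 1:5)
range_cols(quote(V1:V3), nms)  # 1 2 3
range_cols(quote(1:3), nms)    # 1 2 3
```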

MichaelChirico · 2020-02-18T03:31:23Z

Looks great overall! A much needed refactor

jangorecki · 2020-02-21T05:18:29Z

My general advice would be to split j handling from .SDcols handling into separate functions that eventually share some internal functions.

jangorecki · 2020-02-21T05:19:11Z

.SDcols now accepts a single name as an argument such as data.table(V1 = 1)[, .SD, .SDcols = V1]

I would say this is unexpected. Unless of course V1 is a character vector specifying columns, or a numeric, or a logical of length ncol(DT), or a function.

See my inline comment as well.

jangorecki · 2020-02-21T05:20:16Z

If .SDcols is TRUE, all columns will be returned which is a step towards group_by_all or just a generic way to select all columns.

IMO .SDcols when being logical should always be equal length to ncol, otherwise raise exception.

See my inline comment as well.

include gettextf change from origin_call to mode repeat logical vectors of length 1

ColeMiller1 · 2020-02-22T23:07:58Z

R/utils.R

+  if (is.logical(cols)) {
+    if ((col_len <- length(cols)) == 1L) {
+      cols = rep(cols, length.out = x_len)
+    } else if (col_len != x_len) {
+      ## TODO change to error in 2022
+      warning(gettextf("When %s is a logical vector, each column should have a TRUE or FALSE entry. The current logical vector of length %d will be repeated to length of data.table. Warning will change to error in the next verion.", mode, col_len, domain = "R-data.table"))
+      cols = rep(cols, length.out = x_len)
+    } 


RE: recycling logical vectors. Base recycles vectors to the length of the data.frame. See iris[, TRUE] or iris[, c(TRUE, FALSE)]. I guess I'm leaning towards allowing vectors to repeat.

But more broadly, I am interested in some way to select all the variables. colsub = TRUE seemed like a quick and easy way. names(.SD) could work but if this function was ever applied to duplicated or melt or any other functions that allow column selection, I am not sure names(.SD) would make sense. Maybe a made up call all_cols() that we could evaluate?

Your use of colsub=TRUE can be achieved by not passing .SDcols, or by passing names, rep(TRUE, ncol(DT)) or seq_along. Most of these require you to refer to DT, which is not chaining friendly. You can always use function(...) TRUE to work around this limitation.
In the issue related to logical vectors in .SDcols we discussed recycling, and so far the conclusion was not to recycle.
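
To make the two positions concrete, here is a small base R sketch. The first line shows the base-style recycling mentioned above; strict_logical_cols is a hypothetical helper showing what a no-recycling rule might look like, with a placeholder error message.

```r
df = data.frame(a = 1, b = 2, c = 3, d = 4, e = 5)

# Base-style recycling: the length-2 selector is repeated across 5 columns.
names(df)[c(TRUE, FALSE)]  # "a" "c" "e"

# Hypothetical strict alternative: the selector must cover every column.
strict_logical_cols = function(cols, x_len) {
  stopifnot(is.logical(cols))
  if (length(cols) != x_len) {
    stop(sprintf("logical selector has length %d but there are %d columns",
                 length(cols), x_len))
  }
  which(cols)
}

strict_logical_cols(c(TRUE, FALSE, TRUE, FALSE, TRUE), ncol(df))  # 1 3 5

# A short selector is rejected instead of recycled; capture the message here.
tryCatch(strict_logical_cols(c(TRUE, FALSE), ncol(df)),
         error = function(e) conditionMessage(e))
```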

ColeMiller1 · 2020-02-23T03:35:21Z

I think everything has been addressed. Rolled back the two semi-big changes where .SDcols = TRUE would return all columns and .SDcols = V1 where V1 was a variable name within dt would return the V1 column.

I did not include { brackets. (({(1:3)})) will currently work and (((V1:V3))) will currently work. I am unsure of the use case of ({V1:V3}) but if there is still desire, it would be easy to add.

For additional scrutiny, please take a second look at the warnings / stops.

Thanks for all the comments and time you two spent reviewing.

ColeMiller1 · 2020-02-27T01:35:01Z

Also, I am more than happy to change this to C. Based on my initial benchmarks, Jan's C method is generally faster - I have a "negate" attribute while my timings for Jan's include :::. I think the R could be refactored to improve performance (e.g., if V1:V3 matches, there's no need to do additional checks) but then I would have to repeat similar code multiple times.

Similarly, I could start work on the by or j aspects but that would increase the reach of this PR further than ideal. After merging, I'd start work on by as I think it's a better candidate than j.

# remotes::install_github("Rdatatable/data.table", ref = "colselect")
library(data.table)

x = as.data.table(lapply(1:5, c))

e = quote(1:3)
microbenchmark::microbenchmark(
col_helper(x, e, ".SDcols"),
data.table:::exprCols(x, 1:3, ".SDcols", TRUE, environment())
)
Unit: microseconds
                                                          expr  min    lq   mean median    uq   max neval
                                   col_helper(x, e, ".SDcols") 17.1 18.10 20.966  19.80 20.40 104.9   100
 data.table:::exprCols(x, 1:3, ".SDcols", TRUE, environment()) 16.5 17.45 19.264  19.05 19.85  56.6   100

e = quote(V1:V3)
microbenchmark::microbenchmark(
  col_helper(x, e, ".SDcols"),
  data.table:::exprCols(x, V1:V3, ".SDcols", TRUE, environment())
)
Unit: microseconds
                                                            expr  min   lq   mean median   uq   max neval
                                     col_helper(x, e, ".SDcols") 21.4 22.4 25.375   23.7 24.3 104.9   100
 data.table:::exprCols(x, V1:V3, ".SDcols", TRUE, environment()) 17.2 18.1 20.082   20.4 20.9  56.2   100

cols = c("V1", "V2", "V3")
e = quote(cols) 
microbenchmark::microbenchmark(
  col_helper(x, e, ".SDcols"),
  data.table:::exprCols(x, cols, ".SDcols", TRUE, environment())
)
Unit: microseconds
                                                           expr  min   lq   mean median   uq  max neval
                                    col_helper(x, e, ".SDcols") 11.1 12.0 13.740  13.20 13.9 44.7   100
 data.table:::exprCols(x, cols, ".SDcols", TRUE, environment()) 17.5 18.7 21.313  19.35 20.2 92.0   100

jangorecki · 2020-02-27T02:26:30Z

If j is going to be handled in this PR, then it is probably best to do it now. Once the PR is ready to merge, we need to run revdep checks against this branch. It could also be useful to provide an option to escape it, so users who are affected by this change can easily turn it off.

ColeMiller1 · 2020-03-01T22:48:52Z

Just saw the edit. I will work on including by. The only way I would start work on j is if you would be cool with a select.j(...) helper function in j. Otherwise, keeping a global option for j wouldn't really get us towards a modular data.table and would only increase the complexity of the code.

jangorecki · 2020-03-02T02:44:42Z

Of course the option would be only for a while, to ensure users are not affected.

ColeMiller1 · 2020-03-11T02:09:17Z

The strangest part of by is that it allows vectors in the parent.frame to be used for grouping. Even stranger is that we also allow these vectors to be names or even arguments in lists.

While I don't mind having a slow deprecation of names being in the parent.frame, it would make this PR easier if we could break the use of names in list referring to variables out of frame. Otherwise, it is a lot of checks for a use case that should be discouraged - we should have never let out-of-frame variables be in the by!

library(data.table)
n = 5L
dt = data.table(V1 = rnorm(n))
set.seed(1L)
out_var = sample(n, n, replace = TRUE)

dt[, sum(V1), by = out_var] ##slow deprecation
dt[, sum(V1), by = list(out_var)] ##break

ColeMiller1 · 2020-03-14T01:26:44Z

The silence is deafening for breaking by = list(parent_frame_variable) :). I have fixed it but I still plan on trying to start a slow deprecation process.

I assume changing errors / warnings is OK. I also assume new behavior is OK (e.g., by = is.factor). Is the new naming convention OK, assuming the reverse dependency checks are fine?

DT = data.table(a = 1:10)
DT[ , b := 10:1]
## test 1984.04
## current:
data.table(expression = c(1, 0), V1 = c(6, 5))
##   expression    V1
##        <num> <num>
##1:          1     6
##2:          0     5

## proposed:
DT[ , mean(b), by = eval(expression(a %% 2))]
##       a    V1
##   <num> <num>
##1:     1     6
##2:     0     5

jangorecki · 2020-03-14T03:00:07Z

New behaviour, like providing a function to 'by', is better avoided. If no one has requested it, there is not really a need (yet) to have it and maintain it. Changed messages are fine. Changed behaviour is fine as long as there is an issue for it where the change has been agreed. Changed behaviour that is not really a fix but a change to the API has to be optional, like list(parent_scope_var). Then affected users can migrate their code more easily.

ColeMiller1 · 2020-03-14T10:16:46Z

This PR is complete for .SDcols. My goal towards a modular [.data.table was to introduce consistency for column selection and reuse code where possible. I am closing as that goal does not seem possible.

ColeMiller1 added 4 commits February 17, 2020 20:11

col_helper replicates .SDcols functionality

a2f2833

change .SDcols to col_helper

9e75f82

Update tests.Rraw

086d6f6

update warning message

761e4df

ColeMiller1 added the WIP label Feb 18, 2020