columns selection helper for .SDcols, j expr #4174

jangorecki · 2020-01-13T17:57:39Z

In general towards more modular [.data.table #852 and step by step push-down it to C.
In particular this PR should address #4115 (#2178), #4004, #2069, #2059 and probably some others as well... #4231, #4235
Still very much WIP, pushing now because @ColeMiller1 is also developing around those issues so good to not duplicate the same work.
Note that I will be offline this week, so take your time giving feedback.

There is one incosnsistency that we have to work out:
j now supports column selection by single symbol, unquoted column name.
.SDcols now supports columns selection by single symbol, function name.
Those cannot be easily combined, if a data.table will have a column of the same name as the function that user is trying to use then the function will be ignored (eventually .. prefix will help here). The tricky part is that to select column by symbol we do not want to evaluate symbol, but to check if the symbol is a function we have to evaluate it. Lets leave it for now. Once PR will be more mature I will come back to that issue with illustrative example.

ColeMiller1 · 2020-01-16T23:16:07Z

Since .SDcols evaluates to ansvars and ansvals, it would be good for the function to either return a named vector or a list of a character vector and integer vector.

It would be good to bring by into the fray. Below is a compiled list - hopefully it will be helpful instead of just too much stuff:

dt = as.data.table(lapply(1:5, c))

Names that are within the data.table

type	j	.SDcols	by	notes
V1	+	error	+	j selects vector - by keeps as list
-V1	~	error	~	makes negative
!V1	~	error	~	makes logical
V1:V3	+	+	+
-V1:V3	error	error	error	trying to do -1:3 which is pos and neg index
-(V1:V3)	+	+	error
!V1:V3	~	+	error	j makes logical
!(V1:V3)	+	+	error
V3:V1	+	+	+
-V3:V1	error	error	error	+ve and -ve indices
-(V3:V1)	+	+	error
!V3:V1	~	+	error	j selects logical vector
!(V3:V1)	+	+	error

Quoted names that are within the data.table

type	j	.SDcols	by	notes
'V1'	+	+	+
-'V1'	+	+	error
!'V1'	+	+	error
c('V1', 'V3')	+	+	+
-c('V1', 'V3')	+	+	error
!c('V1', 'V3')	+	+	error
'V1,V3'	error	error	+	recommend removing
-'V1,V3'	error	error	error
!'V1,V3'	error	error	error

numbers

type	j	.SDcols	by	notes
1L	+	+	~	by creates new unnamed list
-1L	+	+	~	by creates new list named "-"
!1L	+	+	~	by creates new logical named "!"
1	+	+	~	see above
-1	+	+	~	see above
!1	+	+	~	see above
1:2	+	+	error
-1:2	error	error	error	creates +ve and -ve indices
-(1:2)	+	+	error
!1:2	+	+	error
!(1:2)	+	+	error
c(1,2)	+	+	error
-c(1,2)	+	+	error
!c(1,2)	+	+	error

character vector char_cols = c("V1", "V3")

type	j	.SDcols	by	notes
char_cols	errors	+	+	j has helpful error
-char_cols	errors	+	errors
!char_cols	errors	+	errors
..char_cols	+	errors	errors
!..char_cols	+	errors	errors
-..char_cols	+	errors	errors

character vector V1 = c("V1", "V3")

type	j	.SDcols	by	notes
V1	~	~	~	j extracts vector from dt; .SDcols selects the character vector names
-V1	~	~	~	j and by become logical
!V1	~	~	~	j and by become logical
..V1	+	error	error
-..V1	+	error	error
!..V1	+	error	error

int_cols = c(1,3)

type	j	.SDcols	by	notes
int_cols	errors	+	errors	helpful j error
-int_cols	~	+	error	j produces negative vector
!int_cols	~	error	error	j produces logical
..int_cols	+	errors	errors
-..int_cols	+	errors	errors
!..int_cols	+	errors	errors

lgl_cols = c(TRUE, TRUE, FALSE, FALSE, FALSE)

type	j	.SDcols	by	notes
lgl_cols	error	+	error	j has helpful error
-lgl_cols	~	+	error	j returns c(-1,-1,0,0,0)
!lgl_cols	~	+	error	j returns !lgl_cols
..lgl_cols	+	error	error
-..lgl_cols	+	error	error
!..lgl_cols	+	error	error

lgl_cols = c(TRUE, FALSE)

type	j	.SDcols	by	notes
lgl_cols	error	~	error	j has helpful error; .SDcols repeats vector
-lgl_cols	error	~	error	j has helpful error; .SDcols repeats vector
!lgl_cols	error	~	error	j has helpful error; .SDcols repeats vector
..lgl_cols	~	error	error	j selects only the first column
-..lgl_cols	~	error	error	j selects only the first column
!..lgl_cols	~	error	error	j selects only the first column

patterns

type	j	.SDcols	by
patterns('V1')	error	+	error
!patterns('V1')	error	+	error
-patterns('V1')	error	+	error
patterns('V1','V3')	error	+	error
!patterns('V1','V3')	error	+	error
-patterns('V1','V3')	error	+	error

function

type	j	.SDcols	by	notes
is.numeric	error	+	error	j has helpful error for some reason
!is.numeric	error	+	error
-is.numeric	error	+	error
function(x) is.numeric(x)	error	+	error	j has helpful error for some reason
!function(x) is.numeric(x)	error	+	error
-function(x) is.numeric(x)	error	+	error

calls; char_cols = c("V1", "V3")

type	j	.SDcols	by	notes
c(char_cols, "V2")	~	+	+	j produces char vector
c(..char_cols, "V2")	+	error	error
c(2L, lapply(.SD, '*', 2))	+	error	error	j returns additional column

jangorecki · 2020-01-17T08:05:44Z

@ColeMiller1 yes, tables are useful, good roadmap for unit tests.

src/utils.c

MichaelChirico · 2020-02-02T16:09:52Z

src/utils.c

+ */
+SEXP replace_dot_alias(SEXP x) {
+  if (isLanguage(x) && !isFunction(CAR(x))) {
+    if (CAR(x) == sym_bquote) // handling `.` inside bquote, #1912


nice (though a bit surprising) that == works here!

Maybe because it is not STRSXP but CHARSXP

jangorecki · 2020-02-05T05:25:33Z

just documenting current behaviour of peeling from (

> x[, c("V1","V3")]
      V1    V3
   <int> <int>
1:     1     3
> x[, ((c("V1","V3")))]
[1] "V1" "V3"
> x[, .SD, .SDcols=c("V1","V3")]
      V1    V3
   <int> <int>
1:     1     3
> x[, .SD, .SDcols=((c("V1","V3")))]
      V1    V3
   <int> <int>
1:     1     3

I think it make sense to keep it like that so extra ( can be used by user to escape our NSE dispatch to column selection. This is handled in 52fed7c#diff-74710e3ecbe94109f39de34133e1c8caR341

jangorecki · 2020-02-05T13:59:18Z

R -q -e 'cc("colselect.Rraw")'

can be used for spotting regression when doing changes for addressing new tests

jangorecki · 2020-02-11T13:57:50Z

Current behaviour of handling peeling from ( might be confusing when combined with !. ( in top call of expression in j escapes the column selection.

> x[, !("V5")]
      V1    V2    V3    V4
   <int> <int> <int> <int>
1:     1     2     3     4
> x[, ("V5")]
[1] "V5"

Just documenting it here.
If you think it should be discussed/changed, please fill a new issue.

ColeMiller1 · 2020-02-11T22:02:55Z

R/data.table.R

@@ -213,8 +203,12 @@ replace_dot_alias = function(e) {
  av = NULL
  jsub = NULL
  if (!missing(j)) {
+    colselect = getOption("datatable.colselect", FALSE)


How long would this be an option?

I will write more on that when PR will be more mature. It has to be an option unfortunately, there are so many corner cases than it is very unlikely to address them maintaining backward compatibility.

MichaelChirico · 2020-02-16T16:19:37Z

man/duplicated.Rd

-are being used. That was changed recently for consistency to data.frame methods.
+\item{by}{\code{character}, \code{integer} or \code{logical} vector indicating
+which combinations of columns from \code{x} to use for uniqueness checks. By
+default all columns are being used. That was changed recently for consistency to


probably we can either remove the last sentence or at least "recently" since it's from >3 years ago:

da5974b

MichaelChirico · 2020-02-16T16:23:26Z

R/data.table.R

-  }
-  e
-}
+replace_dot_alias = function(e) .Call(Creplace_dot_aliasR, e)


may as well move this to wrappers.R

jangorecki · 2020-02-17T03:33:27Z

This PR attempts to implement C helper function to investigate j and .SDcols arguments. It guesses if column selection was intended (just selecting columns with=FALSE) or it leaves as expression to be evaluated later on in scope of data.table (with=TRUE).
It is unfinished, and as of now I am not planning to work on it further, thus closing. The complexity of handling all cases is tricky enough. Basically there other more useful things I want to allocate my time into. Anyone is welcome to pick it up. What is left is to address requirements defined by tests.

cc("colselect.Rraw")
options(datatable.colselect=TRUE); cc("tests.Rraw")

We could isolate the logic into R helper function instead. Then it will be definitely less complex. On the other hand C implementation gives a fine-grained definition of column selection logic, which is very valuable. Keep in mind that speed is not a concern as we don't operate on data.

ColeMiller1 · 2020-02-17T13:22:21Z

Was the .SDcols mode far enough along? If so, I can work on pushing that portion of the code for C. Otherwise, I will push an R solution - it's a lot easier for me to test and develop.

P.S., All the tests and and work you did was super impressive.

ColeMiller1 · 2020-02-20T12:07:38Z

src/utils.c

+    peeled = true;
+  }
+  SEXP sym_brackets = install("{");
+  while (isLanguage(expr) && CAR(expr)==sym_brackets && length(expr)==2) {


@jangorecki Is isLanguage equivalent to is.call()? Also, why do the brackets need the length(expr) == 2)? Is that to only peel one layer? Also, I'm not sure for etiquette on closed PRs or if you even receive notifications.

here's the source for is.call

https://github.com/wch/r-source/blob/trunk/src/main/coerce.c#L2072-L2073

case 300: /* is.call */ LOGICAL0(ans)[0] = TYPEOF(CAR(args)) == LANGSXP;

almost identical to isLanguage:

https://github.com/wch/r-source/blob/trunk/src/include/Rinlinedfuns.h#L903-L906

INLINE_FUN Rboolean isLanguage(SEXP s) { return (s == R_NilValue || TYPEOF(s) == LANGSXP); }

So it appears that for how we set up our functions, TYPEOF(s) == LANGSXP would be equivalent of is.call() since I believe args is normally a list for the C arguments. I am asking because I am interested in a more direct %iscall%. Thanks!

only { needs length check because length(quote({1; 2})) should not be peeled. ( will be always length 2.

ColeMiller1 · 2020-03-02T12:05:40Z

src/utils.c

+        SET_TAG(CDDDR(lapply), install("pos"));
+        SEXP env = eval(lapply, rho);
+        SEXP evaleval = lang3(install("substitute"), expr, env);
+        SEXP cols = eval(eval(evaleval, rho), rho); // huh...


SEXP cols = eval(eval(evaleval, rho), rho); // huh...

+1.

So, I understand we need to replace c(..cols1, ..cols2), did you get as far as c(..cols1, lapply(.SD, sum))? I see a little farther you reference the clash between the two modes.

I don't think so it did worked like this already

jangorecki added 3 commits January 9, 2020 18:08

C replace_dot_alias

8336d24

colnamesInt get support for logical

f2b7ded

dev cols selection helper

5072be3

jangorecki added the WIP label Jan 13, 2020

copy-paste typo fix

8a3e6e2

ColeMiller1 reviewed Jan 17, 2020

View reviewed changes

src/utils.c Outdated Show resolved Hide resolved

ColeMiller1 reviewed Jan 17, 2020

View reviewed changes

src/utils.c Outdated Show resolved Hide resolved

ColeMiller1 reviewed Jan 17, 2020

View reviewed changes

src/utils.c Outdated Show resolved Hide resolved

ColeMiller1 reviewed Jan 17, 2020

View reviewed changes

src/utils.c Outdated Show resolved Hide resolved

jangorecki added 7 commits January 17, 2020 17:11

amend to feedback from PR

1ed7468

proper place

4a26eb8

avoid : in favor of c

3ecb82e

amend feedback from Cole

6e008b5

dev

2fb8b1a

proper order not to mask patterns

a1d9bd8

cleanup tests

cc728db

ColeMiller1 reviewed Jan 19, 2020

View reviewed changes

src/utils.c Show resolved Hide resolved

ColeMiller1 reviewed Jan 19, 2020

View reviewed changes

src/utils.c Outdated Show resolved Hide resolved

ColeMiller1 reviewed Jan 19, 2020

View reviewed changes

src/utils.c Outdated Show resolved Hide resolved

mostly unit tests according to Cole's table

75786ad

ColeMiller1 reviewed Jan 24, 2020

View reviewed changes

src/utils.c Show resolved Hide resolved

ColeMiller1 mentioned this pull request Feb 2, 2020

names(.SD) should work #4163

Merged

4 tasks

MichaelChirico reviewed Feb 2, 2020

View reviewed changes

MichaelChirico mentioned this pull request Feb 2, 2020

Call helper #4226

Merged

jangorecki added the NSE label Feb 4, 2020

test driven dev

a0e5708

test driven dev2

52fed7c

jangorecki added 7 commits February 5, 2020 20:36

test driven dev3

62c2c8c

dotdot prefix handling, dev

6624f5c

dotdot handling in j

b5fd865

test driven dev4

69abd83

test driven dev5

c5d2134

minor cleanup

2642b03

more tests

11660d0

jangorecki added 2 commits February 11, 2020 20:06

corner cases handling

6ceeb59

more corner cases

f27de2e

ColeMiller1 reviewed Feb 11, 2020

View reviewed changes

jangorecki removed the NSE label Feb 16, 2020

jangorecki added 6 commits February 16, 2020 16:14

tests

8218870

merge to master

085c662

cleanup old test workspace

127990a

fixes for types

9a2dac4

plug into R, amend main tests

0a9ee19

address tests.sts.Rraw regression

7a02929

MichaelChirico reviewed Feb 16, 2020

View reviewed changes

R/data.table.R

}

e

}

replace_dot_alias = function(e) .Call(Creplace_dot_aliasR, e)

Copy link

Member

MichaelChirico Feb 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may as well move this to wrappers.R

jangorecki closed this Feb 17, 2020

ColeMiller1 mentioned this pull request Feb 18, 2020

Column Select Helper #4248

Closed

1 task

ColeMiller1 reviewed Feb 20, 2020

View reviewed changes

ColeMiller1 reviewed Mar 2, 2020

View reviewed changes

ColeMiller1 mentioned this pull request May 22, 2020

.SDcols peeling outer ( and edge case of :, and a logical of diff length #4470

Merged

jangorecki mentioned this pull request May 13, 2021

Allow env dot alias in substitute2 and [.data.table #4994

Closed

MichaelChirico removed the WIP label Feb 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

columns selection helper for .SDcols, j expr #4174

columns selection helper for .SDcols, j expr #4174

jangorecki commented Jan 13, 2020 •

edited

Loading

ColeMiller1 commented Jan 16, 2020 •

edited by jangorecki

Loading

jangorecki commented Jan 17, 2020

MichaelChirico Feb 2, 2020

jangorecki Feb 3, 2020

jangorecki commented Feb 5, 2020 •

edited

Loading

jangorecki commented Feb 5, 2020 •

edited

Loading

jangorecki commented Feb 11, 2020 •

edited

Loading

ColeMiller1 Feb 11, 2020

jangorecki Feb 12, 2020

MichaelChirico Feb 16, 2020 •

edited

Loading

MichaelChirico Feb 16, 2020

jangorecki commented Feb 17, 2020

ColeMiller1 commented Feb 17, 2020

ColeMiller1 Feb 20, 2020

MichaelChirico Feb 20, 2020

MichaelChirico Feb 20, 2020

ColeMiller1 Feb 20, 2020

jangorecki Feb 21, 2020

ColeMiller1 Mar 2, 2020

jangorecki Mar 2, 2020

columns selection helper for .SDcols, j expr #4174

columns selection helper for .SDcols, j expr #4174

Conversation

jangorecki commented Jan 13, 2020 • edited Loading

ColeMiller1 commented Jan 16, 2020 • edited by jangorecki Loading

Names that are within the data.table

Quoted names that are within the data.table

numbers

character vector char_cols = c("V1", "V3")

character vector V1 = c("V1", "V3")

int_cols = c(1,3)

lgl_cols = c(TRUE, TRUE, FALSE, FALSE, FALSE)

lgl_cols = c(TRUE, FALSE)

patterns

function

calls; char_cols = c("V1", "V3")

jangorecki commented Jan 17, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jangorecki commented Feb 5, 2020 • edited Loading

jangorecki commented Feb 5, 2020 • edited Loading

jangorecki commented Feb 11, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MichaelChirico Feb 16, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jangorecki commented Feb 17, 2020

ColeMiller1 commented Feb 17, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jangorecki commented Jan 13, 2020 •

edited

Loading

ColeMiller1 commented Jan 16, 2020 •

edited by jangorecki

Loading

jangorecki commented Feb 5, 2020 •

edited

Loading

jangorecki commented Feb 5, 2020 •

edited

Loading

jangorecki commented Feb 11, 2020 •

edited

Loading

MichaelChirico Feb 16, 2020 •

edited

Loading