Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

columns selection helper for .SDcols, j expr #4174

Closed
wants to merge 33 commits into from
Closed

Conversation

jangorecki
Copy link
Member

@jangorecki jangorecki commented Jan 13, 2020

In general towards more modular [.data.table #852 and step by step push-down it to C.
In particular this PR should address #4115 (#2178), #4004, #2069, #2059 and probably some others as well... #4231, #4235
Still very much WIP, pushing now because @ColeMiller1 is also developing around those issues so good to not duplicate the same work.
Note that I will be offline this week, so take your time giving feedback.


There is one incosnsistency that we have to work out:
j now supports column selection by single symbol, unquoted column name.
.SDcols now supports columns selection by single symbol, function name.
Those cannot be easily combined, if a data.table will have a column of the same name as the function that user is trying to use then the function will be ignored (eventually .. prefix will help here). The tricky part is that to select column by symbol we do not want to evaluate symbol, but to check if the symbol is a function we have to evaluate it. Lets leave it for now. Once PR will be more mature I will come back to that issue with illustrative example.

@jangorecki jangorecki added the WIP label Jan 13, 2020
@ColeMiller1
Copy link
Contributor

ColeMiller1 commented Jan 16, 2020

Since .SDcols evaluates to ansvars and ansvals, it would be good for the function to either return a named vector or a list of a character vector and integer vector.

It would be good to bring by into the fray. Below is a compiled list - hopefully it will be helpful instead of just too much stuff:

dt = as.data.table(lapply(1:5, c))

Names that are within the data.table

type j .SDcols by notes
V1 + error + j selects vector - by keeps as list
-V1 ~ error ~ makes negative
!V1 ~ error ~ makes logical
V1:V3 + + +
-V1:V3 error error error trying to do -1:3 which is pos and neg index
-(V1:V3) + + error
!V1:V3 ~ + error j makes logical
!(V1:V3) + + error
V3:V1 + + +
-V3:V1 error error error +ve and -ve indices
-(V3:V1) + + error
!V3:V1 ~ + error j selects logical vector
!(V3:V1) + + error

Quoted names that are within the data.table

type j .SDcols by notes
'V1' + + +
-'V1' + + error
!'V1' + + error
c('V1', 'V3') + + +
-c('V1', 'V3') + + error
!c('V1', 'V3') + + error
'V1,V3' error error + recommend removing
-'V1,V3' error error error
!'V1,V3' error error error

numbers

type j .SDcols by notes
1L + + ~ by creates new unnamed list
-1L + + ~ by creates new list named "-"
!1L + + ~ by creates new logical named "!"
1 + + ~ see above
-1 + + ~ see above
!1 + + ~ see above
1:2 + + error
-1:2 error error error creates +ve and -ve indices
-(1:2) + + error
!1:2 + + error
!(1:2) + + error
c(1,2) + + error
-c(1,2) + + error
!c(1,2) + + error

character vector char_cols = c("V1", "V3")

type j .SDcols by notes
char_cols errors + + j has helpful error
-char_cols errors + errors
!char_cols errors + errors
..char_cols + errors errors
!..char_cols + errors errors
-..char_cols + errors errors

character vector V1 = c("V1", "V3")

type j .SDcols by notes
V1 ~ ~ ~ j extracts vector from dt; .SDcols selects the character vector names
-V1 ~ ~ ~ j and by become logical
!V1 ~ ~ ~ j and by become logical
..V1 + error error
-..V1 + error error
!..V1 + error error

int_cols = c(1,3)

type j .SDcols by notes
int_cols errors + errors helpful j error
-int_cols ~ + error j produces negative vector
!int_cols ~ error error j produces logical
..int_cols + errors errors
-..int_cols + errors errors
!..int_cols + errors errors

lgl_cols = c(TRUE, TRUE, FALSE, FALSE, FALSE)

type j .SDcols by notes
lgl_cols error + error j has helpful error
-lgl_cols ~ + error j returns c(-1,-1,0,0,0)
!lgl_cols ~ + error j returns !lgl_cols
..lgl_cols + error error
-..lgl_cols + error error
!..lgl_cols + error error

lgl_cols = c(TRUE, FALSE)

type j .SDcols by notes
lgl_cols error ~ error j has helpful error; .SDcols repeats vector
-lgl_cols error ~ error j has helpful error; .SDcols repeats vector
!lgl_cols error ~ error j has helpful error; .SDcols repeats vector
..lgl_cols ~ error error j selects only the first column
-..lgl_cols ~ error error j selects only the first column
!..lgl_cols ~ error error j selects only the first column

patterns

type j .SDcols by notes
patterns('V1') error + error
!patterns('V1') error + error
-patterns('V1') error + error
patterns('V1','V3') error + error
!patterns('V1','V3') error + error
-patterns('V1','V3') error + error

function

type j .SDcols by notes
is.numeric error + error j has helpful error for some reason
!is.numeric error + error
-is.numeric error + error
function(x) is.numeric(x) error + error j has helpful error for some reason
!function(x) is.numeric(x) error + error
-function(x) is.numeric(x) error + error

calls; char_cols = c("V1", "V3")

type j .SDcols by notes
c(char_cols, "V2") ~ + + j produces char vector
c(..char_cols, "V2") + error error
c(2L, lapply(.SD, '*', 2)) + error error j returns additional column

@jangorecki
Copy link
Member Author

@ColeMiller1 yes, tables are useful, good roadmap for unit tests.

src/utils.c Outdated Show resolved Hide resolved
src/utils.c Outdated Show resolved Hide resolved
src/utils.c Outdated Show resolved Hide resolved
src/utils.c Outdated Show resolved Hide resolved
src/utils.c Outdated Show resolved Hide resolved
src/utils.c Outdated Show resolved Hide resolved
@ColeMiller1 ColeMiller1 mentioned this pull request Feb 2, 2020
4 tasks
*/
SEXP replace_dot_alias(SEXP x) {
if (isLanguage(x) && !isFunction(CAR(x))) {
if (CAR(x) == sym_bquote) // handling `.` inside bquote, #1912
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice (though a bit surprising) that == works here!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe because it is not STRSXP but CHARSXP

@MichaelChirico MichaelChirico mentioned this pull request Feb 2, 2020
@jangorecki jangorecki added the NSE label Feb 4, 2020
@jangorecki
Copy link
Member Author

jangorecki commented Feb 5, 2020

just documenting current behaviour of peeling from (

> x[, c("V1","V3")]
      V1    V3
   <int> <int>
1:     1     3
> x[, ((c("V1","V3")))]
[1] "V1" "V3"
> x[, .SD, .SDcols=c("V1","V3")]
      V1    V3
   <int> <int>
1:     1     3
> x[, .SD, .SDcols=((c("V1","V3")))]
      V1    V3
   <int> <int>
1:     1     3

I think it make sense to keep it like that so extra ( can be used by user to escape our NSE dispatch to column selection. This is handled in 52fed7c#diff-74710e3ecbe94109f39de34133e1c8caR341

@jangorecki
Copy link
Member Author

jangorecki commented Feb 5, 2020

R -q -e 'cc("colselect.Rraw")'

can be used for spotting regression when doing changes for addressing new tests

@jangorecki
Copy link
Member Author

jangorecki commented Feb 11, 2020

Current behaviour of handling peeling from ( might be confusing when combined with !. ( in top call of expression in j escapes the column selection.

> x[, !("V5")]
      V1    V2    V3    V4
   <int> <int> <int> <int>
1:     1     2     3     4
> x[, ("V5")]
[1] "V5"

Just documenting it here.
If you think it should be discussed/changed, please fill a new issue.

R/data.table.R Outdated
@@ -213,8 +203,12 @@ replace_dot_alias = function(e) {
av = NULL
jsub = NULL
if (!missing(j)) {
colselect = getOption("datatable.colselect", FALSE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long would this be an option?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will write more on that when PR will be more mature. It has to be an option unfortunately, there are so many corner cases than it is very unlikely to address them maintaining backward compatibility.

@jangorecki jangorecki removed the NSE label Feb 16, 2020
are being used. That was changed recently for consistency to data.frame methods.
\item{by}{\code{character}, \code{integer} or \code{logical} vector indicating
which combinations of columns from \code{x} to use for uniqueness checks. By
default all columns are being used. That was changed recently for consistency to
Copy link
Member

@MichaelChirico MichaelChirico Feb 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably we can either remove the last sentence or at least "recently" since it's from >3 years ago:

da5974b

}
e
}
replace_dot_alias = function(e) .Call(Creplace_dot_aliasR, e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may as well move this to wrappers.R

@jangorecki
Copy link
Member Author

This PR attempts to implement C helper function to investigate j and .SDcols arguments. It guesses if column selection was intended (just selecting columns with=FALSE) or it leaves as expression to be evaluated later on in scope of data.table (with=TRUE).
It is unfinished, and as of now I am not planning to work on it further, thus closing. The complexity of handling all cases is tricky enough. Basically there other more useful things I want to allocate my time into. Anyone is welcome to pick it up. What is left is to address requirements defined by tests.

cc("colselect.Rraw")
options(datatable.colselect=TRUE); cc("tests.Rraw")

We could isolate the logic into R helper function instead. Then it will be definitely less complex. On the other hand C implementation gives a fine-grained definition of column selection logic, which is very valuable. Keep in mind that speed is not a concern as we don't operate on data.

@jangorecki jangorecki closed this Feb 17, 2020
@ColeMiller1
Copy link
Contributor

Was the .SDcols mode far enough along? If so, I can work on pushing that portion of the code for C. Otherwise, I will push an R solution - it's a lot easier for me to test and develop.

P.S., All the tests and and work you did was super impressive.

@ColeMiller1 ColeMiller1 mentioned this pull request Feb 18, 2020
1 task
peeled = true;
}
SEXP sym_brackets = install("{");
while (isLanguage(expr) && CAR(expr)==sym_brackets && length(expr)==2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jangorecki Is isLanguage equivalent to is.call()? Also, why do the brackets need the length(expr) == 2)? Is that to only peel one layer? Also, I'm not sure for etiquette on closed PRs or if you even receive notifications.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's the source for is.call

https://github.com/wch/r-source/blob/trunk/src/main/coerce.c#L2072-L2073

    case 300:		/* is.call */
	LOGICAL0(ans)[0] = TYPEOF(CAR(args)) == LANGSXP;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

almost identical to isLanguage:

https://github.com/wch/r-source/blob/trunk/src/include/Rinlinedfuns.h#L903-L906

INLINE_FUN Rboolean isLanguage(SEXP s)
{
    return (s == R_NilValue || TYPEOF(s) == LANGSXP);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it appears that for how we set up our functions, TYPEOF(s) == LANGSXP would be equivalent of is.call() since I believe args is normally a list for the C arguments. I am asking because I am interested in a more direct %iscall%. Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only { needs length check because length(quote({1; 2})) should not be peeled. ( will be always length 2.

SET_TAG(CDDDR(lapply), install("pos"));
SEXP env = eval(lapply, rho);
SEXP evaleval = lang3(install("substitute"), expr, env);
SEXP cols = eval(eval(evaleval, rho), rho); // huh...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SEXP cols = eval(eval(evaleval, rho), rho); // huh...

+1.

So, I understand we need to replace c(..cols1, ..cols2), did you get as far as c(..cols1, lapply(.SD, sum))? I see a little farther you reference the clash between the two modes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so it did worked like this already

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants