Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lapply on .SD reorder columns when using get() in .SD #4089

Closed
dmongin opened this issue Dec 4, 2019 · 8 comments · Fixed by #4212
Closed

lapply on .SD reorder columns when using get() in .SD #4089

dmongin opened this issue Dec 4, 2019 · 8 comments · Fixed by #4212
Labels
Milestone

Comments

@dmongin
Copy link

dmongin commented Dec 4, 2019

I found a strange behavior of data.table. I found a situation where lapply applied on .SD changes the order of the column, messing then when one intend to assign the result.
An example:

Here the normal behavior

library(data.table)
plouf <- data.table(x = 1, y = 2, z = 3)
cols <- c("y","x")
plouf[,.SD,.SDcols = cols ,by = z]
plouf[,lapply(.SD,function(x){x}),.SDcols = cols ,by = z]
plouf[,lapply(.SD[x == 1],function(x){x}),.SDcols = cols ,by = z]

All these lines give :

   z y x
1: 3 2 1

which I need for example to reassign to c("y","x"). But if I do:

plouf[,lapply(.SD[get("x") == 1],function(x){x}),.SDcols = c("y","x"),by = z]

   z x y
1: 3 1 2

Here the order of x and y changed without reason, when it should yield the same result as the last "working" example. If then assign the wrong values to c("y","x") if I assign the output of lapply to new vector of columns. It seems that the use of get in the i part of .SD triggers this bug.

Example of the effect of this on assignment:

plouf[, c(cols ) := lapply(.SD[get("x") == 1],function(x){x}),
      .SDcols = cols ,by = z][]
#    x y z
# 1: 2 1 3

# Output of sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=French_Switzerland.1252 LC_CTYPE=French_Switzerland.1252
[3] LC_MONETARY=French_Switzerland.1252 LC_NUMERIC=C
[5] LC_TIME=French_Switzerland.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ggplot2_3.2.1 lmerTest_3.1-0 lme4_1.1-21 Matrix_1.2-17
[5] plyr_1.8.4 lubridate_1.7.4 readxl_1.3.1 data.table_1.12.2
[9] stringr_1.4.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 cellranger_1.1.0 compiler_3.6.1
[4] pillar_1.4.2 nloptr_1.2.1 tools_3.6.1
[7] digest_0.6.21 boot_1.3-22 tibble_2.1.3
[10] gtable_0.3.0 nlme_3.1-140 lattice_0.20-38
[13] pkgconfig_2.0.3 rlang_0.4.0 rstudioapi_0.10
[16] withr_2.1.2 dplyr_0.8.3 grid_3.6.1
[19] tidyselect_0.2.5 glue_1.3.1 R6_2.4.0
[22] minqa_1.2.4 reshape2_1.4.3 purrr_0.3.2
[25] magrittr_1.5 scales_1.0.0 MASS_7.3-51.4
[28] splines_3.6.1 assertthat_0.2.1 colorspace_1.4-1
[31] numDeriv_2016.8-1.1 labeling_0.3 stringi_1.4.3
[34] lazyeval_0.2.2 munsell_0.5.0 crayon_1.3.4

@jangorecki
Copy link
Member

jangorecki commented Dec 4, 2019

@dmongin thank you for detailed report. I can reproduce it on recent devel. As a workaround you can use now substitute rather than get

var = "x"
expr = substitute(
  plouf[, c(cols) := lapply(.SD[.var == 1],function(x){x}), .SDcols = cols, by = z][],
  list(.var=as.name(var))
)
print(expr)
#plouf[, `:=`(c(cols), lapply(.SD[x == 1], function(x) {
#    x
#})), .SDcols = cols, by = z][]
eval(expr)
#   x y z
#1: 2 1 3

Personally I would use it regularly, not as a workaround, I find R metaprogramming features superior.
Also be aware that some day instead of get(var) we should be able to use ..var, #2816, #3199
R metaprogramming always worked and, I assume, will always work, thanks to the conservative backward compatible R code development.

@jangorecki jangorecki added the bug label Dec 4, 2019
@MichaelChirico
Copy link
Member

Note this is sort of covered by the verbose message:

options(datatable.verbose=TRUE)
plouf[...]
'(m)get' found in j. ansvars being set to all columns. Use .SDcols or a single j=eval(macro) instead. Both will detect the columns used which is important for efficiency.
Old: y,x 
New: x,y

The issue is the message is a bit misleading/slightly confusing. Actually you've used get in a nested [ call, whereas this behavior is designed for get alongside .SD in the same [ call.

I'm not sure how useful it is to create fully robust tree parsing for edge cases like this that are better suited by metaprogramming as Jan suggested, especially given that the order of columns shouldn't really matter all that much.

Will leave this open for now to keep the discussion going a bit.

@jangorecki
Copy link
Member

order does matter if you provide columns to assign on LHS of :=, then you end up with incorrectly labelled columns

@dmongin
Copy link
Author

dmongin commented Dec 4, 2019

@MichaelChirico as @jangorecki suggested, I discovered the problem in wrongly assigned column in my code (that I assigned on LHS with :=), and it took me a while to traceback the error. There could be cases were the user does not notice it. I find it quite dangerous, so I would vote for something robust.

I did not know about the metaprogramming, and I am not sure to see the advantage compared to get() (apart from avoiding this bug, but I find the code more hard to read and to use). If use of ..var avoid the bug, maybe the use of get() should be depreciated ?

@dmongin
Copy link
Author

dmongin commented Dec 4, 2019

@jangorecki
Copy link
Member

jangorecki commented Dec 5, 2019

@dmongin best place to get proper information on metaprogramming is official R language manual, there is a dedicated chapter to this feature: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Computing-on-the-language
Advantage is that you can construct any query you need to. If you are coding parametrized interface to some API (like data.table) you can construct parametrized queries as you would type the values of parameters by hand, and evaluate those. There are multiple packages that meant to help user to do metaprogramming in R introducing various ways to handle substitution ~ var, !!var, !!!var, {{ var }}}. I haven't use them really because base R metaprogramming works well, it is also the most reliable way, backward compatible, future proof, least prone to bugs.

@ColeMiller1
Copy link
Contributor

Looking into it, changing this:

if (any(c("get", "mget") %chin% av)) {

To:

if (any(c("get", "mget") %chin% av) && !(all(names_x %in% c(ansvars, allbyvars)) && is.null(names_i))) {

Would address this issue. Test 717 would need to be changed but all other tests check out for me when making the change.

The logic is that if all of the names of the data.table are already accounted for in our .SD and by and there are no names in i (meaning it's not a join), we wouldn't reorder the columns. Joins could be accounted for as well but it seems unlikely to have a join, :=, get(), and cols to update.

@ColeMiller1
Copy link
Contributor

I am working on a PR to refactor the code to move the m(get) statement up a little bit, per the code comment:

data.table/R/data.table.R

Lines 991 to 994 in c005296

# TODO remove as (m)get is now folded in above.
# added 'mget' - fix for #994
if (any(c("get", "mget") %chin% av)) {
if (verbose) {

What is the desired behaviour? Should there be a warning to users who use := and (m)get() or should we prevent the column switching from happening when all variables are already accounted for?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants