Inconsistent behavior when combining GForce, non-GForce functions in single j expression #5554
Comments
Thank you for the report. It would be nice to confirm that it is reproducible on the master branch.
Appears to be reproducible on master branch.
library(data.table)
data.table::update_dev_pkg()
#> R data.table package is up-to-date at cb8aeff9453acec878e5ab8515cda0d302c943eb (1.14.7)
options(datatable.verbose = TRUE)
dt <- data.table(group = c("a", "b", "c"),
var1 = c(1L, NA, 2L),
var2 = c(F, F, F))
# Works
dt[, .(max_var1 = max(var1, na.rm = T)), group]
#> Argument 'by' after substitute: group
#> Detected that j uses these columns: [var1]
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.020s elapsed (0.020s cpu)
#> lapply optimization is on, j unchanged as 'list(max(var1, na.rm = T))'
#> GForce optimized j to 'list(gmax(var1, na.rm = TRUE))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.001
#> gforce assign high and low took 0.000
#> gforce eval took 0.000
#> 0.000s elapsed (0.000s cpu)
#> group max_var1
#> <char> <int>
#> 1: a 1
#> 2: b NA
#> 3: c 2
dt[, .(any_var2 = any(var2, na.rm = T)), group]
#> Argument 'by' after substitute: group
#> Detected that j uses these columns: [var2]
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> lapply optimization is on, j unchanged as 'list(any(var2, na.rm = T))'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ...
#> memcpy contiguous groups took 0.000s for 3 groups
#> eval(j) took 0.000s for 3 calls
#> 0.000s elapsed (0.000s cpu)
#> group any_var2
#> <char> <lgcl>
#> 1: a FALSE
#> 2: b FALSE
#> 3: c FALSE
# Breaks
dt[, .(max_var1 = max(var1, na.rm = T),
any_var2 = any(var2, na.rm = T)),
group]
#> Argument 'by' after substitute: group
#> Detected that j uses these columns: [var1, var2]
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> lapply optimization is on, j unchanged as 'list(max(var1, na.rm = T), any(var2, na.rm = T))'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ...
#> Warning in max(var1, na.rm = T): no non-missing arguments to max; returning -Inf
#> Error in `[.data.table`(dt, , .(max_var1 = max(var1, na.rm = T), any_var2 = any(var2, : Column 1 of result for group 2 is type 'double' but expecting type 'integer'. Column types must be consistent for each group.
# Works
dt[, .(max_var1 = max(var1, na.rm = T),
max_var2 = max(var2, na.rm = T)),
group]
#> Argument 'by' after substitute: group
#> Detected that j uses these columns: [var1, var2]
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> lapply optimization is on, j unchanged as 'list(max(var1, na.rm = T), max(var2, na.rm = T))'
#> GForce optimized j to 'list(gmax(var1, na.rm = TRUE), gmax(var2, na.rm = TRUE))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.001
#> gforce assign high and low took 0.000
#> gforce eval took 0.000
#> 0.000s elapsed (0.000s cpu)
#> group max_var1 max_var2
#> <char> <int> <int>
#> 1: a 1 0
#> 2: b NA 0
#> 3: c 2 0
# Without GForce optimization, original command breaks
options(datatable.optimize=0L)
dt[, .(max_var1 = max(var1, na.rm = T)), group]
#> Argument 'by' after substitute: group
#> Detected that j uses these columns: [var1]
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> All optimizations are turned off
#> Making each group and running j (GForce FALSE) ...
#> Warning in max(var1, na.rm = T): no non-missing arguments to max; returning -Inf
#> Error in `[.data.table`(dt, , .(max_var1 = max(var1, na.rm = T)), group): Column 1 of result for group 2 is type 'double' but expecting type 'integer'. Column types must be consistent for each group.
sessionInfo()
#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19044)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.utf8
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] data.table_1.14.7
#>
#> loaded via a namespace (and not attached):
#> [1] withr_2.5.0 digest_0.6.29 lifecycle_1.0.3 magrittr_2.0.3
#> [5] reprex_2.0.2 evaluate_0.17 highr_0.9 stringi_1.7.8
#> [9] rlang_1.0.6 cli_3.4.1 rstudioapi_0.14 fs_1.5.2
#> [13] rmarkdown_2.17 tools_4.2.1 stringr_1.4.1 glue_1.6.2
#> [17] xfun_0.33 yaml_2.3.5 fastmap_1.1.0 compiler_4.2.1
#> [21] htmltools_0.5.3 knitr_1.40
Created on 2022-12-05 with reprex v2.0.2
The supposed fix is #5105 in dev version which fixes the
AIUI that fix is already in the master version in the reprex above?
Yes, so this issue must then have a different root cause.
Not sure if this is even worth reporting, but a parallel kind of behavior is that, due to differences between sum and gsum, certain functions behave differently when a grouping is used. The example below is a case where someone tries to get the mean of a set of variables including a non-numeric variable. Without grouping, the mean of the non-numeric column is NA with a warning; with grouping, gmean throws an error.
I don't know if this variation in sum/gsum (i.e. providing an error rather than a warning) is intentional; if so, I think this can be ignored. I personally prefer the error to the warning in base sum. However, I thought I'd mention it just in case there is a goal to make the use of GForce functions totally transparent to the end user.
library(data.table)
options(datatable.verbose = TRUE)
dt <- data.table(group = c("a", "b", "c"),
var1 = c(1L, 3L, 2L),
var2 = c(F, F, F),
var3 = c("red", "blue", "green"))
# No grouping: works, provides warning
dt[, lapply(.SD, mean)]
#> Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
#> returning NA
#> Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
#> returning NA
#> group var1 var2 var3
#> 1: NA 2 0 NA
# Grouping: doesn't work: error instead of warning
dt[, lapply(.SD, mean), group]
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> lapply optimization changed j from 'lapply(.SD, mean)' to 'list(mean(var1), mean(var2), mean(var3))'
#> GForce optimized j to 'list(gmean(var1), gmean(var2), gmean(var3))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#> gforce assign high and low took 0.000
#> This gsum took (narm=FALSE) ... gather took ... 0.000s
#> 0.000s
#> This gsum took (narm=FALSE) ... gather took ... 0.000s
#> 0.000s
#> This gsum took (narm=FALSE) ...
#> Error in gmean(var3): Type 'character' not supported by GForce sum (gsum). Either add the prefix base::sum(.) or turn off GForce optimization using options(datatable.optimize=1)
sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] data.table_1.14.8
#>
#> loaded via a namespace (and not attached):
#> [1] withr_2.5.0 digest_0.6.30 lifecycle_1.0.3 magrittr_2.0.3
#> [5] reprex_2.0.2 evaluate_0.17 highr_0.9 stringi_1.7.8
#> [9] rlang_1.1.0 cli_3.6.1 rstudioapi_0.14 fs_1.6.1
#> [13] rmarkdown_2.17 tools_4.2.2 stringr_1.4.1 glue_1.6.2
#> [17] xfun_0.38 yaml_2.3.6 fastmap_1.1.0 compiler_4.2.2
#> [21] htmltools_0.5.5 knitr_1.40
Created on 2023-04-13 with reprex v2.0.2
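For anyone hitting the mean/gmean difference above, here is a minimal sketch of the base behavior and one possible way around the error. It assumes the simplest workaround is to restrict .SD to the numeric and logical columns with .SDcols; the column names are the ones from the example above, and the data.table part only runs if the package is installed.

```r
# Base R: mean() on a character vector warns and returns NA (a double).
res <- withCallingHandlers(
  mean(c("red", "blue", "green")),
  warning = function(w) {
    message("caught warning: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
is.na(res)

# Possible workaround (an assumption, not an official recommendation):
# keep the character column out of .SD so gmean never sees it.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- data.table(group = c("a", "b", "c"),
                   var1 = c(1L, 3L, 2L),
                   var2 = c(FALSE, FALSE, FALSE),
                   var3 = c("red", "blue", "green"))
  print(dt[, lapply(.SD, mean), by = group, .SDcols = c("var1", "var2")])
}
```

This does not make the grouped and ungrouped calls behave the same; it only avoids passing a character column to the GForce-optimized mean.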
It seems likely to me that this behavior has already been reported, but I was unable to find an issue for it.
Basically, it appears that differences in how gmax handles NAs vs. how max handles NAs mean that two operations in j which perform fine on their own (in this case, calls to max and any) might throw an error when called together to create the same two columns in a single j. I have a minimal example below.
My understanding is that because any is not GForce optimized, when it appears in j with max, we will call the standard max function. In the case where no members of the group have non-NA values this will return -Inf, a double; and for all other cases it will return an integer. Meanwhile, gmax seems to recognize this problem and coerce the integer groups to double.
I think this could lead to confusion as the output of one function is determined by the presence of another.
Created on 2022-12-05 with reprex v2.0.2
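The type mismatch described above can be seen in base R alone, without data.table. The sketch below also includes a small helper, max_or_na, which is a hypothetical name introduced here for illustration and not part of data.table; it assumes the goal is simply to get the same return type for every group of an integer or double column.

```r
# Base R: the return type of max() depends on whether any non-NA value exists.
typeof(max(c(1L, 2L), na.rm = TRUE))                      # "integer"
typeof(suppressWarnings(max(NA_integer_, na.rm = TRUE)))  # "double", the value is -Inf

# Hypothetical helper (not a data.table function): return an NA of the
# input's own type when every value is missing, otherwise the usual max.
# Intended for integer or double input.
max_or_na <- function(x) {
  if (all(is.na(x))) return(x[NA_integer_])  # typed NA of length 1
  max(x, na.rm = TRUE)
}

max_or_na(c(1L, NA, 3L))
max_or_na(NA_integer_)

# Usage in the failing call from the report, only if data.table is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- data.table(group = c("a", "b", "c"),
                   var1 = c(1L, NA, 2L),
                   var2 = c(FALSE, FALSE, FALSE))
  print(dt[, .(max_var1 = max_or_na(var1),
               any_var2 = any(var2, na.rm = TRUE)),
           by = group])
}
```

Since max_or_na is not GForce optimized either, every group goes through the same non-GForce path and returns an integer, so the column types stay consistent regardless of what else appears in j.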