Inconsistent behavior when combining GForce, non-GForce functions in single j expression #5554

berg-michael · 2022-12-05T19:19:50Z

It seems likely to me that this behavior has already been reported, but I was unable to find an issue for it.

Basically, it appears that because gmax and max handle NAs differently, two operations in j that each work on their own (here, calls to max and any) can throw an error when used together to create the same two columns in a single j. I have a minimal example below.

My understanding is that because any is not GForce optimized, when it appears in j alongside max, the standard max function is called. For a group with no non-NA values this returns -Inf, a double; for all other groups it returns an integer. Meanwhile, gmax seems to recognize this problem and coerces the integer groups to double.

I think this could lead to confusion, as the output of one function is determined by the presence of another.

library(data.table)
options(datatable.verbose = TRUE)

dt <- data.table(group = c("a", "b", "c"),
                 var1 = c(1L, NA, 2L),
                 var2 = c(F, F, F))
# Works
dt[, .(max_var1 = max(var1, na.rm = T)), group]
#> Detected that j uses these columns: var1 
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as 'list(max(var1, na.rm = T))'
#> GForce optimized j to 'list(gmax(var1, na.rm = TRUE))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#> gforce assign high and low took 0.000
#> Warning in gmax(var1, na.rm = TRUE): No non-missing values found in at least
#> one group. Coercing to numeric type and returning 'Inf' for such groups to be
#> consistent with base
#> gforce eval took 0.000
#> 0.001s elapsed (0.000s cpu)
#>    group max_var1
#> 1:     a        1
#> 2:     b     -Inf
#> 3:     c        2
dt[, .(any_var2 = any(var2, na.rm = T)), group]
#> Detected that j uses these columns: var2 
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as 'list(any(var2, na.rm = T))'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ... 
#>   memcpy contiguous groups took 0.000s for 3 groups
#>   eval(j) took 0.000s for 3 calls
#> 0.000s elapsed (0.000s cpu)
#>    group any_var2
#> 1:     a    FALSE
#> 2:     b    FALSE
#> 3:     c    FALSE

# Breaks
dt[, .(max_var1 = max(var1, na.rm = T),
       any_var2 = any(var2, na.rm = T)),
   group]
#> Detected that j uses these columns: var1,var2 
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as 'list(max(var1, na.rm = T), any(var2, na.rm = T))'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ...
#> Warning in max(var1, na.rm = T): no non-missing arguments to max; returning -Inf
#> Error in `[.data.table`(dt, , .(max_var1 = max(var1, na.rm = T), any_var2 = any(var2, : Column 1 of result for group 2 is type 'double' but expecting type 'integer'. Column types must be consistent for each group.

# Works
dt[, .(max_var1 = max(var1, na.rm = T),
       max_var2 = max(var2, na.rm = T)),
   group]
#> Detected that j uses these columns: var1,var2 
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as 'list(max(var1, na.rm = T), max(var2, na.rm = T))'
#> GForce optimized j to 'list(gmax(var1, na.rm = TRUE), gmax(var2, na.rm = TRUE))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#> gforce assign high and low took 0.000
#> Warning in gmax(var1, na.rm = TRUE): No non-missing values found in at least
#> one group. Coercing to numeric type and returning 'Inf' for such groups to be
#> consistent with base
#> gforce eval took 0.000
#> 0.000s elapsed (0.000s cpu)
#>    group max_var1 max_var2
#> 1:     a        1        0
#> 2:     b     -Inf        0
#> 3:     c        2        0

# Without GForce optimization, original command breaks
options(datatable.optimize=0L)
dt[, .(max_var1 = max(var1, na.rm = T)), group]
#> Detected that j uses these columns: var1 
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> All optimizations are turned off
#> Making each group and running j (GForce FALSE) ...
#> Warning in max(var1, na.rm = T): no non-missing arguments to max; returning -Inf
#> Error in `[.data.table`(dt, , .(max_var1 = max(var1, na.rm = T)), group): Column 1 of result for group 2 is type 'double' but expecting type 'integer'. Column types must be consistent for each group.

sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.14.4
#> 
#> loaded via a namespace (and not attached):
#>  [1] withr_2.5.0     digest_0.6.30   lifecycle_1.0.3 magrittr_2.0.3 
#>  [5] reprex_2.0.2    evaluate_0.17   highr_0.9       stringi_1.7.8  
#>  [9] rlang_1.0.6     cli_3.4.1       rstudioapi_0.14 fs_1.5.2       
#> [13] rmarkdown_2.17  tools_4.2.2     stringr_1.4.1   glue_1.6.2     
#> [17] xfun_0.34       yaml_2.3.6      fastmap_1.1.0   compiler_4.2.2 
#> [21] htmltools_0.5.3 knitr_1.40

Created on 2022-12-05 with reprex v2.0.2
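
The type mismatch behind the error above can be seen in base R alone, without data.table. The following is a minimal sketch; the variable names are illustrative only. It shows that max on an integer vector returns an integer when at least one non-NA value remains, but falls back to -Inf (a double) when nothing is left after NA removal, which is what produces inconsistent column types across groups.

```r
# Base R only; no data.table needed.
x_all_na <- c(NA_integer_)            # like group "b": no non-NA values
x_some   <- c(1L, NA_integer_, 2L)    # at least one non-NA value

# max() warns and returns -Inf when nothing is left after removing NAs
m_empty <- suppressWarnings(max(x_all_na, na.rm = TRUE))
m_some  <- max(x_some, na.rm = TRUE)

typeof(m_empty)  # "double"
typeof(m_some)   # "integer"
```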


jangorecki · 2022-12-05T20:33:25Z

Thank you for the report. It would be nice to confirm that it is reproducible on the master branch.

berg-michael · 2022-12-05T20:45:07Z

It appears to be reproducible on the master branch.

library(data.table)
data.table::update_dev_pkg()
#> R data.table package is up-to-date at cb8aeff9453acec878e5ab8515cda0d302c943eb (1.14.7)
options(datatable.verbose = TRUE)

dt <- data.table(group = c("a", "b", "c"),
                 var1 = c(1L, NA, 2L),
                 var2 = c(F, F, F))
# Works
dt[, .(max_var1 = max(var1, na.rm = T)), group]
#> Argument 'by' after substitute: group
#> Detected that j uses these columns: [var1]
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.020s elapsed (0.020s cpu) 
#> lapply optimization is on, j unchanged as 'list(max(var1, na.rm = T))'
#> GForce optimized j to 'list(gmax(var1, na.rm = TRUE))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.001
#> gforce assign high and low took 0.000
#> gforce eval took 0.000
#> 0.000s elapsed (0.000s cpu)
#>     group max_var1
#>    <char>    <int>
#> 1:      a        1
#> 2:      b       NA
#> 3:      c        2
dt[, .(any_var2 = any(var2, na.rm = T)), group]
#> Argument 'by' after substitute: group
#> Detected that j uses these columns: [var2]
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as 'list(any(var2, na.rm = T))'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ... 
#>   memcpy contiguous groups took 0.000s for 3 groups
#>   eval(j) took 0.000s for 3 calls
#> 0.000s elapsed (0.000s cpu)
#>     group any_var2
#>    <char>   <lgcl>
#> 1:      a    FALSE
#> 2:      b    FALSE
#> 3:      c    FALSE

# Breaks
dt[, .(max_var1 = max(var1, na.rm = T),
       any_var2 = any(var2, na.rm = T)),
   group]
#> Argument 'by' after substitute: group
#> Detected that j uses these columns: [var1, var2]
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as 'list(max(var1, na.rm = T), any(var2, na.rm = T))'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ...
#> Warning in max(var1, na.rm = T): no non-missing arguments to max; returning -Inf
#> Error in `[.data.table`(dt, , .(max_var1 = max(var1, na.rm = T), any_var2 = any(var2, : Column 1 of result for group 2 is type 'double' but expecting type 'integer'. Column types must be consistent for each group.

# Works
dt[, .(max_var1 = max(var1, na.rm = T),
       max_var2 = max(var2, na.rm = T)),
   group]
#> Argument 'by' after substitute: group
#> Detected that j uses these columns: [var1, var2]
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as 'list(max(var1, na.rm = T), max(var2, na.rm = T))'
#> GForce optimized j to 'list(gmax(var1, na.rm = TRUE), gmax(var2, na.rm = TRUE))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.001
#> gforce assign high and low took 0.000
#> gforce eval took 0.000
#> 0.000s elapsed (0.000s cpu)
#>     group max_var1 max_var2
#>    <char>    <int>    <int>
#> 1:      a        1        0
#> 2:      b       NA        0
#> 3:      c        2        0

# Without GForce optimization, original command breaks
options(datatable.optimize=0L)
dt[, .(max_var1 = max(var1, na.rm = T)), group]
#> Argument 'by' after substitute: group
#> Detected that j uses these columns: [var1]
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> All optimizations are turned off
#> Making each group and running j (GForce FALSE) ...
#> Warning in max(var1, na.rm = T): no non-missing arguments to max; returning -Inf
#> Error in `[.data.table`(dt, , .(max_var1 = max(var1, na.rm = T)), group): Column 1 of result for group 2 is type 'double' but expecting type 'integer'. Column types must be consistent for each group.

sessionInfo()
#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19044)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.14.7
#> 
#> loaded via a namespace (and not attached):
#>  [1] withr_2.5.0     digest_0.6.29   lifecycle_1.0.3 magrittr_2.0.3 
#>  [5] reprex_2.0.2    evaluate_0.17   highr_0.9       stringi_1.7.8  
#>  [9] rlang_1.0.6     cli_3.4.1       rstudioapi_0.14 fs_1.5.2       
#> [13] rmarkdown_2.17  tools_4.2.1     stringr_1.4.1   glue_1.6.2     
#> [17] xfun_0.33       yaml_2.3.5      fastmap_1.1.0   compiler_4.2.1 
#> [21] htmltools_0.5.3 knitr_1.40

Created on 2022-12-05 with reprex v2.0.2

ben-schwen · 2022-12-05T23:18:06Z

The supposed fix is #5105 in the dev version, which fixes the GForce behavior.

berg-michael · 2022-12-06T17:47:09Z

AIUI that fix is already in the master version in the reprex above?

jangorecki · 2022-12-06T21:07:28Z

Yes, so this issue must then have a different root cause.
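
Whatever the root cause turns out to be, one possible workaround is to make the per-group result type stable by hand. The sketch below is not a fix from the data.table maintainers: safe_max is a hypothetical helper (not part of data.table) that always returns a double, and it assumes that NA rather than -Inf is an acceptable result for an all-NA group.

```r
library(data.table)

dt <- data.table(group = c("a", "b", "c"),
                 var1 = c(1L, NA, 2L),
                 var2 = c(FALSE, FALSE, FALSE))

# Hypothetical helper, not part of data.table: drops NAs and always
# returns a double, so every group yields the same column type.
safe_max <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0L) NA_real_ else as.numeric(max(x))
}

# max and any combined in one j; neither call is GForce-optimized here
res <- dt[, .(max_var1 = safe_max(var1),
              any_var2 = any(var2, na.rm = TRUE)),
          by = group]
res
```

Because safe_max is not recognized by GForce, this trades the optimization for type consistency, which may matter on large tables.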

berg-michael · 2023-04-13T19:49:02Z

Not sure if this is even worth reporting, but a parallel behavior is that, because sum and gsum behave differently, certain functions behave differently depending on whether a grouping is used.

The example below is a case where someone tries to get the mean of a set of variables including a non-numeric variable.

Without grouping, the mean is NA for the non-numeric variable and a warning is printed. With grouping, execution does not occur and an error is printed.

I don't know if this variation in sum/gsum (i.e. raising an error rather than a warning) is intentional; if so, I think this can be ignored. I personally prefer the error to the warning in base sum.

However, I thought I'd mention just in case there is a goal to make the use of GForce functions totally transparent to the end user.

library(data.table)

options(datatable.verbose = TRUE)
dt <- data.table(group = c("a", "b", "c"),
                 var1 = c(1L, 3L, 2L),
                 var2 = c(F, F, F),
                 var3 = c("red", "blue", "green"))
# No grouping: works, provides warning
dt[, lapply(.SD, mean)]
#> Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
#> returning NA

#> Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
#> returning NA
#>    group var1 var2 var3
#> 1:    NA    2    0   NA

# Grouping: doesn't work: error instead of warning
dt[, lapply(.SD, mean), group]
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization changed j from 'lapply(.SD, mean)' to 'list(mean(var1), mean(var2), mean(var3))'
#> GForce optimized j to 'list(gmean(var1), gmean(var2), gmean(var3))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#> gforce assign high and low took 0.000
#> This gsum took (narm=FALSE) ... gather took ... 0.000s
#> 0.000s
#> This gsum took (narm=FALSE) ... gather took ... 0.000s
#> 0.000s
#> This gsum took (narm=FALSE) ...
#> Error in gmean(var3): Type 'character' not supported by GForce sum (gsum). Either add the prefix base::sum(.) or turn off GForce optimization using options(datatable.optimize=1)
sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.14.8
#> 
#> loaded via a namespace (and not attached):
#>  [1] withr_2.5.0     digest_0.6.30   lifecycle_1.0.3 magrittr_2.0.3 
#>  [5] reprex_2.0.2    evaluate_0.17   highr_0.9       stringi_1.7.8  
#>  [9] rlang_1.1.0     cli_3.6.1       rstudioapi_0.14 fs_1.6.1       
#> [13] rmarkdown_2.17  tools_4.2.2     stringr_1.4.1   glue_1.6.2     
#> [17] xfun_0.38       yaml_2.3.6      fastmap_1.1.0   compiler_4.2.2 
#> [21] htmltools_0.5.5 knitr_1.40

Created on 2023-04-13 with reprex v2.0.2
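
For the grouped call above, one way to sidestep the gsum error is to restrict .SD to columns that mean can handle. This is a minimal sketch rather than a statement about intended behavior; num_cols is an illustrative name, and treating logical columns as averageable is an assumption carried over from the example.

```r
library(data.table)

dt <- data.table(group = c("a", "b", "c"),
                 var1 = c(1L, 3L, 2L),
                 var2 = c(FALSE, FALSE, FALSE),
                 var3 = c("red", "blue", "green"))

# Keep only numeric or logical columns; this drops group and var3
num_cols <- names(dt)[vapply(dt,
                             function(x) is.numeric(x) || is.logical(x),
                             logical(1))]

# The grouped mean now only sees columns it supports
res <- dt[, lapply(.SD, mean), by = group, .SDcols = num_cols]
res
```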
