melt(na.rm=TRUE) should remove rows with missing list column #5053

tdhock · 2021-06-25T15:38:00Z

I found another edge case worth fixing for melt on data with a missing list column. Here is a data table with "missing" input columns (l_2 and num_3).

> (DT_missing_l_2 <- data.table(num_1=1, num_2=2, l_1=list(1), l_3=list(3)))
   num_1 num_2    l_1    l_3
   <num> <num> <list> <list>
1:     1     2      1      3

When we melt with na.rm=F, the third row correctly contains NA in the num column, but the second row contains NULL in the list column (NULL is NOT treated as missing, which is inconsistent with other types).

> (melt.F <- melt(DT_missing_l_2, measure.vars=measure(value.name, char), na.rm=F))
     char   num      l
   <char> <num> <list>
1:      1     1      1
2:      2     2       
3:      3    NA      3
> str(melt.F)
Classes ‘data.table’ and 'data.frame':	3 obs. of  3 variables:
 $ char: chr  "1" "2" "3"
 $ num : num  1 2 NA
 $ l   :List of 3
  ..$ : num 1
  ..$ : NULL
  ..$ : num 3
 - attr(*, ".internal.selfref")=<externalptr>
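
The same distinction can be seen in base R, which may help explain why the NULL cell is not picked up as missing. This is only a sketch with plain lists rather than data.table, and the variable names are made up for illustration:

```r
# Two plain lists standing in for a list column with one "missing" cell.
cells_null <- list(1, NULL, 3)  # missing cell stored as NULL
cells_na   <- list(1, NA, 3)    # missing cell stored as a logical NA scalar

# is.na() on a list flags only length-1 atomic elements that are NA,
# so the NULL cell is not reported as missing.
is.na(cells_null)
is.na(cells_na)
```

Only the second list has a cell that `is.na()` reports as missing.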

Then when we melt with na.rm=T, we see that the third row is removed (correct) but the second row is kept (incorrect):

> (melt.T <- melt(DT_missing_l_2, measure.vars=measure(value.name, char), na.rm=T))
     char   num      l
   <char> <num> <list>
1:      1     1      1
2:      2     2

I will investigate a fix.

tdhock · 2021-06-25T16:29:15Z

Hi again, my fix involves a small change to writeNA in assign.c: a logical NA scalar is used instead of NULL/R_NilValue for each list/VECSXP element. The example code above now yields:

> (DT_missing_l_2 <- data.table(num_1=1, num_2=2, l_1=list(1), l_3=list(3)))
   num_1 num_2    l_1    l_3
   <num> <num> <list> <list>
1:     1     2      1      3
> (melt.F <- melt(DT_missing_l_2, measure.vars=measure(value.name, char), na.rm=F))
     char   num      l
   <char> <num> <list>
1:      1     1      1
2:      2     2     NA
3:      3    NA      3
> (melt.T <- melt(DT_missing_l_2, measure.vars=measure(value.name, char), na.rm=T))
     char   num      l
   <char> <num> <list>
1:      1     1      1

Several tests were affected by this (plonking an empty list column, etc). Previously they expected NULL, but I had to change that to NA. Is that OK?

> melt.T[, plonk := list()][]
     char   num      l  plonk
   <char> <num> <list> <list>
1:      1     1      1     NA

mattdowle · 2021-06-26T14:53:09Z

inst/tests/tests.Rraw

 # Test plonk list variable (to catch deparse treating j=list() specially)
 x = list(2,"b",2.718)
-test(470, DT[,baz:=x], data.table(a=1:3,b=4:6,foo=list(),bar=list(1,"a",3.14),baz=list(2,"b",2.718)))
+test(470, DT[,baz:=x], data.table(a=1:3,b=4:6,foo=list(NA,NA,NA),bar=list(1,"a",3.14),baz=list(2,"b",2.718)))


The change to melt looks good and I was about to merge. But then, yes as you mentioned, these test changes to list columns in DT[...] too are causing me to hesitate. If we go ahead then the news item would need to mention it in an example in an R code block, in a potentially-breaking-changes (PBC) section at the top, and we'd bump to 1.15.0 to convey the PBC. Maybe also a consultation exercise first on Twitter to gauge user opinion. I suspect the outcome would be positive to go ahead. I can't think of any downsides really other than it being a change (caveat internals, see comment below).
dev is currently passing all 1,075 revdeps (I did a full rerun yesterday) so we could release this one soon (1.14.2) and then this PR would go into dev to become 1.15.0. I could also do a revdep rerun with this PR and see which (if any) revdeps break; either way it would be useful to know.

tdhock · 2021-06-27

sure, either way that sounds good.

mattdowle · 2021-06-26T15:20:59Z

src/assign.c

+      SEXP na_scalar = allocVector(LGLSXP, 1);
+      LOGICAL(na_scalar)[0] = NA_LOGICAL;
+      SET_VECTOR_ELT(v, i, na_scalar);
+    }


I was going to suggest, if we go ahead with this change, using R_LogicalNAValue here. That's R's internal constant object to save the multiple allocations (R has R_TrueValue and R_FalseValue too). However, then I remembered packages don't have access to those internals. We can use ScalarLogical(NA_LOGICAL), though, and that returns R_LogicalNAValue without allocating. So this loop would become:

SEXP na = ScalarLogical(NA_LOGICAL); for (int i=from; i<=to; ++i) SET_VECTOR_ELT(v, i, na);

However, let's say := or set was then used to write FALSE into some of those list column cells containing NA. That would have the potential to change R's global constant (that has happened before).
However2, we're talking about items within a list column cell here, and I suspect := and set don't have the ability to update that nested level. What happened before was with 1-row data.tables containing a logical column: that one-row logical column could be R_[True|False|LogicalNA]Value and a := on such a one-row column did corrupt R's internal value. So we catch that now by reallocating 1-row logical columns.
However3, if := and set don't already detect and prevent writing to R_[True|False|LogicalNA]Value, we should put that in regardless (they could then either change the pointer rather than the contents of the length-1 logical, or they could allocate a new length-1 logical if an attribute is being attached by the user).
So, in summary, after writing out loud here, I would rule out this loop as it stands now because allocating all those length-1 objects could gobble memory in large cases. That is maybe why we used empty list (R_NilValue). If we go ahead with the NULL to NA change, we could review set() and := w.r.t. length-1 logicals, and if ok, use ScalarLogical(NA_LOGICAL) here.
Maybe even the recent att="t" attached to R_FalseValue problem could be resolved in a different more robust way inside set/:= rather than what I did.

tdhock · 2021-06-27

I was thinking about the potential overhead of allocating a lot of length-1 logical vectors as well. I agree that using R_LogicalNAValue seems like a great, more efficient alternative. I did not know that := and set could change that, but yes, I agree that seems like something := / set should detect and handle, and that should not stop us from using more efficient code here.

mattdowle · 2021-06-26T16:27:54Z

I merged #5054 into main, and then merged main into this PR, so #5054 is included in this branch (I noticed it had a writeNA).
Starting a full rerun of revdeps now on this branch ...

mattdowle · 2021-06-26T23:54:25Z

Just 2 CRAN revdeps would fail: bbotk and eplusr.
(tidytransit warning is unrelated and happens currently on CRAN with 1.14.0)

CRAN:
 ERROR   :    2 : bbotk eplusr 
 WARNING :    1 : tidytransit 
 NOTE    :  357 
 OK      :  715 
 TOTAL   : 1075 / 1075

bbotk_0.3.2.tar.gz
./bbotk.Rcheck/00check.log:* using log directory ‘/home/mdowle/build/revdeplib/bbotk.Rcheck’
./bbotk.Rcheck/00check.log:* using R version 4.0.3 (2020-10-10)
./bbotk.Rcheck/00check.log:* using platform: x86_64-pc-linux-gnu (64-bit)
./bbotk.Rcheck/00check.log:* using session charset: UTF-8
./bbotk.Rcheck/00check.log:* checking for file ‘bbotk/DESCRIPTION’ ... OK
./bbotk.Rcheck/00check.log:* this is package ‘bbotk’ version ‘0.3.2’
./bbotk.Rcheck/00check.log:* package encoding: UTF-8
./bbotk.Rcheck/00check.log:* checking package namespace information ... OK
./bbotk.Rcheck/00check.log:* checking package dependencies ... OK
./bbotk.Rcheck/00check.log:* checking if this is a source package ... OK
./bbotk.Rcheck/00check.log:* checking if there is a namespace ... OK
./bbotk.Rcheck/00check.log:* checking for executable files ... OK
./bbotk.Rcheck/00check.log:* checking for hidden files and directories ... OK
./bbotk.Rcheck/00check.log:* checking for portable file names ... OK
./bbotk.Rcheck/00check.log:* checking for sufficient/correct file permissions ... OK
./bbotk.Rcheck/00check.log:* checking whether package ‘bbotk’ can be installed ... OK
./bbotk.Rcheck/00check.log:* checking installed package size ... OK
./bbotk.Rcheck/00check.log:* checking package directory ... OK
./bbotk.Rcheck/00check.log:* checking ‘build’ directory ... OK
./bbotk.Rcheck/00check.log:* checking DESCRIPTION meta-information ... OK
./bbotk.Rcheck/00check.log:* checking top-level files ... OK
./bbotk.Rcheck/00check.log:* checking for left-over files ... OK
./bbotk.Rcheck/00check.log:* checking index information ... OK
./bbotk.Rcheck/00check.log:* checking package subdirectories ... OK
./bbotk.Rcheck/00check.log:* checking R files for non-ASCII characters ... OK
./bbotk.Rcheck/00check.log:* checking R files for syntax errors ... OK
./bbotk.Rcheck/00check.log:* checking whether the package can be loaded ... OK
./bbotk.Rcheck/00check.log:* checking whether the package can be loaded with stated dependencies ... OK
./bbotk.Rcheck/00check.log:* checking whether the package can be unloaded cleanly ... OK
./bbotk.Rcheck/00check.log:* checking whether the namespace can be loaded with stated dependencies ... OK
./bbotk.Rcheck/00check.log:* checking whether the namespace can be unloaded cleanly ... OK
./bbotk.Rcheck/00check.log:* checking loading without being on the library search path ... OK
./bbotk.Rcheck/00check.log:* checking dependencies in R code ... OK
./bbotk.Rcheck/00check.log:* checking S3 generic/method consistency ... OK
./bbotk.Rcheck/00check.log:* checking replacement functions ... OK
./bbotk.Rcheck/00check.log:* checking foreign function calls ... OK
./bbotk.Rcheck/00check.log:* checking R code for possible problems ... OK
./bbotk.Rcheck/00check.log:* checking Rd files ... OK
./bbotk.Rcheck/00check.log:* checking Rd metadata ... OK
./bbotk.Rcheck/00check.log:* checking Rd cross-references ... OK
./bbotk.Rcheck/00check.log:* checking for missing documentation entries ... OK
./bbotk.Rcheck/00check.log:* checking for code/documentation mismatches ... OK
./bbotk.Rcheck/00check.log:* checking Rd \usage sections ... OK
./bbotk.Rcheck/00check.log:* checking Rd contents ... OK
./bbotk.Rcheck/00check.log:* checking for unstated dependencies in examples ... OK
./bbotk.Rcheck/00check.log:* checking line endings in C/C++/Fortran sources/headers ... OK
./bbotk.Rcheck/00check.log:* checking compiled code ... OK
./bbotk.Rcheck/00check.log:* checking installed files from ‘inst/doc’ ... OK
./bbotk.Rcheck/00check.log:* checking files in ‘vignettes’ ... OK
./bbotk.Rcheck/00check.log:* checking examples ... OK
./bbotk.Rcheck/00check.log:* checking for unstated dependencies in ‘tests’ ... OK
./bbotk.Rcheck/00check.log:* checking tests ... ERROR
./bbotk.Rcheck/00check.log:  Running ‘testthat.R’
./bbotk.Rcheck/00check.log:Running the tests in ‘tests/testthat.R’ failed.
./bbotk.Rcheck/00check.log:Last 13 lines of output:
./bbotk.Rcheck/00check.log:  ── Failure (test_OptimInstanceSingleCrit.R:64:3): OptimInstance works with extras input ──
./bbotk.Rcheck/00check.log:  inst$archive$data$extra3[1:3] (`actual`) not equal to list(NULL, NULL, NULL) (`expected`).
./bbotk.Rcheck/00check.log:  
./bbotk.Rcheck/00check.log:  `actual[[1]]` is a logical vector (NA)
./bbotk.Rcheck/00check.log:  `expected[[1]]` is NULL
./bbotk.Rcheck/00check.log:  
./bbotk.Rcheck/00check.log:  `actual[[2]]` is a logical vector (NA)
./bbotk.Rcheck/00check.log:  `expected[[2]]` is NULL
./bbotk.Rcheck/00check.log:  
./bbotk.Rcheck/00check.log:  `actual[[3]]` is a logical vector (NA)
./bbotk.Rcheck/00check.log:  `expected[[3]]` is NULL
./bbotk.Rcheck/00check.log:  
./bbotk.Rcheck/00check.log:  [ FAIL 1 | WARN 6 | SKIP 1 | PASS 386 ]
./bbotk.Rcheck/00check.log:  Error: Test failures
./bbotk.Rcheck/00check.log:  Execution halted
./bbotk.Rcheck/00check.log:* checking for unstated dependencies in vignettes ... OK
./bbotk.Rcheck/00check.log:* checking package vignettes in ‘inst/doc’ ... OK
./bbotk.Rcheck/00check.log:* checking running R code from vignettes ... NONE
./bbotk.Rcheck/00check.log:  ‘bbotk.Rmd’ using ‘UTF-8’... OK
./bbotk.Rcheck/00check.log:* checking re-building of vignette outputs ... OK
./bbotk.Rcheck/00check.log:* checking PDF version of manual ... OK
./bbotk.Rcheck/00check.log:* DONE
./bbotk.Rcheck/00check.log:Status: 1 ERROR


eplusr_0.14.2.tar.gz
./eplusr.Rcheck/00check.log:* using log directory ‘/home/mdowle/build/revdeplib/eplusr.Rcheck’
./eplusr.Rcheck/00check.log:* using R version 4.0.3 (2020-10-10)
./eplusr.Rcheck/00check.log:* using platform: x86_64-pc-linux-gnu (64-bit)
./eplusr.Rcheck/00check.log:* using session charset: UTF-8
./eplusr.Rcheck/00check.log:* checking for file ‘eplusr/DESCRIPTION’ ... OK
./eplusr.Rcheck/00check.log:* this is package ‘eplusr’ version ‘0.14.2’
./eplusr.Rcheck/00check.log:* package encoding: UTF-8
./eplusr.Rcheck/00check.log:* checking package namespace information ... OK
./eplusr.Rcheck/00check.log:* checking package dependencies ... OK
./eplusr.Rcheck/00check.log:* checking if this is a source package ... OK
./eplusr.Rcheck/00check.log:* checking if there is a namespace ... OK
./eplusr.Rcheck/00check.log:* checking for executable files ... OK
./eplusr.Rcheck/00check.log:* checking for hidden files and directories ... OK
./eplusr.Rcheck/00check.log:* checking for portable file names ... OK
./eplusr.Rcheck/00check.log:* checking for sufficient/correct file permissions ... OK
./eplusr.Rcheck/00check.log:* checking whether package ‘eplusr’ can be installed ... OK
./eplusr.Rcheck/00check.log:* checking installed package size ... OK
./eplusr.Rcheck/00check.log:* checking package directory ... OK
./eplusr.Rcheck/00check.log:* checking DESCRIPTION meta-information ... OK
./eplusr.Rcheck/00check.log:* checking top-level files ... OK
./eplusr.Rcheck/00check.log:* checking for left-over files ... OK
./eplusr.Rcheck/00check.log:* checking index information ... OK
./eplusr.Rcheck/00check.log:* checking package subdirectories ... OK
./eplusr.Rcheck/00check.log:* checking R files for non-ASCII characters ... OK
./eplusr.Rcheck/00check.log:* checking R files for syntax errors ... OK
./eplusr.Rcheck/00check.log:* checking whether the package can be loaded ... OK
./eplusr.Rcheck/00check.log:* checking whether the package can be loaded with stated dependencies ... OK
./eplusr.Rcheck/00check.log:* checking whether the package can be unloaded cleanly ... OK
./eplusr.Rcheck/00check.log:* checking whether the namespace can be loaded with stated dependencies ... OK
./eplusr.Rcheck/00check.log:* checking whether the namespace can be unloaded cleanly ... OK
./eplusr.Rcheck/00check.log:* checking loading without being on the library search path ... OK
./eplusr.Rcheck/00check.log:* checking dependencies in R code ... OK
./eplusr.Rcheck/00check.log:* checking S3 generic/method consistency ... OK
./eplusr.Rcheck/00check.log:* checking replacement functions ... OK
./eplusr.Rcheck/00check.log:* checking foreign function calls ... OK
./eplusr.Rcheck/00check.log:* checking R code for possible problems ... OK
./eplusr.Rcheck/00check.log:* checking Rd files ... OK
./eplusr.Rcheck/00check.log:* checking Rd metadata ... OK
./eplusr.Rcheck/00check.log:* checking Rd cross-references ... OK
./eplusr.Rcheck/00check.log:* checking for missing documentation entries ... OK
./eplusr.Rcheck/00check.log:* checking for code/documentation mismatches ... OK
./eplusr.Rcheck/00check.log:* checking Rd \usage sections ... OK
./eplusr.Rcheck/00check.log:* checking Rd contents ... OK
./eplusr.Rcheck/00check.log:* checking for unstated dependencies in examples ... OK
./eplusr.Rcheck/00check.log:* checking R/sysdata.rda ... OK
./eplusr.Rcheck/00check.log:* checking examples ... ERROR
./eplusr.Rcheck/00check.log:Running examples in ‘eplusr-Ex.R’ failed
./eplusr.Rcheck/00check.log:The error most likely occurred in:
./eplusr.Rcheck/00check.log:> ### Name: is_eplus_ver
./eplusr.Rcheck/00check.log:> ### Title: Check for Idd, Idf and Epw objects
./eplusr.Rcheck/00check.log:> ### Aliases: is_eplus_ver is_idd_ver is_eplus_path is_idd is_idf
./eplusr.Rcheck/00check.log:> ###   is_iddobject is_idfobject is_epw
./eplusr.Rcheck/00check.log:> 
./eplusr.Rcheck/00check.log:> ### ** Examples
./eplusr.Rcheck/00check.log:> 
./eplusr.Rcheck/00check.log:> is_eplus_ver(8.8)
./eplusr.Rcheck/00check.log:[1] TRUE
./eplusr.Rcheck/00check.log:> is_eplus_ver(8.0)
./eplusr.Rcheck/00check.log:[1] TRUE
./eplusr.Rcheck/00check.log:> is_eplus_ver("latest", strict = FALSE)
./eplusr.Rcheck/00check.log:[1] TRUE
./eplusr.Rcheck/00check.log:> 
./eplusr.Rcheck/00check.log:> is_idd_ver("9.0.1")
./eplusr.Rcheck/00check.log:[1] TRUE
./eplusr.Rcheck/00check.log:> is_idd_ver("8.0.1")
./eplusr.Rcheck/00check.log:[1] FALSE
./eplusr.Rcheck/00check.log:> 
./eplusr.Rcheck/00check.log:> is_eplus_path("C:/EnergyPlusV9-0-0")
./eplusr.Rcheck/00check.log:[1] FALSE
./eplusr.Rcheck/00check.log:> is_eplus_path("/usr/local/EnergyPlus-9-0-1")
./eplusr.Rcheck/00check.log:[1] FALSE
./eplusr.Rcheck/00check.log:> 
./eplusr.Rcheck/00check.log:> is_idd(use_idd(8.8, download = "auto"))
./eplusr.Rcheck/00check.log:IDD v8.8.0 has not been parsed before.
./eplusr.Rcheck/00check.log:Try to locate 'Energy+.idd' in EnergyPlus v8.8.0 installation folder '/usr/local/EnergyPlus-8-8-0'.
./eplusr.Rcheck/00check.log:Failed to locate 'Energy+.idd' because EnergyPlus v8.8.0 is not available.
./eplusr.Rcheck/00check.log:Starting to download the IDD file from EnergyPlus GitHub repo...
./eplusr.Rcheck/00check.log:trying URL 'https://raw.githubusercontent.com/NREL/EnergyPlus/v9.4.0/idd/V8-8-0-Energy%2B.idd'
./eplusr.Rcheck/00check.log:Content type 'text/plain; charset=utf-8' length 4055399 bytes (3.9 MB)
./eplusr.Rcheck/00check.log:==================================================
./eplusr.Rcheck/00check.log:downloaded 3.9 MB
./eplusr.Rcheck/00check.log:EnergyPlus v8.8.0 IDD file 'V8-8-0-Energy+.idd' has been successfully downloaded into /tmp/RtmpkJ6ECM.
./eplusr.Rcheck/00check.log:IDD file found: '/tmp/RtmpkJ6ECM/V8-8-0-Energy+.idd'.
./eplusr.Rcheck/00check.log:Start parsing...
./eplusr.Rcheck/00check.log:Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
./eplusr.Rcheck/00check.log:  Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
./eplusr.Rcheck/00check.log:Calls: is_idd ... parse_field_reference_table -> [ -> [.data.table -> vecseq
./eplusr.Rcheck/00check.log:Execution halted
./eplusr.Rcheck/00check.log:* checking for unstated dependencies in ‘tests’ ... OK
./eplusr.Rcheck/00check.log:* checking tests ... ERROR
./eplusr.Rcheck/00check.log:  Running ‘testthat.R’
./eplusr.Rcheck/00check.log:Running the tests in ‘tests/testthat.R’ failed.
./eplusr.Rcheck/00check.log:Last 13 lines of output:
./eplusr.Rcheck/00check.log:    4. │     └─eplusr:::parse_idf_file(path, idd)
./eplusr.Rcheck/00check.log:    5. │       └─eplusr:::get_idd_from_ver(idf_ver, idd)
./eplusr.Rcheck/00check.log:    6. └─eplusr::use_idd(8.8, "auto")
./eplusr.Rcheck/00check.log:    7.   └─eplusr:::read_idd(idd)
./eplusr.Rcheck/00check.log:    8.     └─Idd$new(path)
./eplusr.Rcheck/00check.log:    9.       └─eplusr:::initialize(...)
./eplusr.Rcheck/00check.log:   10.         └─eplusr:::parse_idd_file(path)
./eplusr.Rcheck/00check.log:   11.           └─eplusr:::parse_field_reference_table(dt_field)
./eplusr.Rcheck/00check.log:   12.             ├─refs[obj_fld, on = list(reference = object_list), allow.cartesian = TRUE]
./eplusr.Rcheck/00check.log:   13.             └─data.table:::`[.data.table`(...)
./eplusr.Rcheck/00check.log:   14.               └─data.table:::vecseq(...)
./eplusr.Rcheck/00check.log:  
./eplusr.Rcheck/00check.log:  [ FAIL 47 | WARN 0 | SKIP 58 | PASS 653 ]
./eplusr.Rcheck/00check.log:  Error: Test failures
./eplusr.Rcheck/00check.log:  Execution halted
./eplusr.Rcheck/00check.log:* checking PDF version of manual ... OK
./eplusr.Rcheck/00check.log:* DONE
./eplusr.Rcheck/00check.log:Status: 2 ERRORs

tdhock · 2021-06-27T01:05:43Z

wow that revdep check is very useful.
the bbotk test failure looks like an easy fix (just update expected values, I can do that).
I'm not sure about the eplusr example/test errors. @hongyuanjia you are the eplusr maintainer right? we would like to change the way data.table outputs missing list columns, and eplusr seems to fail (probably as a result of this change). Do those errors look familiar to you? Could you please try updating data.table via remotes::install_github("Rdatatable/data.table@fix-missing-list-column-na-rm") and then see if you can reproduce and fix those errors?

jangorecki · 2021-06-27T07:39:42Z

The problem is that "just 2 would fail" on CRAN, and then much more user code that is not on CRAN could break too. I think we should provide an option to allow a smoother transition, possibly keeping it disabled at first and changing the default to active later.
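
To make the idea of an opt-in transition concrete, here is a rough base R sketch. The option name `datatable.list.na` and both helper functions are hypothetical and exist only for this illustration; they are not part of data.table.

```r
# Hypothetical option name, used only for illustration; not a real data.table option.
missing_list_cell <- function() {
  # Disabled by default: keep the old NULL filler unless the user opts in.
  if (isTRUE(getOption("datatable.list.na", FALSE))) NA else NULL
}

# Build n filler cells for a list column using the current setting.
fill_missing <- function(n) {
  rep(list(missing_list_cell()), n)
}

str(fill_missing(2))               # two NULL cells while the option is unset
old <- options(datatable.list.na = TRUE)
str(fill_missing(2))               # two logical NA cells once opted in
options(old)                       # restore the previous setting
```

Flipping the default later would then be a one-line change to the `getOption()` fallback, assuming such an option were ever adopted.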

tdhock · 2021-06-28T13:45:43Z

By the way, since #5054 was merged, current master gives the following.
It now "works" for the original example with na.rm=T:

> (DT_missing_l_2 <- data.table(num_1=1, num_2=2, l_1=list(1), l_3=list(3)))
   num_1 num_2    l_1    l_3
   <num> <num> <list> <list>
1:     1     2      1      3
> (melt.T <- melt(DT_missing_l_2, measure.vars=measure(value.name, char), na.rm=T))
     char   num      l
   <char> <num> <list>
1:      1     1      1

However, it is still inconsistent with na.rm=F:

> (melt.F <- melt(DT_missing_l_2, measure.vars=measure(value.name, char), na.rm=F))
     char   num      l
   <char> <num> <list>
1:      1     1      1
2:      2     2       
3:      3    NA      3
> na.omit(melt.F)
     char   num      l
   <char> <num> <list>
1:      1     1      1
2:      2     2

Again, for consistency, the NULL should be changed to NA in the list column, so the result of this na.omit would be the same as above with na.rm=T.

hongyuanjia · 2021-06-29T02:16:13Z

This PR also changes the default behaviour of data.table::dcast.data.table() for list columns. I did not find any option to revert to the original behavior.

library(data.table) # CRAN
options(datatable.print.class = TRUE)

dt1 <- data.table(
    id = 1:3,
    x = c("a", "b", "c"),
    y = list("A", "B", "C")
)

(dt2 <- dcast.data.table(dt1, id ~ x, value.var = "y"))
#>       id      a      b      c
#>    <int> <list> <list> <list>
#> 1:     1      A              
#> 2:     2             B       
#> 3:     3                    C

str(dt2)
#> Classes 'data.table' and 'data.frame':   3 obs. of  4 variables:
#>  $ id: int  1 2 3
#>  $ a :List of 3
#>   ..$ : chr "A"
#>   ..$ : NULL
#>   ..$ : NULL
#>  $ b :List of 3
#>   ..$ : NULL
#>   ..$ : chr "B"
#>   ..$ : NULL
#>  $ c :List of 3
#>   ..$ : NULL
#>   ..$ : NULL
#>   ..$ : chr "C"
#>  - attr(*, ".internal.selfref")=<externalptr> 
#>  - attr(*, "sorted")= chr "id"

library(data.table) # PR
options(datatable.print.class = TRUE)

dt1 <- data.table(
    id = 1:3,
    x = c("a", "b", "c"),
    y = list("A", "B", "C")
)

(dt2 <- dcast.data.table(dt1, id ~ x, value.var = "y"))
#>       id      a      b      c
#>    <int> <list> <list> <list>
#> 1:     1      A     NA     NA
#> 2:     2     NA      B     NA
#> 3:     3     NA     NA      C

str(dt2)
#> Classes 'data.table' and 'data.frame':   3 obs. of  4 variables:
#>  $ id: int  1 2 3
#>  $ a :List of 3
#>   ..$ : chr "A"
#>   ..$ : logi NA
#>   ..$ : logi NA
#>  $ b :List of 3
#>   ..$ : logi NA
#>   ..$ : chr "B"
#>   ..$ : logi NA
#>  $ c :List of 3
#>   ..$ : logi NA
#>   ..$ : logi NA
#>   ..$ : chr "C"
#>  - attr(*, ".internal.selfref")=<externalptr> 
#>  - attr(*, "sorted")= chr "id"

tdhock · 2021-06-30T15:08:10Z

bbotk devs said they agree to make the change, mlr-org/bbotk#147 (comment)
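
For downstream packages that need their tests to pass both before and after such a change, one possible approach is a check that accepts either representation of a missing list cell. This is a sketch in base R; `is_missing_cell` and `all_missing` are hypothetical helper names, not functions from data.table or bbotk.

```r
# Hypothetical helper: TRUE if a list cell is "missing" under either convention,
# i.e. NULL (old behaviour) or a length-1 logical NA (proposed behaviour).
is_missing_cell <- function(x) {
  is.null(x) || (is.logical(x) && length(x) == 1L && is.na(x))
}

# Hypothetical helper: TRUE if every cell of a list column is missing.
all_missing <- function(cells) {
  all(vapply(cells, is_missing_cell, logical(1)))
}

all_missing(list(NULL, NULL, NULL))  # old representation
all_missing(list(NA, NA, NA))        # proposed representation
all_missing(list(NA, "a"))           # contains a real value
```

A test could then assert `all_missing(...)` on the relevant cells instead of comparing against `list(NULL, NULL, NULL)` directly.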

mattdowle · 2021-07-16T17:41:37Z

In an effort to merge this PR and make progress, I added a listNA argument to writeNA and set it to true only when called from melt. This i) passes the new melt test, ii) means that all other test changes can be reverted so there are no breaking changes, and iii) enables testing the approach of using ScalarLogical() to get to R's global constant R_LogicalNAValue that I suggested above, at least from melt, so that if there are any memory or rchk problems they are isolated to melt.

I'm now remembering that using NA instead of NULL in list columns has come up before and we decided against it at the time here: #4198 (comment). To give an example of what I had in mind there: a logical NA could be a valid non-missing entry in a list column whose cells contain logical vectors of varying lengths:

> DT = data.table(A=c(1,1,2,3,3), B=c(TRUE, NA, NA, NA, NA))
> DT[, .(list(B)), by=A]
       A        V1
   <num>    <list>
1:     1 TRUE,  NA
2:     2        NA
3:     3     NA,NA
>

That NA on the 2nd row represents a non-missing logical vector length 1 containing NA.
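
The ambiguity can also be reproduced with a plain list in base R. This is a sketch that rebuilds the cells of the V1 column above with `split()` instead of data.table; `cells` is just an illustrative name.

```r
# Same data as the example above, grouped by A into a plain list.
A <- c(1, 1, 2, 3, 3)
B <- c(TRUE, NA, NA, NA, NA)
cells <- unname(split(B, A))

lengths(cells)  # group sizes: 2, 1, 2
is.na(cells)    # only the length-1 cell is flagged
```

The middle cell is genuine data (a length-1 logical vector holding NA), yet `is.na()` flags it exactly as it would flag an NA used as a missing-cell marker, so the two cases cannot be told apart from the value alone.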

However, that's just one edge case and not the big picture. We could still change to NA in list columns: it is being requested after all. But it would need a consultation exercise with users, an option to revert, and a potentially-breaking-changes section in the news. In the meantime, adding listNA to writeNA was a way to make progress on this melt feature.

tdhock · 2021-07-18T03:09:34Z

sounds like a great plan, thanks Matt.

melt(na.rm=TRUE) should remove rows with missing list column

44d6f61

tdhock added the consistency label Jun 25, 2021

tdhock marked this pull request as draft June 25, 2021 15:39

tdhock added 2 commits June 25, 2021 12:25

use logical NA scalar instead of NULL for missing list elements

18e106e

plonk list yields NA scalar instead of NULL

83ffa5d

tdhock marked this pull request as ready for review June 25, 2021 16:29

melt outputs NA instead of NULL in rows corresponding to missing list columns

a06651b

tdhock requested a review from mattdowle June 25, 2021 18:05

tdhock added the reshape dcast melt label Jun 25, 2021

mattdowle reviewed Jun 26, 2021

Merge branch 'master' into fix-missing-list-column-na-rm

c5ef5dc

tdhock mentioned this pull request Jun 27, 2021

expect list elements are NA instead of NULL mlr-org/bbotk#147

Closed

melt(na.rm=FALSE) should return NA when input DT has missing list column

771e39a

tdhock and others added 3 commits July 1, 2021 12:15

na.rm no longer ignored for list columns

8af77e8

Merge branch 'master' into fix-missing-list-column-na-rm

be77d1a

added listNA to writeNA, true just from melt for now

44a46c8

mattdowle added this to the 1.14.1 milestone Jul 16, 2021

mattdowle merged commit 129366e into master Jul 16, 2021

mattdowle deleted the fix-missing-list-column-na-rm branch July 16, 2021 17:52

mattdowle mentioned this pull request Jul 16, 2021

dev breaks revdeps NMdata and scoringutils #5075

Closed

jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
