melt(na.rm=TRUE) should remove rows with missing list column #5053

Merged
merged 9 commits into from
Jul 16, 2021

Conversation

tdhock
Member

@tdhock tdhock commented Jun 25, 2021

I found another edge case worth fixing for melt on data with a missing list column. Here is a data table with "missing" input columns (l_2 and num_3).

> (DT_missing_l_2 <- data.table(num_1=1, num_2=2, l_1=list(1), l_3=list(3)))
   num_1 num_2    l_1    l_3
   <num> <num> <list> <list>
1:     1     2      1      3

When we melt with na.rm=F, the third row correctly contains NA in the num column, but the second row of the list column contains NULL (which is NOT treated as missing, inconsistent with other types).

> (melt.F <- melt(DT_missing_l_2, measure.vars=measure(value.name, char), na.rm=F))
     char   num      l
   <char> <num> <list>
1:      1     1      1
2:      2     2       
3:      3    NA      3
> str(melt.F)
Classes ‘data.table’ and 'data.frame':	3 obs. of  3 variables:
 $ char: chr  "1" "2" "3"
 $ num : num  1 2 NA
 $ l   :List of 3
  ..$ : num 1
  ..$ : NULL
  ..$ : num 3
 - attr(*, ".internal.selfref")=<externalptr> 

Then when we melt with na.rm=T, the third row is removed (correct) but the second row is kept (incorrect):

> (melt.T <- melt(DT_missing_l_2, measure.vars=measure(value.name, char), na.rm=T))
     char   num      l
   <char> <num> <list>
1:      1     1      1
2:      2     2       

I will investigate a fix.
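
The inconsistency described above can be reproduced without melt. The following is a minimal base R sketch (the variable names are illustrative, not taken from the PR) of why a NULL cell in a list column is not seen as missing while a length-1 logical NA cell is:

```r
# Two list "columns" that differ only in how the empty middle cell is stored.
cells_null <- list(1, NULL, 3)  # empty cell stored as NULL
cells_na   <- list(1, NA, 3)    # empty cell stored as a length-1 logical NA

# is.na() on a list is TRUE only for length-1 atomic elements that are NA,
# so a NULL cell is never reported as missing.
is.na(cells_null)  # FALSE FALSE FALSE
is.na(cells_na)    # FALSE  TRUE FALSE

# Detecting NULL cells needs an explicit per-element check instead.
null_cells <- vapply(cells_null, is.null, logical(1))
null_cells         # FALSE  TRUE FALSE
```

This only illustrates R-level semantics; it says nothing about how melt decides which rows to drop internally.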

@tdhock tdhock marked this pull request as draft June 25, 2021 15:39
@tdhock
Member Author

tdhock commented Jun 25, 2021

Hi again. My fix involves a small change to writeNA in assign.c: a logical NA scalar is used instead of NULL/R_NilValue for each list/VECSXP element. The example code above now yields:

> (DT_missing_l_2 <- data.table(num_1=1, num_2=2, l_1=list(1), l_3=list(3)))
   num_1 num_2    l_1    l_3
   <num> <num> <list> <list>
1:     1     2      1      3
> (melt.F <- melt(DT_missing_l_2, measure.vars=measure(value.name, char), na.rm=F))
     char   num      l
   <char> <num> <list>
1:      1     1      1
2:      2     2     NA
3:      3    NA      3
> (melt.T <- melt(DT_missing_l_2, measure.vars=measure(value.name, char), na.rm=T))
     char   num      l
   <char> <num> <list>
1:      1     1      1

Several tests were affected by this (plonking an empty list column, etc.). Previously they expected NULL, but I had to change that to NA. Is that OK?

> melt.T[, plonk := list()][]
     char   num      l  plonk
   <char> <num> <list> <list>
1:      1     1      1     NA

@tdhock tdhock marked this pull request as ready for review June 25, 2021 16:29
@tdhock tdhock requested a review from mattdowle June 25, 2021 18:05
@tdhock tdhock added the reshape dcast melt label Jun 25, 2021
  # Test plonk list variable (to catch deparse treating j=list() specially)
  x = list(2,"b",2.718)
- test(470, DT[,baz:=x], data.table(a=1:3,b=4:6,foo=list(),bar=list(1,"a",3.14),baz=list(2,"b",2.718)))
+ test(470, DT[,baz:=x], data.table(a=1:3,b=4:6,foo=list(NA,NA,NA),bar=list(1,"a",3.14),baz=list(2,"b",2.718)))
Member

@mattdowle mattdowle Jun 26, 2021

The change to melt looks good and I was about to merge. But then, yes, as you mentioned, these test changes to list columns in DT[...] are causing me to hesitate. If we go ahead, the news item would need to mention it with an example in an R code block, in a potentially-breaking-changes (PBC) section at the top, and we'd bump to 1.15.0 to convey the PBC. Maybe also a consultation exercise first on Twitter to gauge user opinion. I suspect the outcome would be positive to go ahead. I can't think of any downsides really, other than it being a change (caveat internals, see comment below).
dev is currently passing all 1,075 revdeps (I did a full rerun yesterday), so we could release this one soon (1.14.2) and then this PR would go into dev to become 1.15.0. I could also do a revdep rerun with this PR and see which (if any) revdeps break; either way it would be useful to know.

Member Author

sure, either way that sounds good.

src/assign.c Outdated
SEXP na_scalar = allocVector(LGLSXP, 1);
LOGICAL(na_scalar)[0] = NA_LOGICAL;
SET_VECTOR_ELT(v, i, na_scalar);
}
Member

I was going to suggest, if we go ahead with this change, using R_LogicalNAValue here. That's R's internal constant object to save the multiple allocations (R has R_TrueValue and R_FalseValue too). However, then I remembered packages don't have access to those internals. We can use ScalarLogical(NA_LOGICAL), though, and that returns R_LogicalNAValue without allocating. So this loop would become :

SEXP na = ScalarLogical(NA_LOGICAL);
for (int i=from; i<=to; ++i) SET_VECTOR_ELT(v, i, na);

However, let's say := or set was then used to write FALSE into some of those list column cells containing NA. That would have the potential to change R's global constant (that has happened before).
However2, we're talking about items within a list column cell here, and I suspect := and set don't have the ability to update that nested level. What happened before was with one-row data.tables containing a logical column: that one-row logical column could be R_[True|False|LogicalNA]Value, and a := on such a one-row column did corrupt R's internal value. So we catch that now by reallocating one-row logical columns.
However3, if := and set don't already detect and prevent writing to R_[True|False|LogicalNA]Value, we should put that in regardless (they could then either change the pointer rather than the contents of the length-1 logical, or allocate a new length-1 logical if an attribute is being attached by the user).
So, in summary, after thinking out loud here, I would rule out this loop as it stands now, because allocating all those length-1 objects could gobble memory in large cases. Which is maybe why we used empty list (R_NilValue). If we go ahead with the NULL to NA change, we could review set() and := w.r.t. length-1 logicals and, if ok, use ScalarLogical(NA_LOGICAL) here.
Maybe even the recent att="t" attached to R_FalseValue problem could be resolved in a different, more robust way inside set/:= rather than what I did.

Member Author

I was thinking about the potential overhead of allocating a lot of length-1 logical vectors as well. I agree that using R_LogicalNAValue (via ScalarLogical(NA_LOGICAL)) seems like a great, more efficient alternative. I did not know that := and set could change that, but yes, I agree that seems like something := / set should detect and handle, and that should not stop us from using more efficient code here.

@mattdowle
Member

mattdowle commented Jun 26, 2021

I merged #5054 into main, and then merged main into this PR, so #5054 is included in this branch (I noticed it had a writeNA).
Starting a full rerun of revdeps now on this branch ...

@mattdowle
Member

Just 2 CRAN revdeps would fail : bbotk and eplusr.
(tidytransit warning is unrelated and happens currently on CRAN with 1.14.0)

CRAN:
 ERROR   :    2 : bbotk eplusr 
 WARNING :    1 : tidytransit 
 NOTE    :  357 
 OK      :  715 
 TOTAL   : 1075 / 1075
bbotk_0.3.2.tar.gz
./bbotk.Rcheck/00check.log:* using log directory ‘/home/mdowle/build/revdeplib/bbotk.Rcheck’
./bbotk.Rcheck/00check.log:* using R version 4.0.3 (2020-10-10)
./bbotk.Rcheck/00check.log:* using platform: x86_64-pc-linux-gnu (64-bit)
./bbotk.Rcheck/00check.log:* using session charset: UTF-8
./bbotk.Rcheck/00check.log:* checking for file ‘bbotk/DESCRIPTION’ ... OK
./bbotk.Rcheck/00check.log:* this is package ‘bbotk’ version ‘0.3.2’
./bbotk.Rcheck/00check.log:* package encoding: UTF-8
./bbotk.Rcheck/00check.log:* checking package namespace information ... OK
./bbotk.Rcheck/00check.log:* checking package dependencies ... OK
./bbotk.Rcheck/00check.log:* checking if this is a source package ... OK
./bbotk.Rcheck/00check.log:* checking if there is a namespace ... OK
./bbotk.Rcheck/00check.log:* checking for executable files ... OK
./bbotk.Rcheck/00check.log:* checking for hidden files and directories ... OK
./bbotk.Rcheck/00check.log:* checking for portable file names ... OK
./bbotk.Rcheck/00check.log:* checking for sufficient/correct file permissions ... OK
./bbotk.Rcheck/00check.log:* checking whether package ‘bbotk’ can be installed ... OK
./bbotk.Rcheck/00check.log:* checking installed package size ... OK
./bbotk.Rcheck/00check.log:* checking package directory ... OK
./bbotk.Rcheck/00check.log:* checking ‘build’ directory ... OK
./bbotk.Rcheck/00check.log:* checking DESCRIPTION meta-information ... OK
./bbotk.Rcheck/00check.log:* checking top-level files ... OK
./bbotk.Rcheck/00check.log:* checking for left-over files ... OK
./bbotk.Rcheck/00check.log:* checking index information ... OK
./bbotk.Rcheck/00check.log:* checking package subdirectories ... OK
./bbotk.Rcheck/00check.log:* checking R files for non-ASCII characters ... OK
./bbotk.Rcheck/00check.log:* checking R files for syntax errors ... OK
./bbotk.Rcheck/00check.log:* checking whether the package can be loaded ... OK
./bbotk.Rcheck/00check.log:* checking whether the package can be loaded with stated dependencies ... OK
./bbotk.Rcheck/00check.log:* checking whether the package can be unloaded cleanly ... OK
./bbotk.Rcheck/00check.log:* checking whether the namespace can be loaded with stated dependencies ... OK
./bbotk.Rcheck/00check.log:* checking whether the namespace can be unloaded cleanly ... OK
./bbotk.Rcheck/00check.log:* checking loading without being on the library search path ... OK
./bbotk.Rcheck/00check.log:* checking dependencies in R code ... OK
./bbotk.Rcheck/00check.log:* checking S3 generic/method consistency ... OK
./bbotk.Rcheck/00check.log:* checking replacement functions ... OK
./bbotk.Rcheck/00check.log:* checking foreign function calls ... OK
./bbotk.Rcheck/00check.log:* checking R code for possible problems ... OK
./bbotk.Rcheck/00check.log:* checking Rd files ... OK
./bbotk.Rcheck/00check.log:* checking Rd metadata ... OK
./bbotk.Rcheck/00check.log:* checking Rd cross-references ... OK
./bbotk.Rcheck/00check.log:* checking for missing documentation entries ... OK
./bbotk.Rcheck/00check.log:* checking for code/documentation mismatches ... OK
./bbotk.Rcheck/00check.log:* checking Rd \usage sections ... OK
./bbotk.Rcheck/00check.log:* checking Rd contents ... OK
./bbotk.Rcheck/00check.log:* checking for unstated dependencies in examples ... OK
./bbotk.Rcheck/00check.log:* checking line endings in C/C++/Fortran sources/headers ... OK
./bbotk.Rcheck/00check.log:* checking compiled code ... OK
./bbotk.Rcheck/00check.log:* checking installed files from ‘inst/doc’ ... OK
./bbotk.Rcheck/00check.log:* checking files in ‘vignettes’ ... OK
./bbotk.Rcheck/00check.log:* checking examples ... OK
./bbotk.Rcheck/00check.log:* checking for unstated dependencies in ‘tests’ ... OK
./bbotk.Rcheck/00check.log:* checking tests ... ERROR
./bbotk.Rcheck/00check.log:  Running ‘testthat.R’
./bbotk.Rcheck/00check.log:Running the tests in ‘tests/testthat.R’ failed.
./bbotk.Rcheck/00check.log:Last 13 lines of output:
./bbotk.Rcheck/00check.log:  ── Failure (test_OptimInstanceSingleCrit.R:64:3): OptimInstance works with extras input ──
./bbotk.Rcheck/00check.log:  inst$archive$data$extra3[1:3] (`actual`) not equal to list(NULL, NULL, NULL) (`expected`).
./bbotk.Rcheck/00check.log:  
./bbotk.Rcheck/00check.log:  `actual[[1]]` is a logical vector (NA)
./bbotk.Rcheck/00check.log:  `expected[[1]]` is NULL
./bbotk.Rcheck/00check.log:  
./bbotk.Rcheck/00check.log:  `actual[[2]]` is a logical vector (NA)
./bbotk.Rcheck/00check.log:  `expected[[2]]` is NULL
./bbotk.Rcheck/00check.log:  
./bbotk.Rcheck/00check.log:  `actual[[3]]` is a logical vector (NA)
./bbotk.Rcheck/00check.log:  `expected[[3]]` is NULL
./bbotk.Rcheck/00check.log:  
./bbotk.Rcheck/00check.log:  [ FAIL 1 | WARN 6 | SKIP 1 | PASS 386 ]
./bbotk.Rcheck/00check.log:  Error: Test failures
./bbotk.Rcheck/00check.log:  Execution halted
./bbotk.Rcheck/00check.log:* checking for unstated dependencies in vignettes ... OK
./bbotk.Rcheck/00check.log:* checking package vignettes in ‘inst/doc’ ... OK
./bbotk.Rcheck/00check.log:* checking running R code from vignettes ... NONE
./bbotk.Rcheck/00check.log:  ‘bbotk.Rmd’ using ‘UTF-8’... OK
./bbotk.Rcheck/00check.log:* checking re-building of vignette outputs ... OK
./bbotk.Rcheck/00check.log:* checking PDF version of manual ... OK
./bbotk.Rcheck/00check.log:* DONE
./bbotk.Rcheck/00check.log:Status: 1 ERROR


eplusr_0.14.2.tar.gz
./eplusr.Rcheck/00check.log:* using log directory ‘/home/mdowle/build/revdeplib/eplusr.Rcheck’
./eplusr.Rcheck/00check.log:* using R version 4.0.3 (2020-10-10)
./eplusr.Rcheck/00check.log:* using platform: x86_64-pc-linux-gnu (64-bit)
./eplusr.Rcheck/00check.log:* using session charset: UTF-8
./eplusr.Rcheck/00check.log:* checking for file ‘eplusr/DESCRIPTION’ ... OK
./eplusr.Rcheck/00check.log:* this is package ‘eplusr’ version ‘0.14.2’
./eplusr.Rcheck/00check.log:* package encoding: UTF-8
./eplusr.Rcheck/00check.log:* checking package namespace information ... OK
./eplusr.Rcheck/00check.log:* checking package dependencies ... OK
./eplusr.Rcheck/00check.log:* checking if this is a source package ... OK
./eplusr.Rcheck/00check.log:* checking if there is a namespace ... OK
./eplusr.Rcheck/00check.log:* checking for executable files ... OK
./eplusr.Rcheck/00check.log:* checking for hidden files and directories ... OK
./eplusr.Rcheck/00check.log:* checking for portable file names ... OK
./eplusr.Rcheck/00check.log:* checking for sufficient/correct file permissions ... OK
./eplusr.Rcheck/00check.log:* checking whether package ‘eplusr’ can be installed ... OK
./eplusr.Rcheck/00check.log:* checking installed package size ... OK
./eplusr.Rcheck/00check.log:* checking package directory ... OK
./eplusr.Rcheck/00check.log:* checking DESCRIPTION meta-information ... OK
./eplusr.Rcheck/00check.log:* checking top-level files ... OK
./eplusr.Rcheck/00check.log:* checking for left-over files ... OK
./eplusr.Rcheck/00check.log:* checking index information ... OK
./eplusr.Rcheck/00check.log:* checking package subdirectories ... OK
./eplusr.Rcheck/00check.log:* checking R files for non-ASCII characters ... OK
./eplusr.Rcheck/00check.log:* checking R files for syntax errors ... OK
./eplusr.Rcheck/00check.log:* checking whether the package can be loaded ... OK
./eplusr.Rcheck/00check.log:* checking whether the package can be loaded with stated dependencies ... OK
./eplusr.Rcheck/00check.log:* checking whether the package can be unloaded cleanly ... OK
./eplusr.Rcheck/00check.log:* checking whether the namespace can be loaded with stated dependencies ... OK
./eplusr.Rcheck/00check.log:* checking whether the namespace can be unloaded cleanly ... OK
./eplusr.Rcheck/00check.log:* checking loading without being on the library search path ... OK
./eplusr.Rcheck/00check.log:* checking dependencies in R code ... OK
./eplusr.Rcheck/00check.log:* checking S3 generic/method consistency ... OK
./eplusr.Rcheck/00check.log:* checking replacement functions ... OK
./eplusr.Rcheck/00check.log:* checking foreign function calls ... OK
./eplusr.Rcheck/00check.log:* checking R code for possible problems ... OK
./eplusr.Rcheck/00check.log:* checking Rd files ... OK
./eplusr.Rcheck/00check.log:* checking Rd metadata ... OK
./eplusr.Rcheck/00check.log:* checking Rd cross-references ... OK
./eplusr.Rcheck/00check.log:* checking for missing documentation entries ... OK
./eplusr.Rcheck/00check.log:* checking for code/documentation mismatches ... OK
./eplusr.Rcheck/00check.log:* checking Rd \usage sections ... OK
./eplusr.Rcheck/00check.log:* checking Rd contents ... OK
./eplusr.Rcheck/00check.log:* checking for unstated dependencies in examples ... OK
./eplusr.Rcheck/00check.log:* checking R/sysdata.rda ... OK
./eplusr.Rcheck/00check.log:* checking examples ... ERROR
./eplusr.Rcheck/00check.log:Running examples in ‘eplusr-Ex.R’ failed
./eplusr.Rcheck/00check.log:The error most likely occurred in:
./eplusr.Rcheck/00check.log:> ### Name: is_eplus_ver
./eplusr.Rcheck/00check.log:> ### Title: Check for Idd, Idf and Epw objects
./eplusr.Rcheck/00check.log:> ### Aliases: is_eplus_ver is_idd_ver is_eplus_path is_idd is_idf
./eplusr.Rcheck/00check.log:> ###   is_iddobject is_idfobject is_epw
./eplusr.Rcheck/00check.log:> 
./eplusr.Rcheck/00check.log:> ### ** Examples
./eplusr.Rcheck/00check.log:> 
./eplusr.Rcheck/00check.log:> is_eplus_ver(8.8)
./eplusr.Rcheck/00check.log:[1] TRUE
./eplusr.Rcheck/00check.log:> is_eplus_ver(8.0)
./eplusr.Rcheck/00check.log:[1] TRUE
./eplusr.Rcheck/00check.log:> is_eplus_ver("latest", strict = FALSE)
./eplusr.Rcheck/00check.log:[1] TRUE
./eplusr.Rcheck/00check.log:> 
./eplusr.Rcheck/00check.log:> is_idd_ver("9.0.1")
./eplusr.Rcheck/00check.log:[1] TRUE
./eplusr.Rcheck/00check.log:> is_idd_ver("8.0.1")
./eplusr.Rcheck/00check.log:[1] FALSE
./eplusr.Rcheck/00check.log:> 
./eplusr.Rcheck/00check.log:> is_eplus_path("C:/EnergyPlusV9-0-0")
./eplusr.Rcheck/00check.log:[1] FALSE
./eplusr.Rcheck/00check.log:> is_eplus_path("/usr/local/EnergyPlus-9-0-1")
./eplusr.Rcheck/00check.log:[1] FALSE
./eplusr.Rcheck/00check.log:> 
./eplusr.Rcheck/00check.log:> is_idd(use_idd(8.8, download = "auto"))
./eplusr.Rcheck/00check.log:IDD v8.8.0 has not been parsed before.
./eplusr.Rcheck/00check.log:Try to locate 'Energy+.idd' in EnergyPlus v8.8.0 installation folder '/usr/local/EnergyPlus-8-8-0'.
./eplusr.Rcheck/00check.log:Failed to locate 'Energy+.idd' because EnergyPlus v8.8.0 is not available.
./eplusr.Rcheck/00check.log:Starting to download the IDD file from EnergyPlus GitHub repo...
./eplusr.Rcheck/00check.log:trying URL 'https://raw.githubusercontent.com/NREL/EnergyPlus/v9.4.0/idd/V8-8-0-Energy%2B.idd'
./eplusr.Rcheck/00check.log:Content type 'text/plain; charset=utf-8' length 4055399 bytes (3.9 MB)
./eplusr.Rcheck/00check.log:==================================================
./eplusr.Rcheck/00check.log:downloaded 3.9 MB
./eplusr.Rcheck/00check.log:EnergyPlus v8.8.0 IDD file 'V8-8-0-Energy+.idd' has been successfully downloaded into /tmp/RtmpkJ6ECM.
./eplusr.Rcheck/00check.log:IDD file found: '/tmp/RtmpkJ6ECM/V8-8-0-Energy+.idd'.
./eplusr.Rcheck/00check.log:Start parsing...
./eplusr.Rcheck/00check.log:Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
./eplusr.Rcheck/00check.log:  Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
./eplusr.Rcheck/00check.log:Calls: is_idd ... parse_field_reference_table -> [ -> [.data.table -> vecseq
./eplusr.Rcheck/00check.log:Execution halted
./eplusr.Rcheck/00check.log:* checking for unstated dependencies in ‘tests’ ... OK
./eplusr.Rcheck/00check.log:* checking tests ... ERROR
./eplusr.Rcheck/00check.log:  Running ‘testthat.R’
./eplusr.Rcheck/00check.log:Running the tests in ‘tests/testthat.R’ failed.
./eplusr.Rcheck/00check.log:Last 13 lines of output:
./eplusr.Rcheck/00check.log:    4. │     └─eplusr:::parse_idf_file(path, idd)
./eplusr.Rcheck/00check.log:    5. │       └─eplusr:::get_idd_from_ver(idf_ver, idd)
./eplusr.Rcheck/00check.log:    6. └─eplusr::use_idd(8.8, "auto")
./eplusr.Rcheck/00check.log:    7.   └─eplusr:::read_idd(idd)
./eplusr.Rcheck/00check.log:    8.     └─Idd$new(path)
./eplusr.Rcheck/00check.log:    9.       └─eplusr:::initialize(...)
./eplusr.Rcheck/00check.log:   10.         └─eplusr:::parse_idd_file(path)
./eplusr.Rcheck/00check.log:   11.           └─eplusr:::parse_field_reference_table(dt_field)
./eplusr.Rcheck/00check.log:   12.             ├─refs[obj_fld, on = list(reference = object_list), allow.cartesian = TRUE]
./eplusr.Rcheck/00check.log:   13.             └─data.table:::`[.data.table`(...)
./eplusr.Rcheck/00check.log:   14.               └─data.table:::vecseq(...)
./eplusr.Rcheck/00check.log:  
./eplusr.Rcheck/00check.log:  [ FAIL 47 | WARN 0 | SKIP 58 | PASS 653 ]
./eplusr.Rcheck/00check.log:  Error: Test failures
./eplusr.Rcheck/00check.log:  Execution halted
./eplusr.Rcheck/00check.log:* checking PDF version of manual ... OK
./eplusr.Rcheck/00check.log:* DONE
./eplusr.Rcheck/00check.log:Status: 2 ERRORs

@tdhock
Member Author

tdhock commented Jun 27, 2021

Wow, that revdep check is very useful.
The bbotk test failure looks like an easy fix (just update the expected values; I can do that).
I'm not sure about the eplusr example/test errors. @hongyuanjia, you are the eplusr maintainer, right? We would like to change the way data.table outputs missing list columns, and eplusr seems to fail (probably as a result of this change). Do those errors look familiar to you? Could you please try updating data.table via remotes::install_github("Rdatatable/data.table@fix-missing-list-column-na-rm") and then see if you can reproduce and fix those errors?
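
For a downstream test like the failing bbotk expectation, one possible approach is to accept either representation of an empty list cell. This is a hedged base R sketch; `is_missing_cell` and `all_cells_missing` are hypothetical helper names, not functions from data.table or bbotk:

```r
# Hypothetical helper: TRUE when a list-column cell is empty under either
# convention (NULL, or a length-1 logical NA).
is_missing_cell <- function(x) {
  is.null(x) || (is.logical(x) && length(x) == 1L && is.na(x))
}

# Hypothetical helper: TRUE when every cell in a list column is empty.
all_cells_missing <- function(cells) {
  all(vapply(cells, is_missing_cell, logical(1)))
}

all_cells_missing(list(NULL, NULL, NULL))  # TRUE (old representation)
all_cells_missing(list(NA, NA, NA))        # TRUE (representation in this PR)
all_cells_missing(list(NA, 1, NULL))       # FALSE (one cell holds a value)
```

A check written this way would pass regardless of which data.table version is installed, at the cost of no longer pinning down the exact representation.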

@jangorecki
Member

The problem is that "just 2 would fail" on CRAN, but much more user code that is not on CRAN could fail too. I think we should provide an option to allow a smoother transition, possibly keeping it disabled at first and changing the default to active later.
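
An opt-in transition of this kind could look roughly like the following base R sketch. The option name `example.list_fill_na` and the function `fill_missing_cells` are made up for illustration; neither exists in data.table, and the real change lives in C code:

```r
# Hypothetical: choose the fill value for empty list cells from an option
# that defaults to the old behaviour.
fill_missing_cells <- function(n) {
  if (isTRUE(getOption("example.list_fill_na", FALSE))) {
    rep(list(NA), n)   # opt-in behaviour: logical NA in each cell
  } else {
    vector("list", n)  # default behaviour: NULL in each cell
  }
}

default_cells <- fill_missing_cells(2)       # list(NULL, NULL)

old <- options(example.list_fill_na = TRUE)  # opt in
opted_in_cells <- fill_missing_cells(2)      # list(NA, NA)
options(old)                                 # restore the previous setting
```

Flipping the default later would then only require changing the fallback passed to `getOption`.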

@tdhock
Member Author

tdhock commented Jun 28, 2021

By the way, since #5054 was merged, current master gives the following.
It now "works" for the original example with na.rm=T:

> (DT_missing_l_2 <- data.table(num_1=1, num_2=2, l_1=list(1), l_3=list(3)))
   num_1 num_2    l_1    l_3
   <num> <num> <list> <list>
1:     1     2      1      3
> (melt.T <- melt(DT_missing_l_2, measure.vars=measure(value.name, char), na.rm=T))
     char   num      l
   <char> <num> <list>
1:      1     1      1

However, it is still inconsistent with na.rm=F:

> (melt.F <- melt(DT_missing_l_2, measure.vars=measure(value.name, char), na.rm=F))
     char   num      l
   <char> <num> <list>
1:      1     1      1
2:      2     2       
3:      3    NA      3
> na.omit(melt.F)
     char   num      l
   <char> <num> <list>
1:      1     1      1
2:      2     2       

Again, for consistency, the NULL should be changed to NA in the list column, so the result of this na.omit would be the same as above with na.rm=T.

@hongyuanjia
Contributor

This PR also changes the default behaviour of data.table::dcast.data.table() for list columns. I did not find any option to revert to the original behavior.

library(data.table) # CRAN
options(datatable.print.class = TRUE)

dt1 <- data.table(
    id = 1:3,
    x = c("a", "b", "c"),
    y = list("A", "B", "C")
)

(dt2 <- dcast.data.table(dt1, id ~ x, value.var = "y"))
#>       id      a      b      c
#>    <int> <list> <list> <list>
#> 1:     1      A              
#> 2:     2             B       
#> 3:     3                    C

str(dt2)
#> Classes 'data.table' and 'data.frame':   3 obs. of  4 variables:
#>  $ id: int  1 2 3
#>  $ a :List of 3
#>   ..$ : chr "A"
#>   ..$ : NULL
#>   ..$ : NULL
#>  $ b :List of 3
#>   ..$ : NULL
#>   ..$ : chr "B"
#>   ..$ : NULL
#>  $ c :List of 3
#>   ..$ : NULL
#>   ..$ : NULL
#>   ..$ : chr "C"
#>  - attr(*, ".internal.selfref")=<externalptr> 
#>  - attr(*, "sorted")= chr "id"
library(data.table) # PR
options(datatable.print.class = TRUE)

dt1 <- data.table(
    id = 1:3,
    x = c("a", "b", "c"),
    y = list("A", "B", "C")
)

(dt2 <- dcast.data.table(dt1, id ~ x, value.var = "y"))
#>       id      a      b      c
#>    <int> <list> <list> <list>
#> 1:     1      A     NA     NA
#> 2:     2     NA      B     NA
#> 3:     3     NA     NA      C

str(dt2)
#> Classes 'data.table' and 'data.frame':   3 obs. of  4 variables:
#>  $ id: int  1 2 3
#>  $ a :List of 3
#>   ..$ : chr "A"
#>   ..$ : logi NA
#>   ..$ : logi NA
#>  $ b :List of 3
#>   ..$ : logi NA
#>   ..$ : chr "B"
#>   ..$ : logi NA
#>  $ c :List of 3
#>   ..$ : logi NA
#>   ..$ : logi NA
#>   ..$ : chr "C"
#>  - attr(*, ".internal.selfref")=<externalptr> 
#>  - attr(*, "sorted")= chr "id"
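
Since no option to revert exists in this PR, downstream code that depends on NULL cells could normalize a list column itself. The following is a base R sketch under that assumption; `na_cells_to_null` is a hypothetical helper, not part of data.table or eplusr:

```r
# Hypothetical helper: turn length-1 logical NA cells back into NULL cells.
# lapply() keeps NULL results as list elements, so the length is unchanged.
na_cells_to_null <- function(col) {
  lapply(col, function(x) {
    if (is.logical(x) && length(x) == 1L && is.na(x)) NULL else x
  })
}

new_style <- list("A", NA, NA)              # column `a` as printed with this PR
old_style <- na_cells_to_null(new_style)    # list("A", NULL, NULL)
```

Note that this also converts a cell that legitimately holds a single logical NA, so it is only safe when such cells cannot occur in the data.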

@tdhock
Member Author

tdhock commented Jun 30, 2021

bbotk devs said they agree to make the change, mlr-org/bbotk#147 (comment)

@mattdowle mattdowle added this to the 1.14.1 milestone Jul 16, 2021
@mattdowle
Member

mattdowle commented Jul 16, 2021

In an effort to merge this PR and make progress, I added a listNA argument to writeNA and set it to true only from melt. This i) passes the new melt test, ii) means that all other test changes can be reverted, so there are no breaking changes, and iii) enables testing the approach of using ScalarLogical() to get to R's global constant R_LogicalNAValue that I suggested above, at least from melt, so that if there are any memory or rchk problems they are at least isolated to melt.

I'm now remembering that using NA instead of NULL in list columns has come up before, and we decided against it at the time here: #4198 (comment). To put an example to what I had in mind there: a logical NA could be a valid non-missing entry in a list column whose cells contain logical vectors of varying lengths:

> DT = data.table(A=c(1,1,2,3,3), B=c(TRUE, NA, NA, NA, NA))
> DT[, .(list(B)), by=A]
       A        V1
   <num>    <list>
1:     1 TRUE,  NA
2:     2        NA
3:     3     NA,NA
>

That NA on the 2nd row represents a non-missing logical vector length 1 containing NA.
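
The same ambiguity can be seen in base R alone. As a sketch (using split() in place of the grouped data.table query, which is a substitution for illustration), the per-group vectors are built as a plain list and then inspected with is.na():

```r
A <- c(1, 1, 2, 3, 3)
B <- c(TRUE, NA, NA, NA, NA)

# One logical vector per group of A, analogous to DT[, .(list(B)), by=A].
groups <- unname(split(B, A))

lengths(groups)  # 2 1 2
is.na(groups)    # FALSE  TRUE FALSE
```

Group 2 holds real data (a length-1 vector whose value happens to be NA), yet is.na() flags it exactly as it would flag a cell filled with NA to mean "no data".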

However, that's just one edge case and not the big picture. We could still change to NA in list columns: it is being requested, after all. But it would need a consultation exercise with users, an option to revert, and a potentially-breaking-changes section in the news. In the meantime, adding listNA to writeNA was a way to make progress on this melt feature.

@mattdowle mattdowle merged commit 129366e into master Jul 16, 2021
@mattdowle mattdowle deleted the fix-missing-list-column-na-rm branch July 16, 2021 17:52
@tdhock
Member Author

tdhock commented Jul 18, 2021

sounds like a great plan, thanks Matt.

@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023