-
Notifications
You must be signed in to change notification settings - Fork 985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
melt(na.rm=TRUE) should remove rows with missing list column #5053
Conversation
hi again my fix involves a small change to writeNA in assign.c -- a logical NA scalar is used instead of NULL/R_NilValue for each list/VECSXP element. The example code above now yields: > (DT_missing_l_2 <- data.table(num_1=1, num_2=2, l_1=list(1), l_3=list(3)))
num_1 num_2 l_1 l_3
<num> <num> <list> <list>
1: 1 2 1 3
> (melt.F <- melt(DT_missing_l_2, measure.vars=measure(value.name, char), na.rm=F))
char num l
<char> <num> <list>
1: 1 1 1
2: 2 2 NA
3: 3 NA 3
> (melt.T <- melt(DT_missing_l_2, measure.vars=measure(value.name, char), na.rm=T))
char num l
<char> <num> <list>
1: 1 1 1 There were several tests which were affected by this (plonking empty list column, etc). Previously they expected NULL, but I had to change that to NA, is that OK? > melt.T[, plonk := list()][]
char num l plonk
<char> <num> <list> <list>
1: 1 1 1 NA |
inst/tests/tests.Rraw
Outdated
# Test plonk list variable (to catch deparse treating j=list() specially) | ||
x = list(2,"b",2.718) | ||
test(470, DT[,baz:=x], data.table(a=1:3,b=4:6,foo=list(),bar=list(1,"a",3.14),baz=list(2,"b",2.718))) | ||
test(470, DT[,baz:=x], data.table(a=1:3,b=4:6,foo=list(NA,NA,NA),bar=list(1,"a",3.14),baz=list(2,"b",2.718))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change to melt
looks good and I was about to merge. But then, yes as you mentioned, these test changes to list columns in DT[...]
too is causing me to hesitate. If we go ahead then the news item would need to mention it in an example in an R code block, in a potentially-breaking-changes (PBC) section at the top, and we'd bump to 1.15.0 to convey the PBC. Maybe also a consultation exercise first on Twitter to gauge user opinion. I suspect the outcome would be positive to go ahead. I can't think of any downsides really other than it being a change (caveat internals, see comment below).
dev is currently passing all 1,075 revdeps (I did a full rerun yesterday) so we could release this one soon (1.14.2) and then this PR would go into dev to be become 1.15.0. I could also do a revdep rerun with this PR and see which (if any) revdeps break; either way it would be useful to know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, either way that sounds good.
src/assign.c
Outdated
SEXP na_scalar = allocVector(LGLSXP, 1); | ||
LOGICAL(na_scalar)[0] = NA_LOGICAL; | ||
SET_VECTOR_ELT(v, i, na_scalar); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to suggest, if we go ahead with this change, using R_LogicalNAValue
here. That's R's internal constant object to save the multiple allocations (R has R_TrueValue
and R_FalseValue
too). However, then I remembered packages don't have access to those internals. We can use ScalarLogical(NA_LOGICAL)
, though, and that returns R_LogicalNAValue
without allocating. So this loop would become :
SEXP na = ScalarLogical(NA_LOGICAL);
for (int i=from; i<=to; ++i) SET_VECTOR_ELT(v, i, na);
However, lets say :=
or set
was then used to write FALSE into some of those list column cells containing NA. That would have the potential to change R's global constant (that has happened before).
However2, we're talking about items within a list column cell here, and I suspect :=
and set
don't have the ability to update that nested level. What happened before was with 1-row data.table's containing a logical column: that one-row logical column could be R_[True|False|LogicalNA]Value
and a :=
on such a one-row column did corrupt R's internal value. So we catch that now by reallocating 1-row logical columns.
However3, if :=
and set
don't already detect and prevent writing to R_[True|False|LogicalNA]Value
, we should put that in regardless (they could then either change the pointer rather than the contents of the length-1 logical, or they could allocate a new length-1 logical if an attribute is being attached by user.)
So, in summary, after writing out loud here, I would rule out this loop as it stands now for the reason of allocating all those length-1 objects could gobble memory in large cases. Which is maybe why we used empty list (R_NilValue). If we go ahead with the NULL to NA change, we could review set()
and :=
w.r.t. length-1 logicals, and if ok, use ScalarLogical(NA_LOGICAL)
here.
Maybe even the recent att="t"
attached to R_FalseValue
problem could be resolved in a different more robust way inside set
/:=
rather than what I did.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about the potential overhead in allocating a lot of length-1 logical vectors as well. I agree that using R_LogicalNA seems like a great / more efficient alternative. I did not know that :=
and set
could change that, but yes I agree that seems like something :=
/ set
should detect/handle, and that should not stop us from using a more efficient code here.
Just 2 CRAN revdeps would fail :
|
wow that revdep check is very useful. |
The problem is that "just 2 would fail" on CRAN, and then many more user code that is not on CRAN. I think we should give an option to allow for more smooth transition. Possibly keeping it disabled at start, and changing default later to active. |
By the way, since #5054 was merged, current master gives the following. > (DT_missing_l_2 <- data.table(num_1=1, num_2=2, l_1=list(1), l_3=list(3)))
num_1 num_2 l_1 l_3
<num> <num> <list> <list>
1: 1 2 1 3
> (melt.T <- melt(DT_missing_l_2, measure.vars=measure(value.name, char), na.rm=T))
char num l
<char> <num> <list>
1: 1 1 1 However it is still inconsistent with na.rm=F, > (melt.F <- melt(DT_missing_l_2, measure.vars=measure(value.name, char), na.rm=F))
char num l
<char> <num> <list>
1: 1 1 1
2: 2 2
3: 3 NA 3
> na.omit(melt.F)
char num l
<char> <num> <list>
1: 1 1 1
2: 2 2 Again, for consistency, the NULL should be changed to NA in the list column, so the result of this |
This PR also changes the default behaviour of library(data.table) # CRAN
options(datatable.print.class = TRUE)
dt1 <- data.table(
id = 1:3,
x = c("a", "b", "c"),
y = list("A", "B", "C")
)
(dt2 <- dcast.data.table(dt1, id ~ x, value.var = "y"))
#> id a b c
#> <int> <list> <list> <list>
#> 1: 1 A
#> 2: 2 B
#> 3: 3 C
str(dt2)
#> Classes 'data.table' and 'data.frame': 3 obs. of 4 variables:
#> $ id: int 1 2 3
#> $ a :List of 3
#> ..$ : chr "A"
#> ..$ : NULL
#> ..$ : NULL
#> $ b :List of 3
#> ..$ : NULL
#> ..$ : chr "B"
#> ..$ : NULL
#> $ c :List of 3
#> ..$ : NULL
#> ..$ : NULL
#> ..$ : chr "C"
#> - attr(*, ".internal.selfref")=<externalptr>
#> - attr(*, "sorted")= chr "id" library(data.table) # PR
options(datatable.print.class = TRUE)
dt1 <- data.table(
id = 1:3,
x = c("a", "b", "c"),
y = list("A", "B", "C")
)
(dt2 <- dcast.data.table(dt1, id ~ x, value.var = "y"))
#> id a b c
#> <int> <list> <list> <list>
#> 1: 1 A NA NA
#> 2: 2 NA B NA
#> 3: 3 NA NA C
str(dt2)
#> Classes 'data.table' and 'data.frame': 3 obs. of 4 variables:
#> $ id: int 1 2 3
#> $ a :List of 3
#> ..$ : chr "A"
#> ..$ : logi NA
#> ..$ : logi NA
#> $ b :List of 3
#> ..$ : logi NA
#> ..$ : chr "B"
#> ..$ : logi NA
#> $ c :List of 3
#> ..$ : logi NA
#> ..$ : logi NA
#> ..$ : chr "C"
#> - attr(*, ".internal.selfref")=<externalptr>
#> - attr(*, "sorted")= chr "id" |
bbotk devs said they agree to make the change, mlr-org/bbotk#147 (comment) |
In an effort to merge this PR and make progress, I added I'm now remembering that using NA instead of NULL in list columns has come up before and we decided against it at the time here: #4198 (comment). Putting an example to what I had in mind there; i.e. NA logical could be a valid non-missing entry in a list column where the list column contains varying lengths of logical sequences :
That NA on the 2nd row represents a non-missing logical vector length 1 containing NA. However, that's just one edge case and not the big picture. We could still change to NA in list columns: it is being requested after all. But it would need a consultation exercise with users, an option to revert, and potentially-breaking-section in news. In the meantime, adding |
sounds like a great plan, thanks Matt. |
I found another edge case worth fixing for melt on data with a missing list column. Here is a data table with "missing" input columns (l_2 and num_3).
When we melt with na.rm=F we see that third row correctly contains NA but second row in the list column contains NULL (which is NOT treated as missing, inconsistent with other types).
Then when we melt with na.rm=T we see that third row is removed (correct) but second row is kept (incorrect),
I will investigate a fix.