Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

as.data.table.list centralized #3471

Merged
merged 17 commits into from
Jun 14, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 20 additions & 2 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -92,6 +92,8 @@

15. `DT[order(col)[1:5], ...]` (i.e. where `i` is a compound expression involving `order()`) is now optimized to use `data.table`'s multithreaded `forder`, [#1921](https://github.com/Rdatatable/data.table/issues/1921). This example is not a fully optimal top-N query since the full ordering is still computed. The improvement is that the call to `order()` is computed faster for any `i` expression using `order`.

16. `as.data.table` now unpacks columns in a `data.frame` which are themselves a `data.frame`. This need arises when parsing JSON, a corollary in [#3369](https://github.com/Rdatatable/data.table/issues/3369#issuecomment-462662752). `data.table` does not allow columns to be objects which themselves have columns (such as `matrix` and `data.frame`), unlike `data.frame` which does. Bug fix 19 in v1.12.2 (see below) added a helpful error (rather than segfault) to detect such invalid `data.table`, and promised that `as.data.table()` would unpack these columns in the next release (i.e. this release) so that the invalid `data.table` is not created in the first place.

#### BUG FIXES

1. `first`, `last`, `head` and `tail` by group no longer error in some cases, [#2030](https://github.com/Rdatatable/data.table/issues/2030) [#3462](https://github.com/Rdatatable/data.table/issues/3462). Thanks to @franknarf1 for reporting.
Expand Down Expand Up @@ -134,7 +136,23 @@

20. `c`, `seq` and `mean` of `ITime` objects now retain the `ITime` class via new `ITime` methods, [#3628](https://github.com/Rdatatable/data.table/issues/3628). Thanks @UweBlock for reporting. The `cut` and `split` methods for `ITime` have been removed since the default methods work, [#3630](https://github.com/Rdatatable/data.table/pull/3630).

20. `as.data.table.array` now handles the case when some of the array's dimension names are `NULL`, [#3636](https://github.com/Rdatatable/data.table/issues/3636).
21. `as.data.table.array` now handles the case when some of the array's dimension names are `NULL`, [#3636](https://github.com/Rdatatable/data.table/issues/3636).

22. Adding a `list` column using `cbind`, `as.data.table`, or `data.table` now works rather than treating the `list` as if it were a set of columns, [#3471](https://github.com/Rdatatable/data.table/pull/3471). However, please note that using `:=` to add columns is preferred.

```R
# cbind( data.table(1:2), list(c("a","b"),"a") )
# V1 V2 NA # v1.12.2 and before
# <int> <char> <char> # introduced invalid NA column name too
# 1: 1 a a
# 2: 2 b a
#
# V1 V2 # v1.12.4+
# <int> <list>
# 1: 1 a,b
# 2: 2 a
```


#### NOTES

Expand Down Expand Up @@ -231,7 +249,7 @@

18. `cbind` with a null (0-column) `data.table` now works as expected, [#3445](https://github.com/Rdatatable/data.table/issues/3445). Thanks to @mb706 for reporting.

19. Subsetting does a better job of catching a malformed `data.table` with error rather than segfault. A column may not be NULL, nor may a column be an object such as a data.frame or matrix which have columns. Thanks to a comment and reproducible example in [#3369](https://github.com/Rdatatable/data.table/issues/3369) from Drew Abbot which demonstrated the issue which arose from parsing JSON. The next release will enable `as.data.table` to unpack columns which are data.frame to support this use case.
19. Subsetting does a better job of catching a malformed `data.table` with error rather than segfault. A column may not be NULL, nor may a column be an object which has columns (such as a `data.frame` or `matrix`). Thanks to a comment and reproducible example in [#3369](https://github.com/Rdatatable/data.table/issues/3369) from Drew Abbot which demonstrated the issue which arose from parsing JSON. The next release will enable `as.data.table` to unpack columns which are `data.frame` to support this use case.

#### NOTES

Expand Down
106 changes: 68 additions & 38 deletions R/as.data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -115,49 +115,75 @@ as.data.table.array = function(x, keep.rownames=FALSE, key=NULL, sorted=TRUE, va
ans[]
}

as.data.table.list = function(x, keep.rownames=FALSE, key=NULL, ...) {
wn = sapply(x,is.null)
if (any(wn)) x = x[!wn]
if (!length(x)) return( null.data.table() )
# fix for #833, as.data.table.list with matrix/data.frame/data.table as a list element..
# TODO: move this entire logic (along with data.table() to C
for (i in seq_along(x)) {
dims = dim(x[[i]])
if (!is.null(dims)) {
ans = do.call("data.table", x)
setnames(ans, make.unique(names(ans)))
return(ans)
as.data.table.list = function(x, keep.rownames=FALSE, key=NULL, check.names=FALSE, ...) {
n = length(x)
eachnrow = integer(n) # vector of lengths of each column. may not be equal if silent repetition is required.
eachncol = integer(n)
missing.check.names = missing(check.names)
for (i in seq_len(n)) {
xi = x[[i]]
if (is.null(xi)) next # eachncol already initialized to 0 by integer() above
if (!is.null(dim(xi)) && missing.check.names) check.names=TRUE
if ("POSIXlt" %chin% class(xi)) {
warning("POSIXlt column type detected and converted to POSIXct. We do not recommend use of POSIXlt at all because it uses 40 bytes to store one date.")
xi = x[[i]] = as.POSIXct(xi)
} else if (is.matrix(xi) || is.data.frame(xi)) {
if (!is.data.table(xi)) {
xi = x[[i]] = as.data.table(xi, keep.rownames=keep.rownames) # we will never allow a matrix to be a column; always unpack the columns
}
# else avoid dispatching to as.data.table.data.table (which exists and copies)
} else if (is.table(xi)) {
xi = x[[i]] = as.data.table.table(xi, keep.rownames=keep.rownames)
} else if (is.function(xi)) {
xi = x[[i]] = list(xi)
}
eachnrow[i] = NROW(xi) # for a vector (including list() columns) returns the length
eachncol[i] = NCOL(xi) # for a vector returns 1
}
n = vapply(x, length, 0L)
mn = max(n)
x = copy(x)
idx = which(n < mn)
if (length(idx)) {
for (i in idx) {
# any is.null(x[[i]]) were removed above, otherwise warning when a list element is NULL
if (inherits(x[[i]], "POSIXlt")) {
warning("POSIXlt column type detected and converted to POSIXct. We do not recommend use of POSIXlt at all because it uses 40 bytes to store one date.")
x[[i]] = as.POSIXct(x[[i]])
}
# Implementing FR #4813 - recycle with warning when nr %% nrows[i] != 0L
if (!n[i] && mn)
warning("Item ", i, " is of size 0 but maximum size is ", mn, ", therefore recycled with 'NA'")
else if (n[i] && mn %% n[i] != 0L)
warning("Item ", i, " is of size ", n[i], " but maximum size is ", mn, " (recycled leaving a remainder of ", mn%%n[i], " items)")
x[[i]] = rep(x[[i]], length.out=mn)
ncol = sum(eachncol) # hence removes NULL items silently (no error or warning), #842.
if (ncol==0L) return(null.data.table())
nrow = max(eachnrow)
ans = vector("list",ncol) # always return a new VECSXP
recycle = function(x, nrow) {
if (length(x)==nrow) {
return(copy(x))
# This copy used to be achieved via .Call(CcopyNamedInList,x) at the top of data.table(). It maintains pre-Rv3.1.0
# behavior, for now. See test 548.2. The copy() calls duplicate() at C level which (importantly) also expands ALTREP objects.
# TODO: port this as.data.table.list() to C and use MAYBE_REFERENCED(x) || ALTREP(x) to save some copies.
# That saving used to be done by CcopyNamedInList but the copies happened again as well, so removing CcopyNamedInList is
# not worse than before, and gets us in a better centralized place to port as.data.table.list to C and use MAYBE_REFERENCED
# again in future.
}
if (identical(x,list())) vector("list", nrow) else rep(x, length.out=nrow) # new objects don't need copy
}
# fix for #842
if (mn > 0L) {
nz = which(n > 0L)
xx = point(vector("list", length(nz)), seq_along(nz), x, nz)
if (!is.null(names(x)))
setattr(xx, 'names', names(x)[nz])
x = xx
vnames = character(ncol)
k = 1L
for(i in seq_len(n)) {
xi = x[[i]]
if (is.null(xi)) next
if (eachnrow[i]>1L && nrow%%eachnrow[i]!=0L) # in future: eachnrow[i]!=nrow
warning("Item ", i, " has ", eachnrow[i], " rows but longest item has ", nrow, "; recycled with remainder.")
if (eachnrow[i]==0L && nrow>0L && is.atomic(xi)) # is.atomic to ignore list() since list() is a common way to initialize; let's not insist on list(NULL)
warning("Item ", i, " has 0 rows but longest item has ", nrow, "; filled with NA") # the rep() in recycle() above creates the NA vector
if (is.data.table(xi)) { # matrix and data.frame were coerced to data.table above
# vnames[[i]] = names(xi) #if (nm!="" && n>1L) paste(nm, names(xi), sep=".") else names(xi)
for (j in seq_along(xi)) {
ans[[k]] = recycle(xi[[j]], nrow)
vnames[k] = names(xi)[j]
k = k+1L
}
} else {
nm = names(x)[i]
vnames[k] = if (length(nm) && !is.na(nm) && nm!="") nm else paste0("V",i)
ans[[k]] = recycle(xi, nrow)
k = k+1L
}
}
setDT(x, key=key) # copy ensured above; also, setDT handles naming
x
if (any(vnames==".SD")) stop("A column may not be called .SD. That has special meaning.")
if (check.names) vnames = make.names(vnames, unique=TRUE)
setattr(ans, "names", vnames)
setDT(ans, key=key) # copy ensured above; also, setDT handles naming
ans
}

# don't retain classes before "data.frame" while converting
Expand All @@ -180,6 +206,10 @@ as.data.table.data.frame = function(x, keep.rownames=FALSE, key=NULL, ...) {
setnames(ans, 'rn', keep.rownames[1L])
return(ans)
}
if (any(!sapply(x,is.atomic))) {
# a data.frame with a column that is data.frame needs to be expanded; test 2013.4
return(as.data.table.list(x, keep.rownames=keep.rownames, ...))
}
ans = copy(x) # TO DO: change this deep copy to be shallow.
setattr(ans, "row.names", .set_row_names(nrow(x)))

Expand Down
Loading