as.data.table.matrix keeps row names for null data tables #3149

Closed
mllg opened this issue Nov 16, 2018 · 6 comments

@mllg
Contributor

mllg commented Nov 16, 2018

Consider this example:

library(data.table)
M = matrix(1:3, nrow = 3)
M = M[, integer(0)]

DT = as.data.table(M) # null data table

The resulting object DT is a "Null data.table" with dimensions c(0,0). This deviates from as.data.frame, but this is documented.

> nrow(DT)
[1] 0
> dim(DT)
[1] 0 0

Confusingly, the null data.table still has rownames:

> rownames(DT)
[1] "1" "2" "3"

This is problematic because the dimensions of a data.frame are calculated from the row names:

> base::dim.data.frame(DT)
[1] 3 0

Of course, there is an S3 method dim.data.table which returns c(0, 0). However, C/C++ code that derives the dimensions from the row names gets the wrong result.

I wonder if there is any advantage in keeping the row names. I first thought they were stored for the conversion of data.tables to data.frames (as.data.frame / setDF), but the row names seem to be ignored there, too.
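
The mismatch described above can be reproduced without data.table. The following is a minimal base R sketch, assuming a hand-built zero-column object as a stand-in for the result of the conversion (it is not data.table's actual output). It shows that dim.data.frame takes the row count from the row.names attribute rather than from the columns.

```r
# Base R only; no data.table needed. The object below is a hand-built
# stand-in for the zero-column result, not data.table's actual output.
x <- structure(
  list(),
  names = character(0),
  row.names = .set_row_names(3L),  # compact row names for 3 rows
  class = "data.frame"
)

n_cols <- length(x)               # number of columns: 0
rn     <- rownames(x)             # row names expanded from the compact form
n_rows <- .row_names_info(x, 2L)  # row count implied by the row.names attribute
dims   <- dim(x)                  # dispatches to dim.data.frame

print(rn)
print(dims)
```

Any code that computes the row count this way, instead of calling the data.table method, sees three rows even though there are no columns.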

@sritchie73
Contributor

I can't replicate this behavior:

> library(data.table)
data.table 1.11.8  Latest news: r-datatable.com
> M = matrix(1:3, nrow = 3)
> M = M[, integer(0)]
> 
> DT = as.data.table(M) # null data table
> M
    
[1,]
[2,]
[3,]
> dim(M)
[1] 3 0
> as.data.table(M)
Null data.table (0 rows and 0 cols)
> dim(as.data.table(M))
[1] 0 0
> rownames(M)
NULL
> rownames(M) <- 1:3
> rownames(M)
[1] "1" "2" "3"
> dim(as.data.table(M))
[1] 0 0
> nrow(as.data.table(M))
[1] 0

@jaapwalhout

jaapwalhout commented Nov 18, 2018

@sritchie73 You are doing something slightly different than @mllg, imo.

With data.table 1.11.8 I can replicate this:

> packageVersion("data.table")
[1] ‘1.11.8’

> M  <- matrix(1:3, nrow = 3)
> M <- M[, integer(0)]
> M
    
[1,]
[2,]
[3,]
> rownames(M)
NULL
> 
> DT <- as.data.table(M)
> dim(DT)
[1] 0 0
> rownames(DT)
[1] "1" "2" "3"
> base::dim.data.frame(DT)
[1] 3 0

I think the problem stems from one line in the code of the as.data.table.matrix method (retrieved with getAnywhere(as.data.table.matrix)):

function (x, keep.rownames = FALSE, ...) 
{
    if (!identical(keep.rownames, FALSE)) {
        ans = data.table(rn = rownames(x), x, keep.rownames = FALSE)
        if (is.character(keep.rownames)) 
            setnames(ans, "rn", keep.rownames[1L])
        return(ans)
    }
    d <- dim(x)
    nrows <- d[1L]
    ncols <- d[2L]
    ic <- seq_len(ncols)
    value <- vector("list", ncols)
    if (mode(x) == "character") {
        for (i in ic) value[[i]] <- x[, i]
    }
    else {
        for (i in ic) value[[i]] <- as.vector(x[, i])
    }
    col_labels <- dimnames(x)[[2L]]
    if (length(col_labels) == ncols) {
        if (any(empty <- !nzchar(col_labels))) 
            col_labels[empty] <- paste0("V", ic[empty])
        setattr(value, "names", col_labels)
    }
    else {
        setattr(value, "names", paste0("V", ic))
    }
    setattr(value, "row.names", .set_row_names(nrows))
    setattr(value, "class", c("data.table", "data.frame"))
    alloc.col(value)
}

As you can see, the third line from the end of the function body sets the row names using the number of rows of the matrix M (which is 3 in this example).

A possible solution might be to set the nrows-parameter in the as.data.table.matrix-method with nrows <- d[1L] * (d[2L] > 0L) instead of just nrows <- d[1L].
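
The arithmetic behind this suggestion can be checked in base R. This is only a sketch of the expression itself; the wrapper name guarded_nrows is hypothetical and not part of data.table.

```r
# Hypothetical wrapper around the proposed expression: the row count is
# multiplied by a logical, so it collapses to 0 when there are no columns.
guarded_nrows <- function(d) d[1L] * (d[2L] > 0L)

zero_cols <- guarded_nrows(c(3L, 0L))  # 3 rows, 0 columns
two_cols  <- guarded_nrows(c(3L, 2L))  # 3 rows, 2 columns

# .set_row_names() is the base R helper used in the method above
rn_zero <- .set_row_names(zero_cols)
rn_two  <- .set_row_names(two_cols)

print(zero_cols)
print(two_cols)
```

With zero columns the row names become an empty integer vector, so the row.names attribute no longer implies three rows.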

@sritchie73
Contributor

Wouldn't it make more sense to just grab the number of rows from the new data.table? I.e. change the fourth-to-last line to:

setattr(value, "row.names", .set_row_names(nrow(value)))
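
One caveat may be worth checking before adopting this: in the method quoted earlier, the class attribute is set one line after the row names, so at that point value appears to still be a plain list. The base R sketch below, which does not use data.table, shows what nrow() returns for a plain list and one possible alternative based on the length of the first column. The name n_safe is hypothetical.

```r
# A plain list has no dim attribute, so nrow() returns NULL
value <- vector("list", 0L)
n <- nrow(value)

# .set_row_names(NULL) fails because `if (n > 0)` gets a zero-length condition
res <- tryCatch(.set_row_names(n), error = function(e) "error")

# Possible alternative (an assumption, not the fix that was merged):
# take the length of the first column, or 0 when there are no columns
n_safe <- if (length(value)) length(value[[1L]]) else 0L

print(n)
print(res)
print(n_safe)
```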

@MichaelChirico
Member

interesting that M does indeed have a shape...

dput(M)
# structure(integer(0), .Dim = c(3L, 0L))

But as data.table is a columnar object, we can tell at this line:

value <- vector("list", ncols)

that the end result is going to be data.table(NULL).

I'm not 100% sure what the "right" result is, but if it's going to be a NULL data.table, might as well just do this:

ncols <- d[2L]
if (!ncols) return(null.data.table())
# ...
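
The early-return idea can be sketched in base R as follows. The function matrix_to_frame is a hypothetical, simplified stand-in that builds a plain data.frame-classed list; it is not data.table's implementation and does not call null.data.table().

```r
# Hypothetical helper, not data.table's code: converts a matrix to a list of
# columns with data.frame-style row names, returning early for zero columns.
matrix_to_frame <- function(x) {
  d <- dim(x)
  ncols <- d[2L]
  if (!ncols) {
    # Zero columns: return an empty object with zero row names
    return(structure(list(), names = character(0),
                     row.names = integer(0), class = "data.frame"))
  }
  value <- lapply(seq_len(ncols), function(i) as.vector(x[, i]))
  names(value) <- paste0("V", seq_len(ncols))
  structure(value, row.names = .set_row_names(d[1L]), class = "data.frame")
}

M <- matrix(1:3, nrow = 3)[, integer(0)]          # 3 x 0 matrix
empty_dims  <- dim(matrix_to_frame(M))
filled_dims <- dim(matrix_to_frame(matrix(1:6, nrow = 3)))

print(empty_dims)
print(filled_dims)
```

Because the zero-column case never reaches the line that sets the row names, the result cannot carry a stale row count.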

@jangorecki
Member

@MichaelChirico's suggestion was the simplest and addresses the issue well. I pushed it in a PR and added a few more tests for other edge cases so we are more future-proof.

@sritchie73
Contributor

This problem also arises with as.data.table.data.frame:

> library(data.table)
> mtcars[,0]
data frame with 0 columns and 32 rows
> DF <- mtcars[,0]
> DT <- as.data.table(DF)
> DT
Null data.table (0 rows and 0 cols)
> rownames(DT)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26"
[27] "27" "28" "29" "30" "31" "32"
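
A small check along these lines could be used to detect the inconsistency for any input type. This is a sketch in base R; rows_consistent is a hypothetical helper, and the data.table call is left commented out because it assumes the package is installed.

```r
# Hypothetical helper: TRUE when the row count implied by the row.names
# attribute matches the row count reported by the dispatched dim() method.
rows_consistent <- function(x) {
  .row_names_info(x, 2L) == dim(x)[1L]
}

DF <- mtcars[, 0]               # zero columns, 32 rows
df_ok <- rows_consistent(DF)    # a plain data.frame is consistent
print(df_ok)

# Assumes data.table is installed; based on the output above, this would
# presumably be FALSE on 1.11.8:
# rows_consistent(data.table::as.data.table(DF))
```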
