Allowing rowname specification in as.matrix.data.table #2692

Closed
sritchie73 opened this issue Mar 21, 2018 · 7 comments

Comments

@sritchie73
Contributor

Particularly after performing dcast(), I frequently find myself writing and using the following function to convert a data.table to a matrix:

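dt.to.matrix <- function(x) {
  x <- as.data.frame(x)
  # Use the first column as the row names
  rownames(x) <- x[, 1]
  # drop = FALSE keeps a data.frame (and its row names) even if only one column remains
  x <- as.matrix(x[, -1, drop = FALSE])
  x
}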

data.tables do not have a rownames attribute so this information is typically stored as the first column of the data.table. When converting to a matrix it is typically desirable to make this column the rownames() on the matrix. Currently, you have to jump through several hoops to make this conversion following the code above.

This could be taken care of by as.matrix.data.table() itself, e.g. through an additional argument, something like as.matrix(dt, rownames = 1), analogous to the keep.rownames argument in as.data.table.

If there isn't an obvious reason why this is a bad idea (can additional arguments be added to S3 methods?), I'm happy to put together and submit a pull request.

@franknarf1
Contributor

And/or maybe port reshape2::acast as was done for dcast .. ?

@mattdowle
Member

mattdowle commented Mar 21, 2018

I see where you're coming from, from an abstract point of view. But data.tables don't have rownames because this information is typically stored in a multi-column multi-type key, which doesn't have to be the first column. Keys are so superior to rownames that I don't see why you'd want to convert to a matrix really. Also, a matrix is of a single type for all columns. So the fact that a data.table can be converted to a matrix at all means that all the columns in the data.table are the same type. It should either have been a matrix in the first place, or a tall and skinny data.table rather than short and fat, perhaps.

What do you do with the matrix once you've got it?

I don't have any objection to as.matrix.data.table being extended like that, but I wonder if it would make the wrong solution easier to do. Would it paste together the key into one longer string to go in the rownames? The horror of that approach was one reason I created data.table without rownames, but multi-column multi-type keys instead.

@sritchie73
Contributor Author

I generally prefer to work with long skinny data.tables even with matrix data, then convert back to matrices as needed.

A common workflow for me is to:

  1. Convert a matrix into a data.table so that I can:
  2. Melt it to a tall and skinny data.table to:
  3. Run calculations over each column using by rather than apply(mat, 2, ...)
  4. Plot many columns against each other using ggplot2 by treating them as groups

I often also have multiple columns of values in a tall skinny data.table (e.g. a raw data column and a normalised data column) that I might want to split off into individual matrices using dcast.
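The round trip described above could look roughly like the following. This is only a sketch on a made-up 2 x 3 matrix; the object names (mat, long, col_means, wide, m) are illustrative, and the last two lines are the manual rownames step that the proposed argument would replace. The plotting step is left out.

```r
library(data.table)

# Made-up input: a small matrix with row and column names
mat <- matrix(1:6, nrow = 2,
              dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# 1. matrix -> data.table; keep.rownames = TRUE stores the row names in column "rn"
dt <- as.data.table(mat, keep.rownames = TRUE)

# 2. wide -> tall and skinny
long <- melt(dt, id.vars = "rn", variable.name = "column", value.name = "value")

# 3. per-column calculation with `by` rather than apply(mat, 2, ...)
col_means <- long[, .(mean = mean(value)), by = column]

# Back to wide with dcast, then to a matrix, restoring the row names by hand
wide <- dcast(long, rn ~ column, value.var = "value")
m <- as.matrix(wide[, setdiff(names(wide), "rn"), with = FALSE])
rownames(m) <- wide$rn
```

The final two lines are the "hoops" in question: the rn column has to be dropped and reattached manually after as.matrix().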

Another scenario is where the raw data comes in a mixed matrix / data.table format, where there are several columns of information, and many columns of measurements – this I would split off into a data.table of information and a matrix of measurements.
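A minimal sketch of that split, assuming a hypothetical input with two information columns and three measurement columns (all column names and values here are invented for illustration). It also shows one way a result computed on the matrix might be joined back to the information table.

```r
library(data.table)

# Hypothetical mixed-format input: information columns plus measurement columns
raw <- data.table(
  sample_id = c("s1", "s2", "s3"),
  group     = c("ctl", "ctl", "trt"),
  m1 = c(0.1, 0.2, 0.3),
  m2 = c(1, 2, 3),
  m3 = c(10, 20, 30)
)

info_cols    <- c("sample_id", "group")
measure_cols <- setdiff(names(raw), info_cols)

# Information stays in a data.table; measurements become a numeric matrix
info <- raw[, info_cols, with = FALSE]
meas <- as.matrix(raw[, measure_cols, with = FALSE])
rownames(meas) <- raw$sample_id

# Calculate on the matrix, then join the result back by sample_id
totals <- data.table(sample_id = rownames(meas), total = rowSums(meas))
info_with_totals <- merge(info, totals, by = "sample_id")
```

Keeping sample_id as the matrix row names is what makes the join back possible without relying on row order.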

I therefore wouldn't go so far as allowing as.matrix to work on multi-key data.tables, since it would be difficult to split those apart again later. Rather, I was thinking of the following behaviour:

  • as.matrix(dt, rownames=TRUE): take the first column as the rownames (or maybe key(dt) if there is a single key column).
  • as.matrix(dt, rownames=3): take the column index specified as the rownames.
  • as.matrix(dt, rownames="colname"): take the named column as the rownames.
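To make the three cases concrete, here is a rough base R sketch. The helper name to_matrix is hypothetical and is not the data.table implementation; it works on a plain data.frame (or anything as.data.frame() accepts) and ignores the key(dt) variant mentioned above.

```r
# Hypothetical helper, not part of data.table: illustrates the proposed
# `rownames` argument using base R only.
to_matrix <- function(x, rownames = NULL) {
  x <- as.data.frame(x)
  # No rownames requested: plain conversion
  if (is.null(rownames) || identical(rownames, FALSE)) {
    return(as.matrix(x))
  }
  stopifnot(length(rownames) == 1L)
  # rownames = TRUE means "use the first column"
  if (isTRUE(rownames)) rownames <- 1L
  if (is.character(rownames)) {
    # Resolve a column name to its position
    idx <- match(rownames, names(x))
    if (is.na(idx)) stop("no column named '", rownames, "'")
  } else {
    idx <- as.integer(rownames)
    if (idx < 1L || idx > ncol(x)) stop("column index out of range")
  }
  # Drop the chosen column from the data and use it as the row names
  m <- as.matrix(x[, -idx, drop = FALSE])
  rownames(m) <- as.character(x[[idx]])
  m
}

df <- data.frame(id = c("a", "b"), x = c(1, 2), y = c(3, 4))
m1 <- to_matrix(df, rownames = TRUE)  # first column
m2 <- to_matrix(df, rownames = 1)     # column index
m3 <- to_matrix(df, rownames = "id")  # column name
```

All three calls select the same column here, so they produce the same numeric matrix with row names taken from id.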

We might also consider adding the same argument to as.data.frame(), although I cannot think of a scenario where that would be useful off the top of my head.

@MichaelChirico
Member

MichaelChirico commented Mar 22, 2018 via email

@mattdowle
Member

Thanks for info. I see now. Sounds good, then!

@jangorecki
Member

@sritchie73 A minimal example of your workflow to split dimensions into a data.table and measures into a matrix (+ calculate and join back) could be added as a test.

@mattdowle
Member

Merged PR should have auto-closed this one but didn't because the PR had "Implements" at the top not "Closes". Closing manually now.


5 participants