Discussion - data.table and record types #4910

Open
DavisVaughan opened this issue Feb 18, 2021 · 4 comments
Labels
non-atomic column e.g. list columns, S4 vector columns

Comments

@DavisVaughan
Contributor

Hi data.table team!

I would like to start a discussion about a feature request: allowing record types as columns of a data.table. If you aren't familiar with the term, we define a record type as a classed list of equal-length vectors, where the length() of the object is the length of the vectors, not the length of the list.
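
To make the definition concrete, here is a minimal base R sketch of a record type. The class name my_rcrd and the constructor new_my_rcrd() are hypothetical names invented for this illustration; this is not how clock or vctrs implement their types, only the general shape of the idea.

```r
# Hypothetical record type "my_rcrd": a classed list of equal-length fields.
new_my_rcrd <- function(year, month) {
  stopifnot(length(year) == length(month))
  structure(list(year = year, month = month), class = "my_rcrd")
}

# length() reports the number of records, not the number of fields.
length.my_rcrd <- function(x) length(unclass(x)[[1L]])

# Subsetting slices every field in parallel and keeps the class.
`[.my_rcrd` <- function(x, i) {
  fields <- lapply(unclass(x), `[`, i)
  structure(fields, class = "my_rcrd")
}

# One string per record, built from all fields.
format.my_rcrd <- function(x, ...) {
  f <- unclass(x)
  sprintf("%04d-%02d", as.integer(f$year), as.integer(f$month))
}

r <- new_my_rcrd(rep(2019L, 3), 1:3)
length(r)           # number of records
length(unclass(r))  # number of fields in the underlying list
format(r[2:3])
```

The point of the sketch is that code which assumes length(x) equals the length of the underlying storage will disagree with the object's own length() method, which is the kind of mismatch a data frame implementation has to handle.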

As of now, these aren't particularly common in R, but there is one example in base R: POSIXlt. I'm aware that POSIXlt is converted to POSIXct upon entry into a data.table, and I understand the reasons why you all do this. However, if you look beyond POSIXlt, I think that record types can be a powerful way to pack a lot of meaning into a single vector.

As an example, I've developed a new package called clock that makes heavy use of these record types. But of course, they don't work as columns of a data.table:

library(clock)
library(data.table)

x <- year_month_day(2019, 1:3)
x
#> <year_month_day<month>[3]>
#> [1] "2019-01" "2019-02" "2019-03"
unclass(x)
#> $year
#> [1] 2019 2019 2019
#> 
#> $month
#> [1] 1 2 3
#> 
#> attr(,"precision")
#> [1] 2

y <- duration_milliseconds(c(1e9, 10))
y
#> <duration<millisecond>[2]>
#> [1] 1000000000 10
unclass(y)
#> $ticks
#> [1] 11  0
#> 
#> $ticks_of_day
#> [1] 49600     0
#> 
#> $ticks_of_second
#> [1]  0 10
#> 
#> attr(,"precision")
#> [1] 8


data.table(x = x)
#> Error in dimnames(x) <- dn: length of 'dimnames' [1] not equal to array extent
data.table(y = y)
#> Error in dimnames(x) <- dn: length of 'dimnames' [1] not equal to array extent

data.frame(x = x)
#>         x
#> 1 2019-01
#> 2 2019-02
#> 3 2019-03
data.frame(y = y)
#>            y
#> 1 1000000000
#> 2         10

clock builds on the vctrs_rcrd type from the vctrs package. That type provides a lot of S3 method scaffolding to make it easier to create new record types on top of it. Because it is now much more straightforward to construct a record type in R, I feel that more might start appearing in the wild over the next few years.

I realize that this would probably be a lot of work. In the tidyverse, it was much easier to add support for these types once we added support for columns of a data frame that are also data frames (df-cols, for short). Record types can be thought of in a similar way, and often use the same underlying code when ordering, slicing, or comparing instances of them.

If you do think that this is worth pursuing, I am happy to discuss this further!

@eddelbuettel
Contributor

FWIW durations are already supported:

> library(data.table)
data.table 1.13.6 using 6 threads (see ?getDTthreads).  Latest news: r-datatable.com
> library(nanotime)
> nd <- nanoduration(hours=0, minutes=0, seconds=0, nanoseconds=c(1, 1e3, 1e6))
> nd
[1] 00:00:00.000_000_001 00:00:00.000_001     00:00:00.001        
> data.table(nd=nd)
                     nd
1: 00:00:00.000_000_001
2:     00:00:00.000_001
3:         00:00:00.001
> 

@tlapak
Contributor

tlapak commented Feb 24, 2021

Can I refer you to the discussion in #4415? As I understand it, the storage model is only slightly different for these record types, where everything is wrapped in a list as opposed to a vector plus its attributes, which in turn are a pair-list. So the handling of record types would be almost identical to that of an S4 object that doesn't include an atomic type (so an S4SXP). On that note, is there some discussion on why you would choose this type of object over how S4 objects are implemented? Just for my curiosity.
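
The storage difference described above can be seen with base R's two date-time classes. This is a small sketch using only base R; POSIXlt stands in for a record type and POSIXct for the "vector plus attributes" model, and the example timestamp is arbitrary.

```r
lt <- as.POSIXlt("2021-02-24 12:00:00", tz = "UTC")
ct <- as.POSIXct(lt)

typeof(lt)             # "list": fields such as sec, min, hour live in a list
typeof(ct)             # "double": one atomic vector plus attributes
length(lt)             # 1 record
length(unclass(lt))    # several fields (exact count depends on the R version)
names(attributes(ct))  # metadata is carried as attributes on the vector
```

Code that indexes or allocates columns by their underlying storage sees one element per field for lt, but one element per row for ct.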

The long and short of it is that data.table is designed to handle atomic vectors and it heavily relies on how those are implemented. Handling more complex data structures requires careful thought and some extra work. A few months ago, I had actually started tinkering with something along the lines of what I outlined in the linked discussion, since it's been frequently requested. It's certainly non-trivial, and there are many places where you'd need to take care of these complexities.

@eddelbuettel Your nanodurations are supported because you store them as a single vector/a simple S4 object with only one slot so there's really nothing to support (outside of int64, which is supported but a separate concern). So they are really different from these record types.

@eddelbuettel
Contributor

@tlapak I was mostly just illustrating that storing a (high-precision) duration already works and does not require reengineering data.table. No more, no less. Similarly, the other motivating example of a date is a little underwhelming. But hey, we all have our windmills to fight. Thanks for the pointer to #4415.

@tlapak
Contributor

tlapak commented Feb 24, 2021

@eddelbuettel ah I see. Yes that makes sense. I should always just assume that you know what you're talking about anyway. I've kind of given up on trying to tell people that they should rethink what they're trying to do. At the end of the day I think there's some importance in what people want (for whatever reason) and what that means for the popularity of a package.

tlapak added the "non-atomic column e.g. list columns, S4 vector columns" label Feb 25, 2021