Discussion - data.table and record types #4910

DavisVaughan · 2021-02-18T20:26:35Z

Hi data.table team!

I would like to start a discussion about a feature request: allowing record types as columns of a data.table. If you aren't familiar with the term, we define a record type as a classed list of equal-length vectors, where the length() of the object is the length of the vectors, not the length of the list.

As of now, these aren't particularly common in R, but there is one example in base R, POSIXlt. I'm aware of the fact that POSIXlt is converted to POSIXct upon entry into a data.table, and I understand the reasons why you all do this. However, if you look beyond POSIXlt, I think that record types can be a powerful way to convey a lot of meaning into a single vector.
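To make the definition concrete, here is a minimal base R sketch using POSIXlt. The variable names are only for illustration, and the exact set of fields depends on the R version, so no field count is assumed:

```r
# POSIXlt is a classed list of fields (sec, min, hour, ...)
lt <- as.POSIXlt(c("2019-01-01", "2019-02-01", "2019-03-01"), tz = "UTC")

# The object reports the number of date-times, not the number of fields
length(lt)

# The underlying storage is a list with one element per field
fields <- unclass(lt)
is.list(fields)
names(fields)

# The mon field holds one entry per date-time (months are stored zero-based)
fields$mon
```

The point is only that length(lt) and length(unclass(lt)) disagree, which is what distinguishes a record type from an ordinary list column.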

As an example, I've developed a new package called clock that makes heavy use of these record types. But of course, they don't work as columns of a data.table:

library(clock)
library(data.table)

x <- year_month_day(2019, 1:3)
x
#> <year_month_day<month>[3]>
#> [1] "2019-01" "2019-02" "2019-03"
unclass(x)
#> $year
#> [1] 2019 2019 2019
#> 
#> $month
#> [1] 1 2 3
#> 
#> attr(,"precision")
#> [1] 2

y <- duration_milliseconds(c(1e9, 10))
y
#> <duration<millisecond>[2]>
#> [1] 1000000000 10
unclass(y)
#> $ticks
#> [1] 11  0
#> 
#> $ticks_of_day
#> [1] 49600     0
#> 
#> $ticks_of_second
#> [1]  0 10
#> 
#> attr(,"precision")
#> [1] 8


data.table(x = x)
#> Error in dimnames(x) <- dn: length of 'dimnames' [1] not equal to array extent
data.table(y = y)
#> Error in dimnames(x) <- dn: length of 'dimnames' [1] not equal to array extent

data.frame(x = x)
#>         x
#> 1 2019-01
#> 2 2019-02
#> 3 2019-03
data.frame(y = y)
#>            y
#> 1 1000000000
#> 2         10

clock builds on the vctrs_rcrd type from the vctrs package. That type provides a lot of S3 method scaffolding to make it easier to create new record types on top of it. Because it is now much more straightforward to construct a record type in R, I feel that more might start appearing in the wild over the next few years.

I realize that this would probably be a lot of work. In the tidyverse, it was much easier to add support for these types once we added support for columns of a data frame that are also data frames (df-cols, for short). Record types can be thought of in a similar way, and often use the same underlying code when ordering, slicing, or comparing instances of them.
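As a rough illustration of the df-col idea, here is a small base R sketch. The column names are made up, and this only shows the behaviour of base data.frame, not how the tidyverse or data.table implement anything:

```r
# A data frame whose second column is itself a data frame (a "df-col")
df <- data.frame(id = 1:3)
df$inner <- data.frame(a = c(10, 20, 30), b = c("x", "y", "z"))

ncol(df)   # the df-col counts as a single column
nrow(df)

# Slicing rows of the outer data frame slices every field of the inner one
sub <- df[2:3, ]
sub$inner$a
```

A record type behaves much like inner here: several parallel fields that have to be sliced, ordered, and compared together as one column.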

If you do think that this is worth pursuing, I am happy to discuss this further!

eddelbuettel · 2021-02-19T02:39:47Z

FWIW durations are already supported:

> library(data.table)
data.table 1.13.6 using 6 threads (see ?getDTthreads).  Latest news: r-datatable.com
> library(nanotime)
> nd <- nanoduration(hours=0, minutes=0, seconds=0, nanoseconds=c(1, 1e3, 1e6))
> nd
[1] 00:00:00.000_000_001 00:00:00.000_001     00:00:00.001        
> data.table(nd=nd)
                     nd
1: 00:00:00.000_000_001
2:     00:00:00.000_001
3:         00:00:00.001
>

tlapak · 2021-02-24T12:16:41Z

Can I refer you to the discussion in #4415? As I understand it, the storage model is only slightly different for these record types: everything is wrapped in a list, as opposed to a vector plus its attributes, which in turn are stored as a pairlist. So the handling of record types would be almost identical to that of an S4 object that doesn't include an atomic type (so an S4SXP). On that note, is there some discussion on why you would choose this type of object over how S4 objects are implemented? Just out of curiosity.

The long and short of it is that data.table is designed to handle atomic vectors and it relies heavily on how those are implemented. Handling more complex data structures requires careful thought and some extra work. A few months ago, I had actually started tinkering with something along the lines of what I outlined in the linked discussion, since it's been frequently requested. It's certainly non-trivial and there are many places where you'd need to take care of these complexities.

@eddelbuettel Your nanodurations are supported because you store them as a single vector/a simple S4 object with only one slot so there's really nothing to support (outside of int64, which is supported but a separate concern). So they are really different from these record types.
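A rough base R sketch of that difference, using difftime as a stand-in for a single-vector duration (that substitution is an assumption for illustration; nanoduration itself is an S4 class and is not used here):

```r
# A single atomic vector carrying attributes: one value per element
d <- as.difftime(c(1, 1000), units = "secs")
is.atomic(unclass(d))
length(unclass(d))    # same as length(d)

# A record type: a list of fields, several values per element
lt <- as.POSIXlt(c("2019-01-01", "2019-02-01"), tz = "UTC")
is.list(unclass(lt))
length(lt)            # number of date-times
length(unclass(lt))   # number of fields, which is a different number
```

Code that assumes a column's underlying storage has one entry per row works for the first object but not for the second.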

eddelbuettel · 2021-02-24T13:08:27Z

@tlapak I was mostly just illustrating that storing a (high-precision) duration already works and does not require reengineering data.table. No more, no less. Similarly the other motivating example of a date is a little underwhelming. But hey we all have our windmills to fight. Thanks for the pointer to #4415.

tlapak · 2021-02-24T13:28:23Z

@eddelbuettel ah I see. Yes that makes sense. I should always just assume that you know what you're talking about anyway. I've kind of given up on trying to tell people that they should rethink what they're trying to do. At the end of the day I think there's some importance in what people want (for whatever reason) and what that means for the popularity of a package.

DavisVaughan mentioned this issue Feb 18, 2021

Compatibility with data.table (see vignette) r-lib/clock#154

Closed

tlapak added the non-atomic column e.g. list columns, S4 vector columns label Feb 25, 2021

DavisVaughan mentioned this issue Feb 17, 2022

vec_size with proxy r-lib/vctrs#1539

Closed
