Discussion - data.table and record types #4910
Comments
FWIW durations are already supported:
> library(data.table)
data.table 1.13.6 using 6 threads (see ?getDTthreads). Latest news: r-datatable.com
> library(nanotime)
> nd <- nanoduration(hours=0, minutes=0, seconds=0, nanoseconds=c(1, 1e3, 1e6))
> nd
[1] 00:00:00.000_000_001 00:00:00.000_001 00:00:00.001
> data.table(nd=nd)
nd
1: 00:00:00.000_000_001
2: 00:00:00.000_001
3: 00:00:00.001
Can I refer you to the discussion in #4415? As I understand it, the storage model is only slightly different for these record types: everything is wrapped in a list, as opposed to a vector plus its attributes, which in turn are a pairlist. So the handling of record types would be almost identical to that of an S4 object that doesn't include an atomic type (so an S4SXP). On that note, is there some discussion on why you would choose this type of object over how S4 objects are implemented? Just for my curiosity.
The long and short of it is that data.table is designed to handle atomic vectors and relies heavily on how those are implemented. Handling more complex data structures requires careful thought and some extra work. A few months ago, I actually started tinkering with something along the lines of what I outlined in the linked discussion, since it's been frequently requested. It's certainly non-trivial, and there are many places where you'd need to take care of these complexities.
@eddelbuettel Your nanodurations are supported because you store them as a single vector/a simple S4 object with only one slot, so there's really nothing to support (outside of int64, which is supported but a separate concern). So they are really different from these record types.
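To make the storage difference concrete, here is a small base R sketch comparing POSIXct (an atomic vector plus attributes) with POSIXlt (a classed list of fields). The variable names are illustrative only, and the exact set of POSIXlt fields depends on the R version, so the sketch does not assume a specific count.

```r
# POSIXct: one atomic (double) vector carrying class and tzone attributes
t_ct <- as.POSIXct(c("2021-02-15 10:00:00", "2021-02-16 11:30:00"), tz = "UTC")

# POSIXlt: a classed list with one vector per field (sec, min, hour, ...)
t_lt <- as.POSIXlt(t_ct)

typeof(t_ct)            # "double"
typeof(t_lt)            # "list"

# length() reports the number of date-times, not the number of fields
length(t_lt)            # 2
names(unclass(t_lt))    # field names; the exact set varies by R version
```

Code that walks a column as a contiguous atomic vector works for the first object but not for the second, which is roughly why the record case needs dedicated handling.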
@tlapak I was mostly just illustrating that storing a (high-precision) duration already works and does not require reengineering data.table. No more, no less. Similarly, the other motivating example of a date is a little underwhelming. But hey, we all have our windmills to fight. Thanks for the pointer to #4415.
@eddelbuettel Ah, I see. Yes, that makes sense. I should always just assume that you know what you're talking about anyway. I've kind of given up on trying to tell people that they should rethink what they're trying to do. At the end of the day, I think there's some importance in what people want (for whatever reason) and what that means for the popularity of a package.
Hi data.table team!
I would like to start a discussion regarding a feature request of allowing record types as columns of a data.table. If you aren't familiar with the term, we define a record type as a classed list of equal length vectors, where the length() of the object is the length of the vectors, not the length of the list.
As of now, these aren't particularly common in R, but there is one example in base R, POSIXlt. I'm aware of the fact that POSIXlt is converted to POSIXct upon entry into a data.table, and I understand the reasons why you all do this. However, if you look beyond POSIXlt, I think that record types can be a powerful way to convey a lot of meaning into a single vector.
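As a rough illustration of that definition, the following base R sketch builds a tiny record type by hand. The class name "ymd", the constructor new_ymd(), and the methods are hypothetical and exist only for this example; they are not taken from clock, vctrs, or data.table.

```r
# Hypothetical record class "ymd": a classed list of equal-length vectors
new_ymd <- function(year, month, day) {
  stopifnot(length(year) == length(month), length(month) == length(day))
  structure(list(year = year, month = month, day = day), class = "ymd")
}

# length() is the length of the field vectors, not of the underlying list
length.ymd <- function(x) length(unclass(x)$year)

# Subsetting slices every field in parallel and restores the class
`[.ymd` <- function(x, i) {
  structure(lapply(unclass(x), `[`, i), class = "ymd")
}

# One formatted string per record
format.ymd <- function(x, ...) {
  f <- unclass(x)
  sprintf("%04d-%02d-%02d", f$year, f$month, f$day)
}

d <- new_ymd(c(2020L, 2021L), c(1L, 12L), c(15L, 31L))
length(d)            # 2 records
length(unclass(d))   # 3 fields
format(d[2])         # "2021-12-31"
```

The point of the sketch is that every operation (length, slicing, formatting) has to treat the fields in lockstep, which is the extra work a container such as data.table would have to take on.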
As an example, I've developed a new package called clock that makes heavy use of these record types. But of course, they don't work as columns of a data.table:
clock builds on the vctrs_rcrd type from the vctrs package. That type provides a lot of S3 method scaffolding to make it easier to create new record types on top of it. Because it is now much more straightforward to construct a record type in R, I feel that more might start appearing in the wild over the next few years.
I realize that this would probably be a lot of work. In the tidyverse, it was much easier to add support for these types once we added support for columns of a data frame that are also data frames (df-cols, for short). Record types can be thought of in a similar way, and often use the same underlying code when ordering, slicing, or comparing instances of them.
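For readers who have not seen vctrs_rcrd before, here is a minimal sketch of a record type built with vctrs::new_rcrd(). The class name "example_rational" and the helper new_rational() are made up for this illustration (they are not part of clock), and the sketch assumes the vctrs package is installed.

```r
library(vctrs)

# Hypothetical record type with two integer fields, n and d
new_rational <- function(n = integer(), d = integer()) {
  new_rcrd(list(n = n, d = d), class = "example_rational")
}

# A format() method is the main piece a new rcrd class supplies itself
format.example_rational <- function(x, ...) {
  paste0(field(x, "n"), "/", field(x, "d"))
}

x <- new_rational(1:3, 2:4)

length(x)        # 3: the number of records, not the number of fields
fields(x)        # "n" "d"
format(x[2:3])   # "2/3" "3/4"
```

Slicing and length come from the vctrs_rcrd scaffolding here, so the class author only writes the constructor and the format method.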
If you do think that this is worth pursuing, I am happy to discuss this further!