Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

s4 objects in j groups are wrong #4415

Open
MichaelChirico opened this issue May 3, 2020 · 3 comments
Open

s4 objects in j groups are wrong #4415

MichaelChirico opened this issue May 3, 2020 · 3 comments
Labels
bug non-atomic column e.g. list columns, S4 vector columns

Comments

@MichaelChirico
Copy link
Member

Another follow-up of #4131

j doesn't handle S4 objects properly:

library(data.table)
library(lubridate)
dt <- data.table(period_unit = c("days", "months", "weeks", "years"),
                 period_length = rep(1, times = 4))

dt[ , by = period_unit, do.call(.BY$period_unit, list(period_length))]
#    period_unit          V1
# 1:        days 1d 0H 0M 0S
# 2:      months 1d 0H 0M 0S
# 3:       weeks 1d 0H 0M 0S
# 4:       years 1d 0H 0M 0S

The days object has been cascaded to the other groups.

I think this is a duplicate of #1777, but I'm not sure; if so, I would close that one since I wrote this to be very direct & clear about the issue.

@tlapak
Copy link
Contributor

tlapak commented May 3, 2020

Writing a somewhat lengthy comment here because I've been wanting somewhere to put what I've read up on and this seems to be a useful spot.

I don't think this is a bug so much as a fundamental limitation of data.table. The only fully supported column types in data.tables are atomic vectors with scalar attributes1 with some limited support for list columns tacked on.

Anything else you can (mostly2) stick in in one piece and get it out again in one piece but you can't do anything else safely.

The last answer in #1777 basically explains what the issue is. It's not strictly an S4 thing but that's where it mostly crops up. Internally, simple S4 objects are stored as a vector for the first slot with all further slots stored as attributes of that vector.3 Now, data.table always operates on the underlying vector and assumes that all attributes are scalar.

Your example actually somewhat obscures the underlying issue. Replace the rep bit with 1:4 to get the exact same output. Also, the V1 column is actually malformed:

str(dt[ , by = period_unit, do.call(.BY$period_unit, list(period_length))][, V1])
# Formal class 'Period' [package "lubridate"] with 6 slots
#   ..@ .Data : num [1:4] 0 0 0 0
#   ..@ year  : num 0
#   ..@ month : num 0
#   ..@ day   : num 1
#   ..@ hour  : num 0
#   ..@ minute: num 0

All of the slots should be length four like the first one. (NB: .Data == second)

The following are all related (in no particular order and I don't make any claims that the list is exhaustive.): #2273, #4315, #4217, #2948, #1777, #3388.

The last one on that list demonstrates that the issues aren't technically S4 specific.

I've been having some idle thoughts on how one might tackle this but little time to actually work on it. In the meantime, a more prominent mention of the restrictions might be useful to have in the docs.

1 By scalar attribute I mean an attribute that is independent of the contents of the vector, such as a class attribute.
2 Things like Rle columns may yield at least unexpected results.
3 This may be considered an implemention detail but it's mentioned in the documentation and exposed at R level. It also seems to be pretty heavily baked into the code.

@MichaelChirico
Copy link
Member Author

Thanks, that's a great summary.

A big issue with fixing this is not just that the attributes are dynamic, but that it's hard to tell in general how to apply that for a given S4 class.

Can we rely on a c method, then call new on each row? and c(...) the result? Seems inefficient.

Is there a standardized way to tell whether the class accepts vector input to new?

@tlapak
Copy link
Contributor

tlapak commented May 5, 2020

Apologies for the second wall of text...

Is there a standardized way to tell whether the class accepts vector input to new?

I don't think there is any reliable way. And in fact, such restrictions might not be properly enforced. The closest you could get would probably be attempting to parse c but well... There are basically no restrictions on what one can do.

You wouldn't need to call new. Falling back on c here to assemble the column would be enough. But as you said, it's not exactly efficient.

I would say that objects that don't implement c and [ have no business being put in a data.table but even that doesn't give you much. You'd have no way of knowing when you'd need to fall back on them unless you just do it for all S4 objects. I see a few issues with that approach:

  1. It throws out the speed and memory efficiency.
  2. A lot of stuff around bioconductor goes to great lengths to implement and hide sparse representations. That might lead to unexpected results.
  3. The highly requested SF library is purely S3. But there are also other issues there.
  4. There's a lot more than just assembling or subsetting that users may come to expect.

I believe putting complicated data structures inside a data.table is fundamentally the wrong thing to do but a lot of people want to have it, so here we are.

I'll share my thoughts on how to provide native support for vector objects, just in case anyone's interested:

  1. Everything that's not explicitly white listed gets wrapped in list columns. (This is a broader thing I guess but would shore up edge cases.)
  2. Have a facility like cedta() where classes and their vector slots can be registered. Then you could either
    1. unpack and repack the objects when they go into/out of the data.table or
    2. pull out the additional vectors whenever you operate on them.

Either way, it would be a bit of an undertaking.

@tlapak tlapak added the non-atomic column e.g. list columns, S4 vector columns label Feb 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug non-atomic column e.g. list columns, S4 vector columns
Projects
None yet
Development

No branches or pull requests

2 participants