Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IDateTime could optionally track different time zones for different rows #3160

Closed
MichaelChirico opened this issue Nov 26, 2018 · 4 comments
Closed

Comments

@MichaelChirico
Copy link
Member

simple base R time objects (POSIXct) are ill-suited to handling regional datasets (tzone attribute is atomic); similarly, POSIXlt objects are overly broad (every component is a vector).

What's really needed is a structure that's just a little bit more general than that offered by IDateTime at the moment:

idate itime tz
17861 32415  1
17861 32415  2

Where I'm suppressing the print methods for each column to show the simple integer structure of each column.

idate and itime are as before, but the objects (optionally? NA by default?) gain a tz column stored as a factor whose labels name different OlsonNames() time zones, e.g.

tz = factor(c('Asia/Singapore', 'Asia/Jakarta'))

Related methods would then handle printing/formatting/etc appropriately accounting for the tz column.

Related: #2402, #1451

@MichaelChirico MichaelChirico changed the title IDateTIme could optionally track different time zones for different rows IDateTime could optionally track different time zones for different rows Nov 26, 2018
@jangorecki
Copy link
Member

Not sure if it make sense to focus on that. Complexity of handling different tz for different rows is quite significant, all operators would need to handle that. I found most reliable and straightforward to just dump all times to UTC at the beginning and apply tz at the end when producing report.

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Nov 26, 2018

So far I've been doing things by = tz; "producing a report" is exactly the sort of functionality I have in mind, but also, filtering rows based on local time (x[hour_local == 9L], e.g.).

Storage will remain UTC-based integer representation throughout, only functions like wday and format would be affected.

Functionality I have in in mind handles this logic internally to be less of a fuss. I'll add Low label since it's a nice-to-have.

@mattdowle
Copy link
Member

mattdowle commented Jul 14, 2020

If there are mixed timezones in an input dataset, it's best practice to convert everything to consistent UTC up front. Leaving them mixed in the same column is going to lead to bugs really quickly, slow it down even if it works, and at best result in a new class that can't be passed to other packages.
If the use case is really to do something like x[hour_local == 9L] then a column can be added for that by the user using plain integer, or ITime or similar. The tz column can be added by users in their dataset just as you showed, too. I don't think it needs to be built into data.table, or would be advantageous at all to build it into data.table. I have seen datasets where the local datetime and the UTC datetime are two separate columns; both are in UTC but the column name of the local time one (e.g. timestamp_local) is clear that it really represents local time. That way queries are a bit clearer because you can see what the timestamp is from the column name without needing to look at its type.
If you agree, can we close this issue?

@MichaelChirico
Copy link
Member Author

I agree that it's a bit niche and hard to maintain. I do think it's quite useful e.g. for building regression models, but I guess the main potential advantage -- skipping over POSIXlt conversion on a huge dataset -- wouldn't really be brought by the proposed change, at least without substantial effort & underlying code.

It's worth mentioning that both PostgreSQL and Presto support the TIMESTAMP WITH TIMEZONE column type, which can have timestamps vary by row, which I think provides good evidence of its utility. I believe the implementation is similar to R, where the core storage type is related to UTC, and the time zone really only comes into play for formatting & extractions.

I'll close this with a note that it would be useful for a different add-on package to support such a thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants