
Add support for native parsing of iso8601 dates/timestamps in fread #4464

Merged
merged 28 commits on Jul 14, 2020

Conversation

MichaelChirico
Member

@MichaelChirico MichaelChirico commented May 20, 2020

Very common use case for me -- fread string timestamps in ISO8601 format like 2020-05-01T03:14:18.343Z.
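
For instance, a sketch of the intended end result (illustrative input, not a test from the PR):

library(data.table)
dt = fread("id,ts\n1,2020-05-01T03:14:18.343Z")
class(dt$ts)          # "POSIXct" "POSIXt" -- parsed natively, no as.POSIXct round trip
attr(dt$ts, "tzone")  # "UTC"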

To do so, treat the parsing as a series of subparsing jobs -- StrtoI32 for year, month, day, hour, minute, timezone hour offset, and timezone minute offset; then parse_double_regular for seconds. Hence the "core" of those functions is split off into _core functions, leaving the originals as wrappers that feed the FieldParseContext correctly.

Somewhat related in intent to #1656 (smoothing/speeding up date/time I/O), but that issue would make round trips even faster (no conversion to character on output, and "simple" numeric parsing on input).

TODO:

  • Also parse timezone offsets like +07:00 and +0630
  • Return class'd object, not just underlying value
  • Augment colClasses API to support type overrides from user
  • Add NEWS, documentation
  • tests
  • benchmark
  • Add option to restore old behaviour for those who upgraded and need a quick fix without downgrading
  • fread("a,b\n2015-01-01,2015-01-01", colClasses="POSIXct") reads as character because the direct POSIXct parser requires a time to be present (use verbose=TRUE to see the bumps from type A to B to C). Relax the direct POSIXct parser to allow date-only.
  • I don't think test 2150.15 should be passing, but it is because something in our test environment (test.data.table()?) appears to be setting the locale timezone to UTC (more investigation needed). Done (it was prior test 2124): e89d159
  • The real potential break with previous versions of data.table is that for those using colClasses="POSIXct" or as.POSIXct() after the fread call, datetime values are now interpreted as UTC, not the local timezone as as.POSIXct does by default. The news item focuses on the tzone attribute of the resultant POSIXct column, but for those usages there will be a silent shift in interpretation; i.e. users might be pleased after upgrading that they had POSIXct before and still have POSIXct (just much faster) with no code changes, but the datetimes will have shifted. For those using fasttime after their fread calls, there will be no shift. In short, we need to convey that the interpretation of datetime values will change for those using colClasses="POSIXct". Perhaps a warning for calls to fread which have colClasses=POSIXct, or a way to create a warning so those calls can be controlled. Or, extend the parser to interpret the datetime in local timezone, so that there is no break from previous versions of data.table, nor base R defaults. Sadly, I know that's hard. fasttime doesn't do local time for example, which may be why it was never promoted to R. Update: UTC-marked datetimes as written by fwrite are now read as UTC by default, and unmarked datetimes still get read by as.POSIXct in local time (see the sketch after this list).
  • Last 3 test failures all relate to date-only values being read as POSIXct by colClasses="POSIXct"
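
A minimal sketch of the final behaviour per the update above (assuming a non-UTC session timezone; comments are illustrative):

library(data.table)
# UTC-marked ISO8601 value: auto-detected and parsed natively as UTC
attr(fread("a\n2020-01-01T01:02:03Z")$a, "tzone")          # "UTC"
# unmarked datetime with colClasses="POSIXct": still routed through
# as.POSIXct(), i.e. interpreted in local time as before
fread("a\n2020-01-01 01:02:03", colClasses="POSIXct")$a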

@codecov

codecov bot commented May 20, 2020

Codecov Report

Merging #4464 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff            @@
##           master    #4464    +/-   ##
========================================
  Coverage   99.61%   99.61%            
========================================
  Files          73       73            
  Lines       14119    14228   +109     
========================================
+ Hits        14064    14173   +109     
  Misses         55       55            
Impacted Files Coverage Δ
R/fread.R 100.00% <100.00%> (ø)
R/test.data.table.R 100.00% <100.00%> (ø)
src/fread.c 99.54% <100.00%> (+0.02%) ⬆️
src/freadR.c 100.00% <100.00%> (ø)
src/init.c 100.00% <100.00%> (ø)


Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jangorecki
Member

jangorecki commented May 20, 2020

This is very related to #1656 -- the inverse of it?
And is that ISO number the same as "unix epoch" time?

@MichaelChirico
Member Author

You're right this doesn't close #1656 per se, it's just very related.

#1656 wants fwrite to use the underlying numeric representation of POSIXct columns, and fread to offer a way for users to signal "this numeric column, apply POSIXct at the end". It's notable, I think, that this one basically cannot be auto-detected (unless we want to turn into Excel and produce a ton of false positives 😃).

This PR offers a different way of producing POSIXct columns from fread via string parsing.

Would need to think a bit more about how to implement both at the same time...

@MichaelChirico
Member Author

The ISO part refers to the string formatting:

https://en.wikipedia.org/wiki/ISO_8601

IINM the relation to unixtime (=epoch time) is an implementation detail.

Same in intent as Python's datetime.fromisoformat or Presto's from_iso8601_timestamp/from_iso8601_date. Spark SQL's timestamp() CAST wrapper also recognizes this format.
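
For comparison, the base-R route this replaces can be spelled with a strptime-style format (the same format string used in the benchmarks below; %OS captures fractional seconds):

as.POSIXct("2020-05-01T03:14:18.343Z", format = "%FT%H:%M:%OSZ", tz = "UTC")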

inst/tests/tests.Rraw (outdated diff)
@@ -10771,9 +10773,9 @@ test(1743.241, fread("a,b,c\n2,2,f", colClasses = list(character="c", integer="b
test(1743.242, fread("a,b,c\n2,2,f", colClasses = c("integer", "integer", "factor"), drop="a"), data.table(b=2L, c=factor("f")))

## POSIXct
-test(1743.25, fread("a,b,c\n2015-06-01 11:00:00,1,ae", colClasses=c("POSIXct","integer","character")), data.table(a=as.POSIXct("2015-06-01 11:00:00"),b=1L,c="ae"))
+test(1743.25, fread("a,b,c\n2015-06-01 11:00:00,1,ae", colClasses=c("POSIXct","integer","character")), data.table(a=as.POSIXct("2015-06-01 11:00:00", tz='UTC'),b=1L,c="ae"))
Member Author


three more tests broken because I forced tz='UTC' in the implementation.

this is making me reconsider that -- should we default to not setting 'tzone'?

Member


Isn't this going to be a breaking change? Maybe we could relax that, keep the old behavior for the coming release, and change it in the next release?

Member Author


Yes, I agree. I'm not even 100% sure we should change the behavior -- if so, we need to be very explicit that it's a break with base.

@MichaelChirico
Member Author

MichaelChirico commented May 21, 2020

Not really sure what to do for testing the colClasses API come to think of it. Drawing a blank...

@MichaelChirico
Member Author

MichaelChirico commented May 21, 2020

Here's a benchmark:

library(data.table)
library(fasttime)
fastIDate = function(x) as.IDate(fastPOSIXct(x))

# comparison to fasttime:
#  (1) truncates down (milliseconds not supported)
#  (2) doesn't support TZ offsets
#  (3) doesn't support times before 1970-01-01

NN = 1e7
rdate = format(.Date(sample(-20000:20000, NN, TRUE)))
rts = .POSIXct(runif(NN, -2*1590029545, 2*1590029545), tz = 'UTC')
rtz = sample(OlsonNames(), NN, TRUE)
rtime_iso8601 = format(rts, format = '%FT%H:%M:%OSZ')
rtime_utc = format(rts, format = '%FT%T%z')

DT = data.table(rdate, rts, rtime_iso8601, rtime_utc, rtz)
DT[ , by = rtz, rtime_mixed := format(rts, format = "%FT%T%z", tz = .BY$rtz)]

f = tempfile()

fwrite(DT[ , 'rdate'], f)
microbenchmark::microbenchmark(
  times = 10L,
  fromChar = as.IDate(rdate),
  fasttime = fastIDate(rdate),
  fread = fread(f)$rdate
)
# Unit: milliseconds
#      expr                       min                        lq                      mean
#  fromChar 35882.7227990000028512441 39291.1388519999964046292 42673.7721455999999307096
#  fasttime  2081.4537019999997937703  2324.4893710000001192384  4764.3731585999994422309
#     fread   374.9090039999999817155   393.2138810000000148648   609.2863710000000310174
#                     median                        uq                      max neval
#  43118.8744290000031469390 44802.1115659999995841645 50509.715732999997271691    10
#   4846.0090810000001511071  5695.4341949999998178100  9010.839712999999392196    10
#    409.9370655000000169821   848.5971970000000510481  1055.998761000000058630    10

microbenchmark::microbenchmark(
  times = 10L,
  fromChar = as.IDate(rdate),
  fasttime = fastIDate(rdate),
  fwrite_fread = {fwrite(data.table(rdate), f); fread(f)$rdate}
)
# Unit: milliseconds
#          expr                       min                        lq                      mean
#      fromChar 35845.2031050000005052425 39328.8093050000024959445 39655.5725804999965475872
#      fasttime  2132.8088490000000092550  4906.1557210000000850414  5918.7371296999999685795
#  fwrite_fread   879.3674630000000433938   914.6356719999999995707   984.0583596999999826949
#                     median                       uq                      max neval
#  39571.6334754999988945201 40353.395442000000912230 42844.634253999996872153    10
#   6160.8297330000004876638  6443.078268999999636435 12707.056249999999636202    10
#    943.1287370000000009895  1004.976440000000025066  1227.542063000000098327    10

fwrite(DT[ , 'rtime_iso8601'], f)
microbenchmark::microbenchmark(
  times = 10L,
  fromChar = as.POSIXct(rtime_iso8601, format = '%FT%H:%M:%OSZ', tz = 'UTC'),
  fasttime = fastPOSIXct(rtime_utc, tz = 'UTC'),
  fread = fread(f)$rtime_iso8601
)
# Unit: milliseconds
#      expr                       min                        lq                      mean
#  fromChar 53267.5157200000030570664 54396.0746129999970435165 57902.1551069999986793846
#  fasttime   374.4050359999999955107   407.7941289999999980864   438.6393107000000100015
#     fread   827.6846910000000434593   853.0744369999999889842  1561.1232625000000098225
#                     median                        uq                       max neval
#  57430.6017244999966351315 58453.2611570000008214265 69509.8057030000054510310    10
#    459.2325749999999970896   462.6702389999999809334   483.3714049999999815554    10
#    875.0970604999999977736  1743.7432160000000749278  5516.9704620000002250890    10

fwrite(DT[ , 'rtime_utc'], f)
microbenchmark::microbenchmark(
  times = 10L,
  fromChar = as.POSIXct(rtime_utc, format = '%FT%T%z', tz = 'UTC'),
  fasttime = fastPOSIXct(rtime_utc, tz = 'UTC'),
  fread = fread(f)$rtime_utc
)
# Unit: milliseconds
#      expr                      min                       lq                    mean
#  fromChar 6461.8654249999999592546 6520.2935799999995651888 8031.751119100000323670
#  fasttime  400.2528889999999819338  424.2101149999999734064 1599.063912200000004304
#     fread  887.4787450000000035288  895.9816120000000410073 1001.675952400000028319
#                    median                        uq                      max neval
#  6779.6675299999997150735 10312.0697949999994307291 11352.484152000000904081    10
#   459.7312014999999973952  3933.7474739999997837003  4715.532877999999982421    10
#   942.4043510000000196669   966.9638159999999516003  1627.955359000000044034    10

parse_mixed_time = function(x) {
  # parse the first 19 chars ("%FT%T") as UTC, then apply the "+hhmm" offset manually
  utc = as.POSIXct(substr(x, 1L, 19L), format = '%FT%T', tz = 'UTC')
  # the sign at position 20 applies to both the hour and minute parts
  offset_sign = ifelse(substr(x, 20L, 20L) == '-', -1, 1)
  hour_offset = as.numeric(substr(x, 21L, 22L))
  minute_offset = as.numeric(substr(x, 23L, 24L))
  utc - offset_sign * (3600*hour_offset + 60*minute_offset)
}
fasttimeMixed = function(x) {
  utc = fastPOSIXct(substr(x, 1L, 19L), tz = 'UTC')
  offset_sign = ifelse(substr(x, 20L, 20L) == '-', -1, 1)
  hour_offset = as.numeric(substr(x, 21L, 22L))
  minute_offset = as.numeric(substr(x, 23L, 24L))
  utc - offset_sign * (3600*hour_offset + 60*minute_offset)
}
fwrite(DT[ , 'rtime_mixed'], f)
microbenchmark::microbenchmark(
  times = 10L,
  fromChar = parse_mixed_time(rtime_utc),
  fasttime = fasttimeMixed(rtime_utc),
  fread = fread(f)$rtime_mixed
)
# Unit: milliseconds
#      expr                       min                       lq                      mean
#  fromChar 15846.6480219999994005775 18350.009042000001500128 20452.6268201999992015772
#  fasttime  7104.5645469999999477295  7936.138523999999961234 11390.1426123999990522861
#     fread   301.6112739999999803331   326.873304000000018732   584.5890289999999822612
#                     median                        uq                     max neval
#  19270.2323180000021238811 21710.5543460000008053612 26949.67992500000036671    10
#   9304.2516670000004523899 14844.4976189999997586710 19816.84063099999912083    10
#    539.3522219999999833817   699.3033580000000029031  1381.02652299999999741    10

# restriction to since 1970-01-01 [fasttime validity]
fwrite(DT[rdate >= .Date(0), 'rdate'], f)
microbenchmark::microbenchmark(
  times = 10L,
  fromChar = as.IDate(rdate),
  fasttime = fastIDate(rdate),
  fread = fread(f)$rdate
)
# Unit: milliseconds
#      expr                       min                        lq                     mean
#  fromChar 34985.4997430000003078021 40414.9324960000012652017 42401.060834800002339762
#  fasttime  2408.1196479999998700805  6147.3126640000000406872  8354.270583600000463775
#     fread   192.6099480000000028213   206.9921659999999974389  1135.318660500000078173
#                     median                        uq                      max neval
#  42407.7967409999982919544 45023.2967720000015106052 49840.856562000000849366    10
#   8293.6175894999996671686 11095.8809689999998226995 12965.062056999999185791    10
#    439.6589885000000208493   580.2351089999999658176  4295.531719999999950232    10

fwrite(DT[rtime_iso8601 >= '1970-01-01', 'rtime_iso8601'], f)
microbenchmark::microbenchmark(
  times = 10L,
  fromChar = as.POSIXct(rtime_iso8601, format = '%FT%H:%M:%OSZ', tz = 'UTC'),
  fasttime = fastPOSIXct(rtime_utc, tz = 'UTC'),
  fread = fread(f)$rtime_iso8601
)
# Unit: milliseconds
#      expr                       min                        lq                     mean
#  fromChar 48567.7034510000012232922 49745.8525749999971594661 52436.273207099999126513
#  fasttime   411.2232799999999883767   449.2160329999999817119  1047.987805499999922176
#     fread   412.2872570000000109758   427.8790230000000178734   636.538313300000027084
#                     median                        uq                       max neval
#  52135.1380439999993541278 53402.8323680000030435622 61943.1473170000026584603    10
#    457.8400750000000130058   493.2715469999999982065  3636.4618930000001455483    10
#    442.7219159999999646971   956.0641170000000101936   997.2641069999999672291    10

# restriction to single thread

setDTthreads(1)
fwrite(DT[rdate >= '1970-01-01', 'rdate'], f)
microbenchmark::microbenchmark(
  times = 10L,
  fromChar = as.IDate(rdate),
  fasttime = fastIDate(rdate),
  fread = fread(f)$rdate
)
# Unit: milliseconds
#      expr                       min                       lq                      mean
#  fromChar 15846.6480219999994005775 18350.009042000001500128 20452.6268201999992015772
#  fasttime  7104.5645469999999477295  7936.138523999999961234 11390.1426123999990522861
#     fread   301.6112739999999803331   326.873304000000018732   584.5890289999999822612
#                     median                        uq                     max neval
#  19270.2323180000021238811 21710.5543460000008053612 26949.67992500000036671    10
#   9304.2516670000004523899 14844.4976189999997586710 19816.84063099999912083    10
#    539.3522219999999833817   699.3033580000000029031  1381.02652299999999741    10

fwrite(DT[rtime_iso8601 >= '1970-01-01', 'rtime_iso8601'], f)
microbenchmark::microbenchmark(
  times = 10L,
  fromChar = as.POSIXct(rtime_iso8601, format = '%FT%H:%M:%OSZ', tz = 'UTC'),
  fasttime = fastPOSIXct(rdate, tz = 'UTC'),
  fread = fread(f)$rtime_iso8601
)
# Unit: milliseconds
#      expr                       min                        lq                      mean
#  fromChar 53143.4188999999969382770 55750.3270540000012260862 58249.7907611999980872497
#  fasttime   231.2867149999999867305   268.5390479999999797656   341.1082888999999909174
#     fread   418.6337930000000255859   443.7617250000000126420   587.1848684000000275773
#                     median                        uq                       max neval
#  58863.8435734999948181212 61210.0342630000013741665 62211.9991979999977047555    10
#    352.1465210000000070067   393.4802530000000047039   436.1191180000000144901    10
#    467.6080809999999701176   815.8930179999999836582  1022.0670430000000123982    10

@MichaelChirico
Member Author

MichaelChirico commented May 21, 2020

So in summary we blow base methods out of the water. fasttime is more competitive, but only for POSIXct output. I suspect the as.IDate step is the bottleneck there.

For POSIXct output, we do about the same when restricted to the range of valid fasttime inputs (1970-01-01 through 2199) -- fasttime "outperforms" on more general input, I assume because it just fails (returns NA) as soon as it detects a time outside that range. It also outperforms single-threaded fread.

fasttime is also not flexible enough to handle timezone offsets -- a wrapper that parses the offset & applies it slows down fasttime considerably.

@MichaelChirico
Member Author

Just realized readr offers its own native parsing (for some reason I thought it was just applying as.Date ex post). Here are side-by-side benchmarks vs readr:

library(data.table)
library(readr)

NN = 1e7
rdate = format(.Date(sample(-20000:20000, NN, TRUE)))
rts = .POSIXct(runif(NN, -2*1590029545, 2*1590029545), tz = 'UTC')
rtz = sample(OlsonNames(), NN, TRUE)
rtime_iso8601 = format(rts, format = '%FT%H:%M:%OSZ')
rtime_utc = format(rts, format = '%FT%T%z')

DT = data.table(rdate, rts, rtime_iso8601, rtime_utc, rtz)
DT[ , by = rtz, rtime_mixed := format(rts, format = "%FT%T%z", tz = .BY$rtz)]

f = tempfile()

fwrite(DT[ , 'rdate'], f)
microbenchmark::microbenchmark(
  times = 20L,
  readr = suppressMessages(read_csv(f, progress=FALSE)),
  fread = fread(f, showProgress=FALSE)
)
# Unit: milliseconds
#   expr       min        lq      mean    median        uq       max neval
#  readr 1386.4638 1438.7039 2034.6879 1545.2631 1655.4217 5036.6778    20
#  fread  106.1061  115.3648  142.5847  146.2171  158.1482  201.9532    20

fwrite(DT[ , 'rtime_iso8601'], f)
microbenchmark::microbenchmark(
  times = 20L,
  readr = suppressMessages(read_csv(f, progress=FALSE)),
  fread = fread(f, showProgress=FALSE)
)
# Unit: milliseconds
#   expr       min        lq      mean   median        uq      max neval
#  readr 2514.8418 2587.5590 3105.4354 2667.628 2982.5975 7262.127    20
#  fread  223.4602  228.2907  379.0271  235.442  259.4958 2483.842    20

fwrite(DT[ , 'rtime_utc'], f)
microbenchmark::microbenchmark(
  times = 20L,
  readr = suppressMessages(read_csv(f, progress=FALSE)),
  fread = fread(f, showProgress=FALSE)
)
# Unit: milliseconds
#   expr       min        lq      mean    median       uq       max neval
#  readr 3420.7929 3676.1125 4250.4507 3857.4463 4336.010 7842.4694    20
#  fread  248.9622  258.3923  295.5566  270.8553  313.008  434.5843    20

fwrite(DT[ , 'rtime_mixed'], f)
microbenchmark::microbenchmark(
  times = 20L,
  readr = suppressMessages(read_csv(f, progress=FALSE)),
  fread = fread(f, showProgress=FALSE)
)
# Unit: milliseconds
#   expr       min        lq      mean    median        uq       max neval
#  readr 3565.8941 3642.8500 4006.1577 3855.8094 4004.9712 7204.2305    20
#  fread  262.7315  275.2454  292.0794  280.8626  289.1907  380.9208    20

# restriction to single thread
setDTthreads(1)
fwrite(DT[ , 'rdate'], f)
microbenchmark::microbenchmark(
  times = 20L,
  readr = suppressMessages(read_csv(f, progress=FALSE)),
  fread = fread(f, showProgress=FALSE)
)
# Unit: milliseconds
#   expr      min        lq      mean    median        uq      max neval
#  readr 1372.249 1385.0602 1415.0192 1395.2615 1409.3363 1590.724    20
#  fread  378.426  382.4896  536.4007  385.2831  389.0673 3295.760    20

fwrite(DT[ , 'rtime_iso8601'], f)
microbenchmark::microbenchmark(
  times = 20L,
  readr = suppressMessages(read_csv(f, progress=FALSE)),
  fread = fread(f, showProgress=FALSE)
)
# Unit: milliseconds
#   expr       min        lq     mean    median       uq      max neval
#  readr 2562.9606 2597.7621 2837.967 2684.0372 3028.164 3725.742    20
#  fread  799.8739  813.6693 1336.382  907.6082 1123.258 5626.535    20

So "out of the box" (4 cores on my machine) this branch is about 10x faster. Single-threaded is still 3-5x faster.

@MichaelChirico MichaelChirico changed the title Add support for native parsing of iso8601 timestamps in fread Add support for native parsing of iso8601 dates/timestamps in fread May 22, 2020
@MichaelChirico
Member Author

@statquant suggests we read timestamp columns as nanotime instead. In principle this is just a matter of converting the units to nanoseconds by multiplying everything by 1e9, being careful to keep things in the right range.

This could in principle be coupled with the integer64 argument to fread -- return nanotime by default, or turn it off and return POSIXct if integer64='double'. Any thoughts?

@jangorecki
Copy link
Member

Any POSIXct fields in a csv should not be automatically turned into nanotime, but if a time field has higher precision than POSIXct offers, then it makes sense to automatically turn it into nanotime -- the same way we do for numeric/integer64. And using the existing argument to control that behavior looks fine to me.

@MichaelChirico
Member Author

It's the opposite right?

For other fields, we try int -> int64 -> float; POSIXct doesn't really have a possibility for int (unless we try to map ITime first? I'd say not in scope as of now).

nanotime has range of -2^63 nanoseconds --> 2^63 nanoseconds:

.POSIXct(c(-1,1)*9223372036854775807/1e9, tz="UTC")
# [1] "1677-09-21 00:12:44.145223 UTC" "2262-04-11 23:47:16.854776 UTC"

i.e., we can represent nanoseconds exactly in that range -- if we encounter timestamps outside that range, then we bump to double.
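
A sketch of that mapping (the helper and its fallback rule are illustrative, not part of this PR):

library(bit64)     # integer64: nanotime's underlying storage
library(nanotime)
as_nano_or_posixct = function(secs) {
  lim = 9223372036854775807 / 1e9          # +/- 2^63 ns, expressed in seconds
  if (all(abs(secs) < lim, na.rm = TRUE))
    nanotime(.POSIXct(secs, tz = "UTC"))   # within nanotime's representable range
  else
    .POSIXct(secs, tz = "UTC")             # outside the range: bump to double
}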

@statquant

What I was suggesting was that columns with precision up to milliseconds would be converted to POSIXct, and sub-millisecond to nanotime. Maybe an option could specify the behavior, {"auto", "POSIX", "nanotime"}, where the last two would always convert to POSIXct or nanotime respectively.

@mattdowle mattdowle added this to the 1.12.9 milestone Jun 21, 2020
@mattdowle
Member

Thanks, @eddelbuettel. Yes that's the way I was leaning already and why I had added to the top comment: "Or, extend the parser to interpret the datetime in local timezone, so that there is no break from previous versions of data.table, nor base R defaults. Sadly, I know that's hard. fasttime doesn't do local time for example, which may be why it was never promoted to R."

@mattdowle
Member

Regarding this, which I put in the top comment:

I don't think test 2150.15 should be passing, but it is because something in our test environment (test.data.table()?) appears to be setting the locale timezone to UTC (more investigation needed).

It seems to be test 2124 and it's down to a material difference between TZ environment variable being unset vs empty, at least on my Linux box.

Test 2124 does this:

oldtz=Sys.getenv('TZ')
Sys.setenv(TZ='Asia/Jakarta') # UTC+7
...
Sys.setenv(TZ=oldtz)

and observe the following:

$ R --vanilla
Sys.getenv('TZ',unset=NA)
# [1] NA
Sys.getenv('TZ')
# [1] ""
as.POSIXct("2010-05-15 01:02:03")
# [1] "2010-05-15 01:02:03 MDT"
Sys.setenv(TZ="")
as.POSIXct("2010-05-15 01:02:03")
# [1] "2010-05-15 01:02:03 UTC"
Sys.unsetenv("TZ")
as.POSIXct("2010-05-15 01:02:03")
# [1] "2010-05-15 01:02:03 MDT"

So if TZ was unset before that test, TZ will be set to "" after the test. And TZ="" appears to mean UTC, not local.
So that explains why 2150.15 is passing when it shouldn't: it's running under UTC due to the prior test setting TZ to "".
If a user runs test.data.table() when their TZ is unset, it'll be set to "" afterwards, so that needs fixing anyway so that test.data.table() doesn't impact the user's environment.
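
One way to make the save/restore preserve unset-ness (a sketch; not necessarily what the test file ended up doing):

oldtz = Sys.getenv('TZ', unset = NA)
Sys.setenv(TZ = 'Asia/Jakarta')  # UTC+7
# ... test body ...
if (is.na(oldtz)) Sys.unsetenv('TZ') else Sys.setenv(TZ = oldtz)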

@MichaelChirico
Member Author

extend the parser to interpret the datetime in local timezone

This gets to a broader point that I also got stuck at when looking into #1162 (sep2) -- from a data-structures point of view, how would one implement such a thing in fread? In particular, I got stuck at making such infra R-agnostic so it can work with pydatatable.

As it relates to this issue, if we see 2020-01-01 01:02:03 Asia/Bahrain, I guess the return type would have to be a struct? One piece the parsed time-in-seconds, one piece the time zone string?

In any case I don't think it's something that we should try to address for now.

@eddelbuettel
Contributor

how would one implement such a thing

I don't know either but I have e.g. been wrapping Boost date_time for a decade or so (first in RcppBDT, now in anytime) and it simply does not parse timezones, see its (long) docs page and scroll down about 1/6 or search for "(output only)". 😢

@MichaelChirico
Member Author

MichaelChirico commented Jul 13, 2020

More directly for the issue at hand, I am of two minds about time zones. I recognize base is trying very hard to figure out the system time zone, but I have reproducibility in mind, recalling what Kurt Hornik wrote about changing stringsAsFactors.

If the system time zone is used, co-authors in different locales can conceivably get different downstream results on their respective machines. Hopefully as innocuous as a shift, but I'm not sure how badly that could spiral out in a complicated analysis (what if the analysis period goes over daylight savings time in one author's locale, but not another's? what about a locale where there was a permanent jump change in the time zone definition?). Obviously it's ideal if everyone is fastidious about setting time zones when they're doing time-series work, but certainly R-core agrees this type of thing is a big enough threat for it to be a major reason to change the stringsAsFactors default. And certainly all of our production data at Grab uses UTC timestamps for a reason 😃

Lastly, for ISO8601 timestamps like 2020-01-02T03:04:05+04:30, that +04:30 is clearly relative to UTC time. And Zulu time (Z) is UTC. The original thrust of this PR was to read ISO8601 timestamps which AFAICT are UTC-centric; it just happened that reading other timestamps was trivially easy to include at the same time, leading to the ambiguity here.

That's my two cents on why I went with UTC as a default. I agree with the need to anticipate more potentially breaking code, especially as base R doesn't (currently?) accept as.POSIXct(<POSIXct>, tz='New/Timezone') as a syntax for changing time zone. So code that was using as.POSIXct to a non-system time zone may also be broken.

@MichaelChirico
Member Author

it simply does not parse timezones

That's a bummer. When I started this PR I thought it would be out of scope and wanted to stick with Zulu time (Z) only, but actually the standard +hh:mm and +hhmm forms are easy enough to handle. The real monster is parsing timezone names & figuring out what is/is not in OlsonNames() -- that could reasonably be left to the R level.

The next can of worms -- POSIXct only accepts one time zone per column. Hence #3160. For another day...

@mattdowle
Member

mattdowle commented Jul 14, 2020

As it relates to this issue, if we see 2020-01-01 01:02:03 Asia/Bahrain, I guess the return type would have to be a struct? One piece the parsed time-in-seconds, one piece the time zone string?

This almost never occurs. We'll never support a string like Asia/Bahrain in the data like that. UTC offsets are all we need, as you've implemented already. The real issue is a blank timezone, i.e. 2020-01-01 01:02:03. That is interpreted as local timezone by R by default and, as Dirk referred to too, that's what users expect and what we need to maintain.

This is why fwrite writes UTC datetimes by default and includes the Z character, unlike R's write.csv, which writes timestamps in local time.
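
Illustrating that round trip (a sketch using fwrite's default dateTimeAs="ISO"):

library(data.table)
f = tempfile()
fwrite(data.table(t = as.POSIXct("2020-01-01 01:02:03", tz = "UTC")), f)
readLines(f)  # "t" "2020-01-01T01:02:03Z" -- UTC-marked on the way out
fread(f)$t    # read straight back as POSIXct with tzone = "UTC"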

I'm nearly done with wrapping this up so it can be merged for this release.

@MichaelChirico
Member Author

MichaelChirico commented Jul 14, 2020

I think I missed something, so I'm just noting it here now that I've caught up.

The original proposal was UTC-dominant: it assumed any timestamp-alike on input was written in UTC, so e.g.

2020-01-01 00:00:00

is read as a time written in UTC, and parsed as UTC.

But maybe it was written in local time -- then how should we parse? fasttime has a similar issue, in that it assumes every input string is UTC.

And in general, if we don't know the timezone, it's hard to adjust ex post anyway: because of daylight savings, timezone jumps, etc., the UTC offset vs. local time can be discontinuous. Instead the user has to rely on as.POSIXct to handle the heavy lifting in such cases.

I'm not even sure a tz argument to fread would help us here, unless we plan to shoehorn in all the complicated timezone logic at the parsing level -- the benefit is in allowing users to say tz='UTC' to force "fast parsing" on, then worry about time zone quirks on their own.

I do think it's common to find CSVs without timezone information in them directly (e.g. this benchmark of arrow on the NYC taxi data set -- my guess is their CSV reader has some built-in assumptions about UTC time), so it's unfortunate we can't help in this case (as we've seen such a big performance improvement, because loading all those unique strings chokes up the character cache). But hopefully production systems are emitting ISO8601 timestamps (certainly fwrite is).

…IXct afterwards to retain local time as before; all tests pass
@mattdowle
Member

mattdowle commented Jul 14, 2020

I'm not even sure a tz argument to fread would help us here

Read the NYC data with fread( , tz="UTC") and set the TZ env variable to "UTC". Done. Work away in UTC, but it will feel like whatever timezone the data was written in.
If the NYC data was written using local time, then it should exhibit the gap and overlap on EST/EDT changes, which would be a data-quality exercise to perform -- and a problem indeed, since that dataset is continuous 24 hours. But that's nothing to do with R, unless it was R that was used to create the file and its default local time was used. If it's local time, then you might be able to find some trips that took negative time when the clocks went back. I've seen datasets where some days are written in local and some days are written in UTC, just due to mistakes (config changes) at the system writing the data. It takes a few days to get corrected, but they don't go back and correct the history, because that itself would be rewriting history (live systems saw the live data and responded accordingly, so you have to keep what was live). The point is, all that analysis and investigation can be done by reading the datetimes as UTC and going from there.

@MichaelChirico
Member Author

all that analysis and investigation can be done by reading the datetimes as UTC and going from there.

In that case, shouldn't we plan to set tz='UTC' by default (maybe not this release, but next release?)

@mattdowle
Member

mattdowle commented Jul 14, 2020

In that case, shouldn't we plan to set tz='UTC' by default (maybe not this release, but next release?)

No, because of backwards compatibility with past data.table versions (colClasses=POSIXct using as.POSIXct), and expectations of matching the base R default. The R local-time default is ok for people working in one time zone, saving their own data, and loading it up again themselves in their timezone. But for everyone writing and maintaining production systems, or working across timezones, we set the TZ environment variable and work in UTC. What we could do is default tz= to 'UTC' when the timezone of the R session is UTC. In that case local time is UTC, so reading the unmarked datetimes as UTC is the same as reading them as local. As this PR stands now, as.POSIXct would still be used when TZ is UTC, so it'd be nice to redirect that to the direct parser.
Also, it's better code quality to see tz="UTC" explicitly passed in the fread call, so that the reader knows the unmarked datetimes will be interpreted as UTC?
But we can ask the community and see what folk say. Dirk was the first voice and was against UTC by default for unmarked datetimes. Personally, I'm not as strongly against it (data.table's raison d'être is to be different from base R). fwrite is different from base and writes UTC by default, after all. The difference with fwrite is that it was different right from the start. In contrast, fread has been around for a while and it's all about colClasses=POSIXct already in use, which uses as.POSIXct.

@mattdowle
Member

mattdowle commented Jul 14, 2020

Ok all tests and coverage pass. I'm done and ready to merge. Ok by you, @MichaelChirico?
We can add tz= to fread as a follow-up. Since that's not too hard, I'd say in this release. Default tz="" (local time) and allow tz="UTC" only, not any other timezone strings. tz= would only affect unmarked datetimes. The tricky thing is that the TZ environment variable is different: "" means UTC and unset means local. So making the default tz= if (Sys.getenv("TZ", unset=NA) %chin% c("", "UTC")) "UTC" else "" looks a bit hard on the eye. The choice of Sys.getenv's default for unset being "" rather than NA beats me; I can't imagine why that makes sense, especially since TZ's "" vs unset is so important.
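
Restating that proposed default as code (hypothetical follow-up, not part of this PR; %chin% is data.table's fast %in% for character vectors):

library(data.table)
# TZ unset -> local time -> tz=""; TZ="" or TZ="UTC" -> tz="UTC"
tz_default = if (Sys.getenv("TZ", unset = NA) %chin% c("", "UTC")) "UTC" else ""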
Anyway, merge PR as it stands now?

@MichaelChirico
Member Author

OK by me. BTW I would use data.table:::is_utc rather than checking 'UTC' only

@MichaelChirico
Member Author

What about tz=Sys.timezone() as the default? I know it's not an exact map to the default, but at least it signals to the user where to look better than '' does.

@mattdowle
Member

mattdowle commented Jul 14, 2020

I see why is_utc is useful, but doesn't allowing so many alternatives open the door straight away to using a whole load of equivalent names? It's a new argument, so isn't it a good time to use one standard: UTC? Then in user code, when grep'ing for example, we know it's either tz="UTC" or tz="". Anything else would be an error explaining either "UTC" or "".
Sys.timezone() returns "America/Denver" for me. I agree the call Sys.timezone() conveys better as a default, but its value leads to having to deal with timezone strings like that. R uses tz=="" to mean local, so it's in keeping with that. But we could use tz="local" for better clarity. Or call the argument utc=FALSE/TRUE, maybe.
There is something about invalid timezones being interpreted as UTC silently, which can be infuriating when you don't know that's happening. If something is invalid I want to see an error! I say that because tz="local" is an invalid timezone and could be interpreted as UTC silently by something somewhere (and UTC would be the opposite of what was intended by "local").
Anyway, let's discuss the tz= default in follow up PR/issue.

Labels
breaking-change issues whose solution would require breaking existing behavior