fread tz= default changed from "" to "UTC" #4894

Merged: 2 commits, Feb 6, 2021
10 changes: 10 additions & 0 deletions NEWS.md
@@ -4,6 +4,16 @@

# data.table [v1.13.7](https://github.com/Rdatatable/data.table/milestone/20) (in development)

## POTENTIALLY BREAKING CHANGES

1. In v1.13.0 (July 2020) native parsing of datetime was added to `fread` by Michael Chirico which dramatically improved reading datetime. Before then datetime was read as character by default which was slow. Since v1.13.0, UTC-marked datetime (e.g. 2020-07-24T10:11:12.134Z where the final `Z` is present) has been read automatically as POSIXct and quickly. We provided the migration option `datatable.old.fread.datetime.character` to revert to the previous slow character behavior. We also added the `tz=` argument to control unmarked datetime; i.e. where the `Z` (or equivalent UTC postfix) is missing in the data. The default `tz=""` reads unmarked datetime as character as before, slowly. We gave you the ability to set `tz='UTC'` to turn on the new behavior and read unmarked datetime as UTC, quickly. R sessions that are running in UTC by setting the TZ environment variable, as is good practice and common in production, have also been reading unmarked datetime as UTC since v1.13.0, much faster. Note 1 of v1.13.0 (below in this file) ended "In addition to convenience, `fread` is now significantly faster in the presence of dates, UTC-marked datetimes, and unmarked datetime when tz="UTC" is provided.".

At the `rstudio::global(2021)` conference, Neal Richardson, Director of Engineering at Ursa Labs, compared Arrow csv performance to data.table csv performance, [Bigger Data With Ease Using Apache Arrow](https://twitter.com/enpiar/status/1357729619420475392). He opened by comparing to data.table as his main point. Arrow was presented as 3 times faster than data.table. He talked at length about this result. This result is now being quoted in the community. However, no reproducible code was provided and we were not contacted in advance of the high profile talk in case we had any comments. Neal briefly mentioned New York Taxi data. That is a dataset known to us as containing unmarked datetime. We don't know if he set `tz='UTC'` or not. We could have suggested that if he had asked. We do know that setting `tz='UTC'` does speed up reading the New York Taxi dataset significantly. We don't know if the datetimes in the New York Taxi dataset really are in UTC, or local time, but we know it is common practice to read them as if they are UTC regardless.

We are open source developers just trying to do our best.

As an angry reaction to Neal's presentation, the default change from `tz=""` to `tz="UTC"` is accelerated. If you have been using `tz=` explicitly then there should be no change. The change to read UTC-marked datetime as POSIXct rather than character already happened in v1.13.0. The change now is that unmarked datetimes are read as UTC too by default, without needing to set `tz="UTC"`. None of the 1,004 CRAN packages directly using data.table are affected. As before, the migration option `datatable.old.fread.datetime.character` can still be set to TRUE to revert to the old character behaviour. This migration option is temporary and will be removed in the near future.

## BUG FIXES

1. If `fread()` discards a single line footer, the warning message which includes the discarded text now displays any non-ASCII characters correctly on Windows, [#4747](https://github.com/Rdatatable/data.table/issues/4747). Thanks to @shrektan for reporting and the PR.
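A minimal sketch of what the NEWS item above means for a caller, assuming a data.table build that includes this change (the test comments in this PR call it v1.14.0). The `csv` string and its column names are made up for illustration and are not part of the diff.

```r
library(data.table)

# One unmarked datetime: no trailing Z and no UTC offset,
# which is how utils::write.csv would emit it.
csv <- "id,when\n1,2020-07-24 10:11:12\n"

# New default: unmarked datetime is parsed natively as POSIXct in UTC.
new_default <- fread(text = csv)

# Passing tz="UTC" explicitly gives the same reading and also works on
# v1.13.x, where the default was still tz="".
explicit_utc <- fread(text = csv, tz = "UTC")

class(explicit_utc$when)           # includes "POSIXct"
attr(explicit_utc$when, "tzone")   # "UTC"

# tz="" asks for the previous behaviour. What comes back depends on the
# session: character when the session is not running in UTC, POSIXct when
# the TZ environment variable is set to UTC.
previous <- fread(text = csv, tz = "")
class(previous$when)
```

Code that relied on unmarked datetimes arriving as character can pass `tz=""` explicitly, which is the same adjustment this PR makes to the package's own tests.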
2 changes: 1 addition & 1 deletion R/fread.R
@@ -5,7 +5,7 @@ skip="__auto__", select=NULL, drop=NULL, colClasses=NULL, integer64=getOption("d
col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, index=NULL,
showProgress=getOption("datatable.showProgress",interactive()), data.table=getOption("datatable.fread.datatable",TRUE),
nThread=getDTthreads(verbose), logical01=getOption("datatable.logical01",FALSE), keepLeadingZeros=getOption("datatable.keepLeadingZeros",FALSE),
-yaml=FALSE, autostart=NA, tmpdir=tempdir(), tz="")
+yaml=FALSE, autostart=NA, tmpdir=tempdir(), tz="UTC")
{
if (missing(input)+is.null(file)+is.null(text)+is.null(cmd) < 3L) stop("Used more than one of the arguments input=, file=, text= and cmd=.")
input_has_vars = length(all.vars(substitute(input)))>0L # see news for v1.11.6
18 changes: 10 additions & 8 deletions inst/tests/tests.Rraw
@@ -10845,10 +10845,12 @@ TZnotUTC = !identical(tt,"") && !is_utc(tt)
if (TZnotUTC) {
# from v1.13.0 these tests work when running under non-UTC because they compare to as.POSIXct which reads these unmarked datetime in local
# the new tests 2150.* cover more cases
-test(1743.25, fread("a,b,c\n2015-06-01 11:00:00,1,ae", colClasses=c("POSIXct","integer","character")), data.table(a=as.POSIXct("2015-06-01 11:00:00"),b=1L,c="ae"))
-test(1743.26, fread("a,b,c,d,e,f,g,h\n1,k,2015-06-01 11:00:00,a,1.5,M,9,0", colClasses=list(POSIXct="c", character="b"), drop=c("a","b"), logical01=TRUE),
+# from v1.14.0, the tz="" is needed
+test(1743.25, fread("a,b,c\n2015-06-01 11:00:00,1,ae", colClasses=c("POSIXct","integer","character"), tz=""),
+data.table(a=as.POSIXct("2015-06-01 11:00:00"),b=1L,c="ae"))
+test(1743.26, fread("a,b,c,d,e,f,g,h\n1,k,2015-06-01 11:00:00,a,1.5,M,9,0", colClasses=list(POSIXct="c", character="b"), drop=c("a","b"), logical01=TRUE, tz=""),
ans<-data.table(c=as.POSIXct("2015-06-01 11:00:00"), d="a", e=1.5, f="M", g=9L, h=FALSE))
-test(1743.27, fread("a,b,c,d,e,f,g,h\n1,k,2015-06-01 11:00:00,a,1.5,M,9,0", colClasses=list(POSIXct="c", character=2), drop=c("a","b"), logical01=TRUE),
+test(1743.27, fread("a,b,c,d,e,f,g,h\n1,k,2015-06-01 11:00:00,a,1.5,M,9,0", colClasses=list(POSIXct="c", character=2), drop=c("a","b"), logical01=TRUE, tz=""),
ans)
}

@@ -17062,7 +17064,7 @@ test(2150.01, fread(tmp), DT) # defaults for fwrite/fread simple and preservin
fwrite(DT, tmp, dateTimeAs='write.csv') # as write.csv, writes the UTC times as-is not local because the time column has tzone=="UTC", but without the Z marker
oldtz = Sys.getenv("TZ", unset=NA)
Sys.unsetenv("TZ")
-test(2150.021, sapply(fread(tmp), typeof), c(dates="integer", times="character")) # as before v1.13.0, datetime with missing timezone read as character
+test(2150.021, sapply(fread(tmp,tz=""), typeof), c(dates="integer", times="character")) # from v1.14.0 tz="" needed to read datetime as character
test(2150.022, fread(tmp,tz="UTC"), DT) # user can tell fread to interpret the unmarked datetimes as UTC
Sys.setenv(TZ="UTC")
test(2150.023, fread(tmp), DT) # TZ environment variable is also recognized
@@ -17072,7 +17074,7 @@ if (.Platform$OS.type!="windows") {
# blank TZ env variable on non-Windows is recognized as UTC consistent with C and R; but R's tz= argument is the opposite and uses "" for local
}
Sys.unsetenv("TZ")
-tt = fread(tmp, colClasses=list(POSIXct="times"))
+tt = fread(tmp, colClasses=list(POSIXct="times"), tz="") # from v1.14.0 tz="" needed
test(2150.025, attr(tt$times, "tzone"), "") # as.POSIXct puts "" on the result (testing the write.csv version here with missing tzone)
# the times will be different though here because as.POSIXct read them as local time.
if (is.na(oldtz)) Sys.unsetenv("TZ") else Sys.setenv(TZ=oldtz)
@@ -17098,7 +17100,7 @@ test(2150.11,fread("a,b\n2015-01-01,2015-01-01", colClasses="POSIXct"), # local
data.table(a=as.POSIXct("2015-01-01"), b=as.POSIXct("2015-01-01")))
test(2150.12,fread("a,b\n2015-01-01,2015-01-01", select=c(a="Date",b="POSIXct")), # select colClasses form, for coverage
data.table(a=as.Date("2015-01-01"), b=as.POSIXct("2015-01-01")))
-test(2150.13, fread("a,b\n2015-01-01,1.1\n2015-01-02 01:02:03,1.2"), # no Z so as character as before v1.13.0
+test(2150.13, fread("a,b\n2015-01-01,1.1\n2015-01-02 01:02:03,1.2", tz=""), # no Z, tz="" needed for this test from v1.14.0
if (TZnotUTC) data.table(a=c("2015-01-01","2015-01-02 01:02:03"), b=c(1.1, 1.2))
else data.table(a=setattr(c(as.POSIXct("2015-01-01",tz="UTC"), as.POSIXct("2015-01-02 01:02:03",tz="UTC")),"tzone","UTC"), b=c(1.1, 1.2)))
# some rows are date-only, some rows UTC-timestamp --> read the date-only in UTC too
@@ -17112,9 +17114,9 @@ test(2150.16, fread("a,b,c\n2015-01-01,2015-01-02,2015-01-03 01:02:03", colClass
ans_print = capture.output(print(ans))
options(datatable.old.fread.datetime.character=NULL)
if (TZnotUTC) {
-test(2150.17, fread("a,b,c\n2015-01-01,2015-01-02,2015-01-03 01:02:03", colClasses=c("Date","IDate","POSIXct")),
+test(2150.17, fread("a,b,c\n2015-01-01,2015-01-02,2015-01-03 01:02:03", colClasses=c("Date","IDate","POSIXct"), tz=""),
ans, output=ans_print)
-test(2150.18, fread("a,b,c\n2015-01-01,2015-01-02,2015-01-03 01:02:03", colClasses=c("Date",NA,NA)),
+test(2150.18, fread("a,b,c\n2015-01-01,2015-01-02,2015-01-03 01:02:03", colClasses=c("Date",NA,NA), tz=""),
data.table(a=as.Date("2015-01-01"), b=as.IDate("2015-01-02"), c="2015-01-03 01:02:03"), output=ans_print)
} else {
test(2150.19, fread("a,b,c\n2015-01-01,2015-01-02,2015-01-03 01:02:03", colClasses=c("Date","IDate","POSIXct")),
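The tests above toggle the `TZ` environment variable around `fread` calls. The sketch below replays that pattern outside the test harness, assuming a data.table version with the `tz=` argument (v1.13.0 or later). The `csv` string is hypothetical. The save-and-restore of `TZ` mirrors what tests 2150.02x do.

```r
library(data.table)

# Hypothetical input with one unmarked datetime column.
csv <- "id,times\n1,2020-07-24 10:11:12\n"

# Remember the caller's TZ so it can be restored afterwards.
oldtz <- Sys.getenv("TZ", unset = NA)

# With TZ unset, tz="" leaves the unmarked datetime as text
# (this is what test 2150.021 checks).
Sys.unsetenv("TZ")
type_when_unset <- typeof(fread(text = csv, tz = "")$times)

# With TZ set to UTC the session is already in UTC, so tz="" reads the
# unmarked datetime as POSIXct (compare test 2150.023).
Sys.setenv(TZ = "UTC")
value_when_utc <- fread(text = csv, tz = "")$times

# Restore the original environment.
if (is.na(oldtz)) Sys.unsetenv("TZ") else Sys.setenv(TZ = oldtz)

type_when_unset
class(value_when_utc)
```

Restoring `TZ` at the end matters in shared sessions, since later date handling in the same R process would otherwise inherit the change.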
4 changes: 2 additions & 2 deletions man/fread.Rd
@@ -24,7 +24,7 @@ data.table=getOption("datatable.fread.datatable", TRUE),
nThread=getDTthreads(verbose),
logical01=getOption("datatable.logical01", FALSE), # due to change to TRUE; see NEWS
keepLeadingZeros = getOption("datatable.keepLeadingZeros", FALSE),
-yaml=FALSE, autostart=NA, tmpdir=tempdir(), tz=""
+yaml=FALSE, autostart=NA, tmpdir=tempdir(), tz="UTC"
)
}
\arguments{
@@ -64,7 +64,7 @@ yaml=FALSE, autostart=NA, tmpdir=tempdir(), tz=""
\item{yaml}{ If \code{TRUE}, \code{fread} will attempt to parse (using \code{\link[yaml]{yaml.load}}) the top of the input as YAML, and further to glean parameters relevant to improving the performance of \code{fread} on the data itself. The entire YAML section is returned as parsed into a \code{list} in the \code{yaml_metadata} attribute. See \code{Details}. }
\item{autostart}{ Deprecated and ignored with warning. Please use \code{skip} instead. }
\item{tmpdir}{ Directory to use as the \code{tmpdir} argument for any \code{tempfile} calls, e.g. when the input is a URL or a shell command. The default is \code{tempdir()} which can be controlled by setting \code{TMPDIR} before starting the R session; see \code{\link[base:tempfile]{base::tempdir}}. }
-\item{tz}{ Relevant to datetime values which have no Z or UTC-offset at the end, i.e. \emph{unmarked} datetime, as written by \code{\link[utils:write.table]{utils::write.csv}}. The default \code{tz=""} means interpret unmarked datetime in the timezone of the R session, for consistency with R's \code{as.POSIXct()} and backwards compatibility. Set \code{tz="UTC"} to read unmarked datetime in UTC. Note that \code{fwrite()} by default writes datetime in UTC including the final Z (i.e. UTC-marked datetime) and \code{fwrite}'s output will be read by \code{fread} consistently and quickly without needing to use \code{tz=} or \code{colClasses=}. If the TZ environment variable is set to \code{"UTC"} (or \code{""} on non-Windows where unset vs `""` is significant) then R's timezone is already UTC, the default \code{tz=""} means UTC, and unmarked datetime will be read as UTC. The TZ environment variable being unset, however, means local time, in both C and R, and is quite different from the TZ environment variable being set to \code{""} on non-Windows which means UTC not local. You can use \code{Sys.setenv(TZ="UTC")}, and \code{Sys.unsetenv("TZ")}, too, and \code{fread} will use the latest value. }
+\item{tz}{ Relevant to datetime values which have no Z or UTC-offset at the end, i.e. \emph{unmarked} datetime, as written by \code{\link[utils:write.table]{utils::write.csv}}. The default \code{tz="UTC"} reads unmarked datetime as UTC POSIXct efficiently. \code{tz=""} reads unmarked datetime as type character (slowly) so that \code{as.POSIXct} can interpret (slowly) the character datetimes in local timezone; e.g. by using \code{"POSIXct"} in \code{colClasses=}. Note that \code{fwrite()} by default writes datetime in UTC including the final Z and therefore \code{fwrite}'s output will be read by \code{fread} consistently and quickly without needing to use \code{tz=} or \code{colClasses=}. If the \code{TZ} environment variable is set to \code{"UTC"} (or \code{""} on non-Windows where unset vs `""` is significant) then the R session's timezone is already UTC and \code{tz=""} will result in unmarked datetimes being read as UTC POSIXct. For more information, please see the news items from v1.13.0 and v1.14.0. }
}
\details{

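The `tz` documentation above distinguishes reading unmarked datetime as UTC from handing it to `as.POSIXct` in local time. The sketch below shows why the choice matters, under stated assumptions: the `csv` string is hypothetical, and `America/New_York` is only an example zone chosen for illustration, not a claim about any particular dataset.

```r
library(data.table)

csv <- "id,times\n1,2020-07-24 10:11:12\n"

# tz="UTC": the wall-clock text is taken to be a UTC instant.
as_utc <- fread(text = csv, tz = "UTC")$times

# tz="" with colClasses: the column is converted to POSIXct in the
# session's timezone, as described for as.POSIXct in the \item{tz} text.
as_session <- fread(text = csv, colClasses = list(POSIXct = "times"), tz = "")$times

# Base R makes the size of the possible gap concrete for one example zone:
# the same wall-clock text names two different instants.
utc <- as.POSIXct("2020-07-24 10:11:12", tz = "UTC")
ny  <- as.POSIXct("2020-07-24 10:11:12", tz = "America/New_York")
gap_hours <- as.numeric(difftime(ny, utc, units = "hours"))
gap_hours   # 4, because New York is on UTC-4 in July
```

If the data were recorded in local time, reading it as UTC keeps the wall-clock digits intact but shifts the underlying instant by the zone offset, so any later conversion between zones needs to account for that.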