add message that datatable.nomatch will be deprecated #3612

Merged
merged 4 commits into from
May 31, 2019
4 changes: 3 additions & 1 deletion NEWS.md
@@ -80,7 +80,7 @@

9. New convenience functions `%ilike%` and `%flike%` which map to new `like()` arguments `ignore.case` and `fixed` respectively, [#3333](https://github.com/Rdatatable/data.table/issues/3333). `%ilike%` is for case-insensitive pattern matching. `%flike%` is for more efficient matching of fixed strings. Thanks to @andreasLD for providing most of the core code.

10. It is now possible to join two tables on their common columns, so called _natural join_, [#629](https://github.com/Rdatatable/data.table/issues/629). Use `on=.NATURAL` or `options("datatable.naturaljoin"=TRUE)`. Latter one works only when `x` has no key, if key is present then key columns are being used to join as before. Thanks to David Kulp for request.
10. `on=.NATURAL` (TODO: `X[on=Y]`) joins two tables on their common column names, so called _natural join_, [#629](https://github.com/Rdatatable/data.table/issues/629). Thanks to David Kulp for request. As before, when `on=` is not provided, `X` must have a key and the key columns are used to join (like rownames, but multi-column and multi-type).

11. `as.data.table` gains `key` argument mirroring its use in `setDT` and `data.table`, [#890](https://github.com/Rdatatable/data.table/issues/890). As a byproduct, the arguments of `as.data.table.array` have changed order, which could affect code relying on positional arguments to this method. Thanks @cooldome for the suggestion and @MichaelChirico for implementation.

@@ -159,6 +159,8 @@

10. The `datatable.old.unique.by.key` option has been warning for 1 year that it is deprecated: `... Please stop using it and pass by=key(DT) instead for clarity ...`. This warning is now upgraded to error as per the schedule in note 10 of v1.11.0 (May 2018), and note 1 of v1.9.8 (Nov 2016). In June 2020 the option will be removed.

11. We intend to deprecate the `datatable.nomatch` option, [more info](https://github.com/Rdatatable/data.table/pull/3578/files). As a first step, a message is now printed upon use of the option (once per session). It asks you to stop using the option and to pass `nomatch=NULL` explicitly if you require an inner join. Outer join (`nomatch=NA`) has always been the default because it is safer; it does not drop missing data silently. The problem is that the option is global; i.e., if a user changes the default using this option for their own use, that can also change the behavior of joins inside packages that use `data.table`. This is the only `data.table` option with this concern.
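
As a rough illustration of the recommendation in this note, the sketch below contrasts the default outer join with an inner join requested at the call site. The tables `X` and `Y` and their columns are hypothetical, and the sketch assumes the `datatable.nomatch` option is not set in the session.

```r
library(data.table)

# Hypothetical tables, for illustration only.
X = data.table(id = 1:3, x_val = c("a", "b", "c"))
Y = data.table(id = 2:4, y_val = c(10, 20, 30))

# Default (nomatch=NA): every row of Y is kept; x_val is NA where id has no match in X.
outer_res = X[Y, on = "id"]

# Inner join requested explicitly in the call, so no global option is involved.
inner_res = X[Y, on = "id", nomatch = NULL]

nrow(outer_res)  # 3
nrow(inner_res)  # 2
```

Because `nomatch=NULL` appears in the call itself, the result does not depend on any option set elsewhere in the session.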


### Changes in [v1.12.2](https://github.com/Rdatatable/data.table/milestone/14?closed=1) (07 Apr 2019)

12 changes: 6 additions & 6 deletions R/data.table.R
@@ -238,7 +238,7 @@ replace_order = function(isub, verbose, env) {
return(isub)
}

"[.data.table" = function (x, i, j, by, keyby, with=TRUE, nomatch=getOption("datatable.nomatch"), mult="all", roll=FALSE, rollends=if (roll=="nearest") c(TRUE,TRUE) else if (roll>=0) c(FALSE,TRUE) else c(TRUE,FALSE), which=FALSE, .SDcols, verbose=getOption("datatable.verbose"), allow.cartesian=getOption("datatable.allow.cartesian"), drop=NULL, on=NULL)
"[.data.table" = function (x, i, j, by, keyby, with=TRUE, nomatch=getOption("datatable.nomatch", NA), mult="all", roll=FALSE, rollends=if (roll=="nearest") c(TRUE,TRUE) else if (roll>=0) c(FALSE,TRUE) else c(TRUE,FALSE), which=FALSE, .SDcols, verbose=getOption("datatable.verbose"), allow.cartesian=getOption("datatable.allow.cartesian"), drop=NULL, on=NULL)
{
# ..selfcount <<- ..selfcount+1 # in dev, we check no self calls, each of which doubles overhead, or could
# test explicitly if the caller is [.data.table (even stronger test. TO DO.)
@@ -297,6 +297,7 @@ replace_order = function(isub, verbose, env) {
if (length(rollends)>2L) stop("rollends must be length 1 or 2")
if (length(rollends)==1L) rollends=rep.int(rollends,2L)
# TO DO (document/faq/example). Removed for now ... if ((roll || rolltolast) && missing(mult)) mult="last" # for when there is exact match to mult. This does not control cases where the roll is mult, that is always the last one.
.unsafe.opt() #3585
missingnomatch = missing(nomatch)
if (is.null(nomatch)) nomatch = 0L # allow nomatch=NULL API already now, part of: https://github.com/Rdatatable/data.table/issues/857
if (!is.na(nomatch) && nomatch!=0L) stop("nomatch= must be either NA or NULL (or 0 for backwards compatibility which is the same as NULL)")
@@ -437,15 +438,15 @@ replace_order = function(isub, verbose, env) {
if (is.call(isub) && isub[[1L]] == "(" && !is.name(isub[[2L]]))
isub = isub[[2L]]
}

if (is.null(isub)) return( null.data.table() )

# optimize here so that we can switch it off if needed
check_eval_env = environment()
check_eval_env$eval_forder = FALSE
if (getOption("datatable.optimize") >= 1) {
isub = replace_order(isub, verbose, check_eval_env)
}
}
if (check_eval_env$eval_forder) {
order_env = new.env(parent=parent.frame()) # until 'forder' is exported
assign("forder", forder, order_env)
@@ -514,8 +515,7 @@ replace_order = function(isub, verbose, env) {
naturaljoin = FALSE
if (missing(on)) {
if (!haskey(x)) {
if (getOption("datatable.naturaljoin")) naturaljoin = TRUE
else stop("When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.")
stop("When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.")
}
} else if (identical(substitute(on), as.name(".NATURAL"))) naturaljoin = TRUE
if (naturaljoin) { # natural join #629
6 changes: 3 additions & 3 deletions R/foverlaps.R
@@ -1,10 +1,10 @@
foverlaps = function(x, y, by.x=if (!is.null(key(x))) key(x) else key(y), by.y=key(y), maxgap=0L, minoverlap=1L, type=c("any", "within", "start", "end", "equal"), mult=c("all", "first", "last"), nomatch=getOption("datatable.nomatch"), which=FALSE, verbose=getOption("datatable.verbose")) {
foverlaps = function(x, y, by.x=if (!is.null(key(x))) key(x) else key(y), by.y=key(y), maxgap=0L, minoverlap=1L, type=c("any", "within", "start", "end", "equal"), mult=c("all", "first", "last"), nomatch=getOption("datatable.nomatch", NA), which=FALSE, verbose=getOption("datatable.verbose")) {

if (!is.data.table(y) || !is.data.table(x)) stop("y and x must both be data.tables. Use `setDT()` to convert list/data.frames to data.tables by reference or as.data.table() to convert to data.tables by copying.")
maxgap = as.integer(maxgap); minoverlap = as.integer(minoverlap)
which = as.logical(which)
if (is.null(nomatch)) nomatch = 0L
nomatch = as.integer(nomatch)
.unsafe.opt() #3585
nomatch = if (is.null(nomatch)) 0L else as.integer(nomatch)
if (!length(maxgap) || length(maxgap) != 1L || is.na(maxgap) || maxgap < 0L)
stop("maxgap must be a non-negative integer value of length 1")
if (!length(minoverlap) || length(minoverlap) != 1L || is.na(minoverlap) || minoverlap < 1L)
15 changes: 13 additions & 2 deletions R/onLoad.R
@@ -1,5 +1,18 @@
# nocov start

# used to raise message (write to STDERR but not raise warning) once per session only
# in future this will be upgraded to warning, then error, until eventually removed after several years
.pkg.store = new.env()
.pkg.store$.unsafe.done = FALSE
.unsafe.opt = function() {
if (.pkg.store$.unsafe.done) return(invisible())
val = getOption("datatable.nomatch")
if (is.null(val)) return(invisible()) # not set is ideal (it's no longer set in .onLoad)
if (identical(val, NA) || identical(val, NA_integer_)) return(invisible()) # set to default NA is ok for now; in future possible message/warning asking to remove
message("The option 'datatable.nomatch' is being used and is not set to the default NA. This option is still honored for now but will be deprecated in future. Please see NEWS for 1.12.4 for detailed information and motivation. To specify inner join, please specify `nomatch=NULL` explicitly in your calls rather than changing the default using this option.")
.pkg.store$.unsafe.done = TRUE
}
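
The once-per-session pattern used by `.unsafe.opt()` can be sketched in base R as follows. All names here (`.store`, `warn_once`, `demo.nomatch`) are hypothetical and chosen for illustration; this is not the package's internal code. The flag lives in an environment because bindings in a package namespace are locked after loading, whereas the contents of an environment can still be modified.

```r
# Hypothetical names throughout; a minimal sketch of a once-per-session message.
.store = new.env()
.store$warned = FALSE

warn_once = function(opt = "demo.nomatch") {
  if (.store$warned) return(invisible(FALSE))      # already reported this session
  val = getOption(opt)
  if (is.null(val) || identical(val, NA)) return(invisible(FALSE))  # unset or default: stay silent
  message("Option '", opt, "' is set to a non-default value; please pass the argument explicitly.")
  .store$warned = TRUE
  invisible(TRUE)
}

options(demo.nomatch = 0L)
first = warn_once()    # emits the message and returns TRUE
second = warn_once()   # silent, returns FALSE
options(demo.nomatch = NULL)  # clean up the hypothetical option
```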

.Last.updated = vector("integer", 1L) # exported variable; number of rows updated by the last := or set(), #1885

.onLoad = function(libname, pkgname) {
@@ -42,7 +55,6 @@
# are relatively heavy functions where the overhead in getOption() would not be noticed. It's only really [.data.table where getOption default bit.
# Improvement to base::getOption() now submitted (100x; 5s down to 0.05s): https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17394
opts = c("datatable.verbose"="FALSE", # datatable.<argument name>
"datatable.nomatch"="NA_integer_", # datatable.<argument name>
"datatable.optimize"="Inf", # datatable.<argument name>
"datatable.print.nrows"="100L", # datatable.<argument name>
"datatable.print.topn"="5L", # datatable.<argument name>
@@ -58,7 +70,6 @@
"datatable.use.index"="TRUE", # global switch to address #1422
"datatable.prettyprint.char" = NULL, # FR #1091
"datatable.old.unique.by.key" = "FALSE" # TODO: remove in May 2020
,"datatable.naturaljoin" = "FALSE" # natural join, when set to TRUE then `on` defaults to `.NATURAL`
)
for (i in setdiff(names(opts),names(options()))) {
eval(parse(text=paste0("options(",i,"=",opts[i],")")))
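
Since `datatable.nomatch` is no longer set in `.onLoad`, the function defaults in this PR rely on the second argument of `getOption()` to supply `NA` when the option is unset. A small base R sketch of that fallback, using the hypothetical option name `demo.nomatch`:

```r
# "demo.nomatch" is a hypothetical option name used only for this sketch.
options(demo.nomatch = NULL)                   # make sure the option is unset

unset_val = getOption("demo.nomatch")          # NULL: nothing has been set
fallback_val = getOption("demo.nomatch", NA)   # NA: the supplied default is returned

options(demo.nomatch = 0L)                     # a user overrides the default globally
user_val = getOption("demo.nomatch", NA)       # 0L: the user's value wins over the default

options(demo.nomatch = NULL)                   # clean up
```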
20 changes: 8 additions & 12 deletions inst/tests/tests.Rraw
@@ -96,8 +96,7 @@ oldOptions = options(
datatable.verbose = FALSE,
datatable.alloccol = 1024L,
datatable.print.class = FALSE, # This is TRUE in cc.R and we like TRUE. But output= tests need to be updated (they assume FALSE currently)
datatable.rbindlist.check = NULL,
datatable.naturaljoin = FALSE
datatable.rbindlist.check = NULL
)
# some tests (e.g. 1066, 1293) rely on capturing output that will be garbled with small width
if (getOption('width') < 80L) options(width = 80L)
@@ -14840,28 +14839,25 @@ d2 = data.table(id1=rep(1L,3), id2=3:5, v2=3:1)
ans = data.table(id1=rep(1L, 3), id2=3:5, v1=c(2:3,NA_integer_), v2=3:1)
test(2045.01, d1[d2], error="columns to join by must be specified")
test(2045.02, d1[d2, on=.NATURAL, verbose=TRUE], ans, output="natural join using: [id1, id2]")
options(datatable.naturaljoin=TRUE)
test(2045.03, d1[d2, on=.(id1,id2)], ans)
test(2045.04, d1[d2, on=.(id1,id2), nomatch=NULL], ans[1:2])
test(2045.05, d1[d2, verbose=TRUE], ans, output="natural join using: [id1, id2]")
test(2045.05, d1[d2, on=.NATURAL, verbose=TRUE], ans, output="natural join using: [id1, id2]")
test(2045.06, d1[d2, on=.NATURAL, verbose=TRUE], ans, output="natural join using: [id1, id2]")
test(2045.07, d1[d2, nomatch=NULL, verbose=TRUE], ans[1:2], output="natural join using: [id1, id2]")
test(2045.07, d1[d2, nomatch=NULL, on=.NATURAL, verbose=TRUE], ans[1:2], output="natural join using: [id1, id2]")
setkey(d1, id1)
test(2045.08, nrow(d1[d2, allow.cartesian=TRUE]), 9L) # join
test(2045.09, d1[d2, on=.NATURAL, verbose=TRUE], ans, output="natural join using: [id1, id2]") # ignore key when on=.NATURAL
setkey(d1, NULL)
setnames(d2, c("a","b","c"))
test(2045.10, d1[d2], error="Attempting to do natural join but no common columns in provided tables")
test(2045.11, d1[d2, on=.NATURAL], error="Attempting to do natural join but no common columns in provided tables")
test(2045.10, d1[d2, on=.NATURAL], error="Attempting to do natural join but no common columns in provided tables")
d2 = data.table(id1=2:4, id2=letters[3:5], v2=3:1)
test(2045.12, d1[d2, on=.(id1,id2)], error="Incompatible join types: x.id2 (integer) and i.id2 (character)")
test(2045.13, d1[d2, verbose=TRUE], output="natural join", error="Incompatible join types: x.id2 (integer) and i.id2 (character)")
test(2045.14, d1[d1, verbose=TRUE], d1, output="natural join using all 'x' columns")
test(2045.11, d1[d2, on=.(id1,id2)], error="Incompatible join types: x.id2 (integer) and i.id2 (character)")
test(2045.12, d1[d2, on=.NATURAL, verbose=TRUE], output="natural join", error="Incompatible join types: x.id2 (integer) and i.id2 (character)")
test(2045.13, d1[d1, on=.NATURAL, verbose=TRUE], d1, output="natural join using all 'x' columns")
d1 = setDT(replicate(20L, 1L, simplify = FALSE))
d2 = copy(d1[ , 1:15])
setnames(d2, 1L, 'X1')
test(2045.15, d1[d2, verbose = TRUE], cbind(d1, X1 = d2$X1), output="natural join using: \\[.*[.]{3}\\]")
options(datatable.naturaljoin=FALSE)
test(2045.14, d1[d2, on=.NATURAL, verbose=TRUE], cbind(d1, X1 = d2$X1), output="natural join using: \\[.*[.]{3}\\]")
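
The behavior these tests cover can be tried interactively with a sketch like the one below. The tables are hypothetical stand-ins rather than the exact fixtures from the test file, and the sketch assumes a `data.table` version that supports `on=.NATURAL`.

```r
library(data.table)

# Hypothetical tables sharing the column names id1 and id2.
d1 = data.table(id1 = 1L, id2 = 1:4, v1 = 11:14)
d2 = data.table(id1 = 1L, id2 = 3:5, v2 = 3:1)

# Natural join: the join columns are the names common to both tables.
natural_res = d1[d2, on = .NATURAL]

# Equivalent join with the columns spelled out.
explicit_res = d1[d2, on = .(id1, id2)]

# Natural inner join: rows of d2 without a match in d1 are dropped.
natural_inner = d1[d2, on = .NATURAL, nomatch = NULL]
```

Here `natural_res` has three rows, with `v1` set to `NA` for the `id2 = 5` row, while `natural_inner` keeps only the two matching rows.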

#tests for adding key to as.data.table, #890
## as.data.table.numeric (should cover as.data.table.factor,
4 changes: 2 additions & 2 deletions man/data.table.Rd
@@ -21,7 +21,7 @@
data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFactors=FALSE)

\method{[}{data.table}(x, i, j, by, keyby, with = TRUE,
nomatch = getOption("datatable.nomatch"), # default: NA_integer_
nomatch = getOption("datatable.nomatch", NA),
mult = "all",
roll = FALSE,
rollends = if (roll=="nearest") c(TRUE,TRUE)
@@ -154,7 +154,7 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac

\item{drop}{ Never used by \code{data.table}. Do not use. It needs to be here because \code{data.table} inherits from \code{data.frame}. See \href{vignettes/datatable-faq.html}{datatable-faq}.}

\item{on}{ Indicate which columns in \code{x} should be joined with which columns in \code{i} along with the type of binary operator to join with (see non-equi joins below on this). When specified, this overrides the keys set on \code{x} and \code{i}. When \code{.NATURAL} keyword provided then \emph{natural join} is made (join on common columns). Optionally when setting option \code{"datatable.naturaljoin"=TRUE} and missing \code{x} has no key then \code{on} defaults to \code{.NATURAL}. There are multiple ways of specifying the \code{on} argument:
\item{on}{ Indicate which columns in \code{x} should be joined with which columns in \code{i} along with the type of binary operator to join with (see non-equi joins below on this). When specified, this overrides the keys set on \code{x} and \code{i}. When \code{.NATURAL} keyword provided then \emph{natural join} is made (join on common columns). There are multiple ways of specifying the \code{on} argument:
\itemize{
\item{As an unnamed character vector, e.g., \code{X[Y, on=c("a", "b")]}, used when columns \code{a} and \code{b} are common to both \code{X} and \code{Y}.}
\item{\emph{Foreign key joins}: As a \emph{named} character vector when the join columns have different names in \code{X} and \code{Y}.
4 changes: 2 additions & 2 deletions man/foverlaps.Rd
@@ -20,7 +20,7 @@ foverlaps(x, y, by.x = if (!is.null(key(x))) key(x) else key(y),
by.y = key(y), maxgap = 0L, minoverlap = 1L,
type = c("any", "within", "start", "end", "equal"),
mult = c("all", "first", "last"),
nomatch = getOption("datatable.nomatch"),
nomatch = getOption("datatable.nomatch", NA),
which = FALSE, verbose = getOption("datatable.verbose"))
}
\arguments{
@@ -64,7 +64,7 @@ of the overlap. This will be updated once \code{maxgap} is implemented.}
\code{"first"} or \code{"last"}.}
\item{nomatch}{ When a row (with interval say, \code{[a,b]}) in \code{x} has no
match in \code{y}, \code{nomatch=NA} (default) means \code{NA} is returned for
\code{y}'s non-\code{by.y} columns for that row of \code{x}. \code{nomatch=NULL}
\code{y}'s non-\code{by.y} columns for that row of \code{x}. \code{nomatch=NULL}
(or \code{0} for backward compatibility) means no rows will be returned for that
row of \code{x}. Use \code{options(datatable.nomatch=NULL)} to change the default
value (used when \code{nomatch} is not supplied).}
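
A short sketch of the two `nomatch` settings described for `foverlaps`, passed explicitly in the call rather than through the option. The interval tables are hypothetical, and the sketch assumes `datatable.nomatch` is not set in the session.

```r
library(data.table)

# Hypothetical interval tables; y must be keyed on its interval columns.
x = data.table(start = c(1L, 10L), end = c(5L, 12L))
y = data.table(start = 3L, end = 4L)
setkey(y, start, end)

# Default nomatch=NA: the x interval [10,12] has no overlap in y but is still returned.
with_na = foverlaps(x, y)

# nomatch=NULL: only x rows that overlap some interval in y are returned.
matched = foverlaps(x, y, nomatch = NULL)

nrow(with_na)  # 2
nrow(matched)  # 1
```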