Rdatatable · mattdowle · May 31, 2019 · May 29, 2019 · May 31, 2019 · May 31, 2019
@@ -80,7 +80,7 @@
 
 9. New convenience functions `%ilike%` and `%flike%` which map to new `like()` arguments `ignore.case` and `fixed` respectively, [#3333](https://github.com/Rdatatable/data.table/issues/3333). `%ilike%` is for case-insensitive pattern matching. `%flike%` is for more efficient matching of fixed strings. Thanks to @andreasLD for providing most of the core code.
 
-10. It is now possible to join two tables on their common columns, so called _natural join_, [#629](https://github.com/Rdatatable/data.table/issues/629). Use `on=.NATURAL` or `options("datatable.naturaljoin"=TRUE)`. Latter one works only when `x` has no key, if key is present then key columns are being used to join as before. Thanks to David Kulp for request.
+10. `on=.NATURAL` (TODO: `X[on=Y]`) joins two tables on their common column names, a so-called _natural join_, [#629](https://github.com/Rdatatable/data.table/issues/629). Thanks to David Kulp for the request. As before, when `on=` is not provided, `X` must have a key and the key columns are used to join (like rownames, but multi-column and multi-type).
 
 11. `as.data.table` gains `key` argument mirroring its use in `setDT` and `data.table`, [#890](https://github.com/Rdatatable/data.table/issues/890). As a byproduct, the arguments of `as.data.table.array` have changed order, which could affect code relying on positional arguments to this method. Thanks @cooldome for the suggestion and @MichaelChirico for implementation.
 
@@ -159,6 +159,8 @@
 
 10. The `datatable.old.unique.by.key` option has been warning for 1 year that it is deprecated: `... Please stop using it and pass by=key(DT) instead for clarity ...`. This warning is now upgraded to error as per the schedule in note 10 of v1.11.0 (May 2018), and note 1 of v1.9.8 (Nov 2016). In June 2020 the option will be removed.
 
+11. We intend to deprecate the `datatable.nomatch` option, [more info](https://github.com/Rdatatable/data.table/pull/3578/files). A message is now printed upon use of the option (once per session) as a first step. It asks you to please stop using the option and to pass `nomatch=NULL` explicitly if you require inner join. Outer join (`nomatch=NA`) has always been the default because it is safer; it does not drop missing data silently. The problem is that the option is global; i.e., if a user changes the default using this option for their own use, that can change the behavior of joins inside packages that use `data.table` too. This is the only `data.table` option with this concern.
+
 
 ### Changes in [v1.12.2](https://github.com/Rdatatable/data.table/milestone/14?closed=1)  (07 Apr 2019)
 

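To illustrate the two NEWS items above, here is a small sketch of a natural join with an explicit `nomatch=`. The tables `X` and `Y` and their contents are made up for illustration (loosely modelled on tests 2045.x), and the example assumes a `data.table` version that already supports `on=.NATURAL`.

```r
library(data.table)

# Hypothetical example tables; the shared column names are id1 and id2.
X = data.table(id1 = 1L, id2 = 1:4, v1 = 1:4)
Y = data.table(id1 = 1L, id2 = 3:5, v2 = 3:1)

# Natural join: the join columns are the names common to X and Y.
# The default nomatch=NA keeps every row of Y, filling v1 with NA where X has no match.
outer = X[Y, on = .NATURAL]

# Inner join: request it explicitly per call instead of via the global option.
inner = X[Y, on = .NATURAL, nomatch = NULL]

print(outer)
print(inner)
```

Passing `nomatch=NULL` in the call keeps the choice local to that join, which is the behaviour item 11 asks users to move to.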
@@ -238,7 +238,7 @@ replace_order = function(isub, verbose, env) {
   return(isub)
 }
 
-"[.data.table" = function (x, i, j, by, keyby, with=TRUE, nomatch=getOption("datatable.nomatch"), mult="all", roll=FALSE, rollends=if (roll=="nearest") c(TRUE,TRUE) else if (roll>=0) c(FALSE,TRUE) else c(TRUE,FALSE), which=FALSE, .SDcols, verbose=getOption("datatable.verbose"), allow.cartesian=getOption("datatable.allow.cartesian"), drop=NULL, on=NULL)
+"[.data.table" = function (x, i, j, by, keyby, with=TRUE, nomatch=getOption("datatable.nomatch", NA), mult="all", roll=FALSE, rollends=if (roll=="nearest") c(TRUE,TRUE) else if (roll>=0) c(FALSE,TRUE) else c(TRUE,FALSE), which=FALSE, .SDcols, verbose=getOption("datatable.verbose"), allow.cartesian=getOption("datatable.allow.cartesian"), drop=NULL, on=NULL)
 {
   # ..selfcount <<- ..selfcount+1  # in dev, we check no self calls, each of which doubles overhead, or could
   # test explicitly if the caller is [.data.table (even stronger test. TO DO.)
@@ -297,6 +297,7 @@ replace_order = function(isub, verbose, env) {
   if (length(rollends)>2L) stop("rollends must be length 1 or 2")
   if (length(rollends)==1L) rollends=rep.int(rollends,2L)
   # TO DO (document/faq/example). Removed for now ... if ((roll || rolltolast) && missing(mult)) mult="last" # for when there is exact match to mult. This does not control cases where the roll is mult, that is always the last one.
+  .unsafe.opt() #3585
   missingnomatch = missing(nomatch)
   if (is.null(nomatch)) nomatch = 0L # allow nomatch=NULL API already now, part of: https://github.com/Rdatatable/data.table/issues/857
   if (!is.na(nomatch) && nomatch!=0L) stop("nomatch= must be either NA or NULL (or 0 for backwards compatibility which is the same as NULL)")
@@ -437,15 +438,15 @@ replace_order = function(isub, verbose, env) {
       if (is.call(isub) && isub[[1L]] == "(" && !is.name(isub[[2L]]))
         isub = isub[[2L]]
     }
-    
+
     if (is.null(isub)) return( null.data.table() )
-    
+
     # optimize here so that we can switch it off if needed
     check_eval_env = environment()
     check_eval_env$eval_forder = FALSE
     if (getOption("datatable.optimize") >= 1) {
       isub = replace_order(isub, verbose, check_eval_env)
-    } 
+    }
     if (check_eval_env$eval_forder) {
       order_env = new.env(parent=parent.frame())            # until 'forder' is exported
       assign("forder", forder, order_env)
@@ -514,8 +515,7 @@ replace_order = function(isub, verbose, env) {
       naturaljoin = FALSE
       if (missing(on)) {
         if (!haskey(x)) {
-          if (getOption("datatable.naturaljoin")) naturaljoin = TRUE
-          else stop("When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.")
+          stop("When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.")
         }
       } else if (identical(substitute(on), as.name(".NATURAL"))) naturaljoin = TRUE
       if (naturaljoin) { # natural join #629

@@ -1,10 +1,10 @@
-foverlaps = function(x, y, by.x=if (!is.null(key(x))) key(x) else key(y), by.y=key(y), maxgap=0L, minoverlap=1L, type=c("any", "within", "start", "end", "equal"), mult=c("all", "first", "last"), nomatch=getOption("datatable.nomatch"), which=FALSE, verbose=getOption("datatable.verbose")) {
+foverlaps = function(x, y, by.x=if (!is.null(key(x))) key(x) else key(y), by.y=key(y), maxgap=0L, minoverlap=1L, type=c("any", "within", "start", "end", "equal"), mult=c("all", "first", "last"), nomatch=getOption("datatable.nomatch", NA), which=FALSE, verbose=getOption("datatable.verbose")) {
 
   if (!is.data.table(y) || !is.data.table(x)) stop("y and x must both be data.tables. Use `setDT()` to convert list/data.frames to data.tables by reference or as.data.table() to convert to data.tables by copying.")
   maxgap = as.integer(maxgap); minoverlap = as.integer(minoverlap)
   which = as.logical(which)
-  if (is.null(nomatch)) nomatch = 0L
-  nomatch = as.integer(nomatch)
+  .unsafe.opt() #3585
+  nomatch = if (is.null(nomatch)) 0L else as.integer(nomatch)
   if (!length(maxgap) || length(maxgap) != 1L || is.na(maxgap) || maxgap < 0L)
     stop("maxgap must be a non-negative integer value of length 1")
   if (!length(minoverlap) || length(minoverlap) != 1L || is.na(minoverlap) || minoverlap < 1L)

@@ -1,5 +1,18 @@
 # nocov start
 
+# used to raise message (write to STDERR but not raise warning) once per session only
+# in future this will be upgraded to warning, then error, until eventually removed after several years
+.pkg.store = new.env()
+.pkg.store$.unsafe.done = FALSE
+.unsafe.opt = function() {
+  if (.pkg.store$.unsafe.done) return(invisible())
+  val = getOption("datatable.nomatch")
+  if (is.null(val)) return(invisible())  # not set is ideal (it's no longer set in .onLoad)
+  if (identical(val, NA) || identical(val, NA_integer_)) return(invisible())  # set to default NA is ok for now; in future possible message/warning asking to remove
+  message("The option 'datatable.nomatch' is being used and is not set to the default NA. This option is still honored for now but will be deprecated in future. Please see NEWS for 1.12.4 for detailed information and motivation. To specify inner join, please specify `nomatch=NULL` explicitly in your calls rather than changing the default using this option.")
+  .pkg.store$.unsafe.done = TRUE
+}
+
 .Last.updated = vector("integer", 1L) # exported variable; number of rows updated by the last := or set(), #1885
 
 .onLoad = function(libname, pkgname) {
@@ -42,7 +55,6 @@
   # are relatively heavy functions where the overhead in getOption() would not be noticed.  It's only really [.data.table where getOption default bit.
   # Improvement to base::getOption() now submitted (100x; 5s down to 0.05s):  https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17394
   opts = c("datatable.verbose"="FALSE",            # datatable.<argument name>
-       "datatable.nomatch"="NA_integer_",      # datatable.<argument name>
        "datatable.optimize"="Inf",             # datatable.<argument name>
        "datatable.print.nrows"="100L",         # datatable.<argument name>
        "datatable.print.topn"="5L",            # datatable.<argument name>
@@ -58,7 +70,6 @@
        "datatable.use.index"="TRUE",           # global switch to address #1422
        "datatable.prettyprint.char" = NULL,     # FR #1091
        "datatable.old.unique.by.key" = "FALSE"  # TODO: remove in May 2020
-       ,"datatable.naturaljoin" = "FALSE"      # natural join, when set to TRUE then `on` defaults to `.NATURAL`
        )
   for (i in setdiff(names(opts),names(options()))) {
     eval(parse(text=paste0("options(",i,"=",opts[i],")")))

@@ -96,8 +96,7 @@ oldOptions = options(
   datatable.verbose = FALSE,
   datatable.alloccol = 1024L,
   datatable.print.class = FALSE,  #  This is TRUE in cc.R and we like TRUE. But output= tests need to be updated (they assume FALSE currently)
-  datatable.rbindlist.check = NULL,
-  datatable.naturaljoin = FALSE
+  datatable.rbindlist.check = NULL
 )
 # some tests (e.g. 1066, 1293) rely on capturing output that will be garbled with small width
 if (getOption('width') < 80L) options(width = 80L)
@@ -14840,28 +14839,25 @@ d2 = data.table(id1=rep(1L,3), id2=3:5, v2=3:1)
 ans = data.table(id1=rep(1L, 3), id2=3:5, v1=c(2:3,NA_integer_), v2=3:1)
 test(2045.01, d1[d2], error="columns to join by must be specified")
 test(2045.02, d1[d2, on=.NATURAL, verbose=TRUE], ans, output="natural join using: [id1, id2]")
-options(datatable.naturaljoin=TRUE)
 test(2045.03, d1[d2, on=.(id1,id2)], ans)
 test(2045.04, d1[d2, on=.(id1,id2), nomatch=NULL], ans[1:2])
-test(2045.05, d1[d2, verbose=TRUE], ans, output="natural join using: [id1, id2]")
+test(2045.05, d1[d2, on=.NATURAL, verbose=TRUE], ans, output="natural join using: [id1, id2]")
 test(2045.06, d1[d2, on=.NATURAL, verbose=TRUE], ans, output="natural join using: [id1, id2]")
-test(2045.07, d1[d2, nomatch=NULL, verbose=TRUE], ans[1:2], output="natural join using: [id1, id2]")
+test(2045.07, d1[d2, nomatch=NULL, on=.NATURAL, verbose=TRUE], ans[1:2], output="natural join using: [id1, id2]")
 setkey(d1, id1)
 test(2045.08, nrow(d1[d2, allow.cartesian=TRUE]), 9L) # join
 test(2045.09, d1[d2, on=.NATURAL, verbose=TRUE], ans, output="natural join using: [id1, id2]") # ignore key when on=.NATURAL
 setkey(d1, NULL)
 setnames(d2, c("a","b","c"))
-test(2045.10, d1[d2], error="Attempting to do natural join but no common columns in provided tables")
-test(2045.11, d1[d2, on=.NATURAL], error="Attempting to do natural join but no common columns in provided tables")
+test(2045.10, d1[d2, on=.NATURAL], error="Attempting to do natural join but no common columns in provided tables")
 d2 = data.table(id1=2:4, id2=letters[3:5], v2=3:1)
-test(2045.12, d1[d2, on=.(id1,id2)], error="Incompatible join types: x.id2 (integer) and i.id2 (character)")
-test(2045.13, d1[d2, verbose=TRUE], output="natural join", error="Incompatible join types: x.id2 (integer) and i.id2 (character)")
-test(2045.14, d1[d1, verbose=TRUE], d1, output="natural join using all 'x' columns")
+test(2045.11, d1[d2, on=.(id1,id2)], error="Incompatible join types: x.id2 (integer) and i.id2 (character)")
+test(2045.12, d1[d2, on=.NATURAL, verbose=TRUE], output="natural join", error="Incompatible join types: x.id2 (integer) and i.id2 (character)")
+test(2045.13, d1[d1, on=.NATURAL, verbose=TRUE], d1, output="natural join using all 'x' columns")
 d1 = setDT(replicate(20L, 1L, simplify = FALSE))
 d2 = copy(d1[ , 1:15])
 setnames(d2, 1L, 'X1')
-test(2045.15, d1[d2, verbose = TRUE], cbind(d1, X1 = d2$X1), output="natural join using: \\[.*[.]{3}\\]")
-options(datatable.naturaljoin=FALSE)
+test(2045.14, d1[d2, on=.NATURAL, verbose=TRUE], cbind(d1, X1 = d2$X1), output="natural join using: \\[.*[.]{3}\\]")
 
 #tests for adding key to as.data.table, #890
 ## as.data.table.numeric (should cover as.data.table.factor,

@@ -21,7 +21,7 @@
 data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFactors=FALSE)
 
 \method{[}{data.table}(x, i, j, by, keyby, with = TRUE,
-  nomatch = getOption("datatable.nomatch"),                   # default: NA_integer_
+  nomatch = getOption("datatable.nomatch", NA),
   mult = "all",
   roll = FALSE,
   rollends = if (roll=="nearest") c(TRUE,TRUE)
@@ -154,7 +154,7 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac
 
   \item{drop}{ Never used by \code{data.table}. Do not use. It needs to be here because \code{data.table} inherits from \code{data.frame}. See \href{vignettes/datatable-faq.html}{datatable-faq}.}
 
-  \item{on}{ Indicate which columns in \code{x} should be joined with which columns in \code{i} along with the type of binary operator to join with (see non-equi joins below on this). When specified, this overrides the keys set on \code{x} and \code{i}. When \code{.NATURAL} keyword provided then \emph{natural join} is made (join on common columns). Optionally when setting option \code{"datatable.naturaljoin"=TRUE} and missing \code{x} has no key then \code{on} defaults to \code{.NATURAL}. There are multiple ways of specifying the \code{on} argument:
+  \item{on}{ Indicate which columns in \code{x} should be joined with which columns in \code{i} along with the type of binary operator to join with (see non-equi joins below on this). When specified, this overrides the keys set on \code{x} and \code{i}. When \code{.NATURAL} keyword provided then \emph{natural join} is made (join on common columns). There are multiple ways of specifying the \code{on} argument:
         \itemize{
             \item{As an unnamed character vector, e.g., \code{X[Y, on=c("a", "b")]}, used when columns \code{a} and \code{b} are common to both \code{X} and \code{Y}.}
             \item{\emph{Foreign key joins}: As a \emph{named} character vector when the join columns have different names in \code{X} and \code{Y}.

@@ -20,7 +20,7 @@ foverlaps(x, y, by.x = if (!is.null(key(x))) key(x) else key(y),
     by.y = key(y), maxgap = 0L, minoverlap = 1L,
     type = c("any", "within", "start", "end", "equal"),
     mult = c("all", "first", "last"),
-    nomatch = getOption("datatable.nomatch"),
+    nomatch = getOption("datatable.nomatch", NA),
     which = FALSE, verbose = getOption("datatable.verbose"))
 }
 \arguments{
@@ -64,7 +64,7 @@ of the overlap. This will be updated once \code{maxgap} is implemented.}
 \code{"first"} or \code{"last"}.}
 \item{nomatch}{ When a row (with interval say, \code{[a,b]}) in \code{x} has no
 match in \code{y}, \code{nomatch=NA} (default) means \code{NA} is returned for
-\code{y}'s non-\code{by.y} columns for that row of \code{x}. \code{nomatch=NULL} 
+\code{y}'s non-\code{by.y} columns for that row of \code{x}. \code{nomatch=NULL}
 (or \code{0} for backward compatibility) means no rows will be returned for that
 row of \code{x}. Use \code{options(datatable.nomatch=NULL)} to change the default
 value (used when \code{nomatch} is not supplied).}
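
The pattern this PR relies on (an argument default read via `getOption(name, NA)` plus a message emitted at most once per session) can be sketched in base R as follows. All names here (`demo.nomatch`, `.store`, `check_option_once`, `join_like`) are hypothetical stand-ins, not the package's internals; the sketch only mirrors the shape of `.unsafe.opt()` shown in the diff.

```r
# Hypothetical names throughout; this is not data.table's internal code.
.store = new.env()
.store$done = FALSE

check_option_once = function(name = "demo.nomatch") {
  if (.store$done) return(invisible(FALSE))         # already messaged this session
  val = getOption(name)
  if (is.null(val)) return(invisible(FALSE))        # option not set: nothing to report
  if (identical(val, NA) || identical(val, NA_integer_)) return(invisible(FALSE))  # set to the default
  message("Option '", name, "' is set to a non-default value; please pass the argument explicitly instead.")
  .store$done = TRUE
  invisible(TRUE)
}

join_like = function(nomatch = getOption("demo.nomatch", NA)) {
  check_option_once()
  if (is.null(nomatch)) nomatch = 0L                # NULL is treated like 0 (inner join)
  nomatch
}

old = options(demo.nomatch = NULL)   # make sure the option is unset
r1 = join_like()                     # falls back to NA, no message
options(demo.nomatch = 0L)
first = check_option_once()          # emits the message once
second = check_option_once()         # silent on every later call
options(old)                         # restore the previous state
```

Supplying the fallback inside `getOption()` means the option no longer has to be registered in `.onLoad`, so an unset option and an option explicitly set to `NA` behave the same way.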