Merge branch 'master' into non-equi-joins

* master: Allow x's cols to be referred to using 'x.' prefix, addresses #1615 adding some helpful info on dev branch checks fixes invalid vignette urls in manual Remove old vignette. More clarifications to secondary indices. Clarify that 'on' doesn't *create* secondary indices. Remove unwanted text in secondary indices vignette. Rename vignette. Fix size of header for FAQ vignette. New vignette on secondary indices and auto indexing based subsets. More minor fix to vignette. Minor formatting fixes to vignettes. Fix Typos & rm. trailing whitespace in vignettes
Rdatatable · Apr 2, 2016 · 644f3bd · 644f3bd
2 parents b271cbd + 620276b
commit 644f3bd
Show file tree

Hide file tree

Showing 11 changed files with 633 additions and 280 deletions.
diff --git a/Contributing.md b/Contributing.md
@@ -23,7 +23,7 @@ Please file an issue before creating PRs so that it can be discussed first *befo
 4. Unless there's a strong reason against, please **squash all your commits together before issuing a PR**, since you would be working on one issue. 
 5. In your pull request's description, please state clearly as to what your PR does, i.e., what FR or bug your PR addresses, along with the issue number. For e.g, "Closes #717, rbindlist segfault on factor columns fixed and added tests."
 6. All bug fixes and feature requests should also have **tests** added, to help catch any regressions while fixing another issue some time later. Tests should be added to `inst/tests/tests.Rraw` file. 
-7. Ensure that all tests pass by typing `test.data.table()` on your branch. It's also better to `R CMD check --as-cran`. PRs with failed tests can't be merged of course, and it is hard to debug every PR and explain why it fails and how to fix it. The lesser the feedback required, the faster it is likely to be merged.
+7. Ensure that all tests pass by typing `test.data.table()` after instaling your branch. It's also better to `R CMD check --as-cran` against your branch source package archive `.tar.gz` file. You may want to add `--no-manual`, `--no-build-vignettes` or `--ignore-vignettes` (R 3.3.0+) options to reduce dependencies required to perform check. PRs with failed tests can't be merged of course, and it is hard to debug every PR and explain why it fails and how to fix it. The lesser the feedback required, the faster it is likely to be merged.
 8. The `README.md` file also has to be updated while fixing or implementing an issue. It should mention the issue number (along with the link) and what the issue is being closed. And also add a "Thanks to @your_name for the PR".
 
 **References:** If you are not sure how to issue a PR, but would like to contribute, these links should help get you started:

diff --git a/R/data.table.R b/R/data.table.R
@@ -782,7 +782,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
     if (missing(j)) {
         # missing(by)==TRUE was already checked above before dealing with i
         if (!length(x)) return(null.data.table())
-        if (!length(leftcols)) {               
+        if (!length(leftcols)) {
             ansvars = names(x)
             jisvars = character()
             xcols = xcolsAns = seq_along(x)
@@ -1076,7 +1076,8 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
                     }
                 }
                 # fix for long standing FR/bug, #495 and #484
-                if ( length(othervars <- setdiff(intersect(av, names(x)), c(bynames, ansvars))) ) {
+                allcols = c(names(x), paste("x.",names(x),sep=""), if (is.data.table(i)) c(names(i), paste("i.", names(i), sep="")))
+                if ( length(othervars <- setdiff(intersect(av, allcols), c(bynames, ansvars))) ) {
                     # we've a situation like DT[, c(sum(V1), lapply(.SD, mean)), by=., .SDcols=...] or 
                     # DT[, lapply(.SD, function(x) x *v1), by=, .SDcols=...] etc., 
                     ansvars = union(ansvars, othervars)
@@ -1085,7 +1086,8 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
                 # .SDcols might include grouping columns if users wants that, but normally we expect user not to include them in .SDcols
             } else {
                 if (!missing(.SDcols)) warning("This j doesn't use .SD but .SDcols has been supplied. Ignoring .SDcols. See ?data.table.")
-                ansvars = setdiff(intersect(av,c(names(x),names(i),paste("i.",names(i),sep=""))), bynames)
+                allcols = c(names(x), paste("x.",names(x),sep=""), if (is.data.table(i)) c(names(i), paste("i.", names(i), sep="")))
+                ansvars = setdiff(intersect(av,allcols), bynames)
                 if (verbose) cat("Detected that j uses these columns:",if (!length(ansvars)) "<none>" else paste(ansvars,collapse=","),"\n")
                 # using a few named columns will be faster
                 # Consider:   DT[,max(diff(date)),by=list(month=month(date))]
@@ -1101,7 +1103,8 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
                     # get('varname') is too difficult to detect which columns are used in general
                     # eval(macro) column names are detected via the  if jsub[[1]]==eval switch earlier above.
                 }
-                ansvars = setdiff(c(names(x), if (is.data.table(i)) c(names(i), paste("i.", names(i), sep=""))),bynames) # fix for bug #5443
+                allcols = c(names(x), paste("x.",names(x),sep=""), if (is.data.table(i)) c(names(i), paste("i.", names(i), sep="")))
+                ansvars = setdiff(allcols,bynames) # fix for bug #5443
                 ansvals = chmatch(ansvars, names(x))
                 if (verbose) cat("New:",paste(ansvars,collapse=","),"\n")
             }
@@ -1215,7 +1218,15 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
 
         if (length(ansvars)) {
             w = ansvals
-            if (length(rightcols) && missing(by)) w[ w %in% rightcols ] = NA
+            if (length(rightcols) && missing(by)) {
+                w[ w %in% rightcols ] = NA
+            }
+            # patch for #1615. Allow 'x.' syntax. Only useful during join op when x's join col needs to be used.
+            # Note that I specifically have not implemented x[y, aa, on=c(aa="bb")] to refer to x's join column 
+            # as well because x[i, col] == x[i][, col] will not be TRUE anymore..
+            xjoincols = paste("x.",names(x),sep="")
+            if ( any(xjoinvals <- ansvars %in% xjoincols))
+                w[xjoinvals] = chmatch(ansvars[xjoinvals], xjoincols)
             if (!any(wna <- is.na(w))) {
                 xcols = w
                 xcolsAns = seq_along(ansvars)

diff --git a/README.md b/README.md
@@ -74,6 +74,8 @@
 
   29. New `split` method for data.table. Faster, more flexible and consistent with data.frame method. Closes [#1389](https://github.com/Rdatatable/data.table/issues/1389).
 
+  30. x's columns can be referred to in `j` using the prefix `x.` at all times. This is particularly useful when it is necessary to x's column that is *also a join column*. This is a patch addressing [#1615](https://github.com/Rdatatable/data.table/issues/1615).
+
 #### BUG FIXES
 
   1. Now compiles and runs on IBM AIX gcc. Thanks to Vinh Nguyen for investigation and testing, [#1351](https://github.com/Rdatatable/data.table/issues/1351).

diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw
@@ -8404,6 +8404,13 @@ test(1639.137, sort.by.names(ans), sort.by.names(unlist(split(setDT(df), by=c("p
 test(1639.138, ans, split(as.data.table(df), by=c("product","year")))
 test(1639.139, sort.by.names(ans), sort.by.names(unlist(split(as.data.table(df), by=c("product","year"), flatten=FALSE), recursive = FALSE)))
 
+# allow x's cols (specifically x's join cols) to be referred to using 'x.' syntax
+# patch for #1615. Note that I specifically have not implemented x[y, aa, on=c(aa="bb")] 
+# to refer to x's join column as well because x[i, col] == x[i][, col] will not be TRUE anymore..
+x <- data.table(aa = 1:3, cc = letters[1:3])
+y <- data.table(bb = 3:5, dd = 3:1)
+test(1640.1, x[y, x.aa, on=c(aa="bb")], INT(3,NA,NA))
+test(1640.2, x[y, c(.SD, .(x.aa=x.aa)), on=c(aa="bb")], data.table(aa=3:5, cc=c("c", NA,NA), x.aa=INT(3,NA,NA)))
 
 ##########################
 

diff --git a/man/data.table.Rd b/man/data.table.Rd
@@ -68,7 +68,7 @@ data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFacto
 
         \emph{Advanced:} When \code{i} is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope.
 
-        See \href{../doc/datatable-keys-fast-subsets}{Keys and fast binary search based subsets}, \href{../doc/datatable-secondary-indices}{Secondary indices} and \href{datatable-extend-subsets-to-joins}{Extending subsets to joins} vignettes.
+        See \href{../doc/datatable-keys-fast-subset.html}{Keys and fast binary search based subset}, \href{../doc/datatable-secondary-indices-and-auto-indexing.html}{Secondary indices and auto indexing} and \href{../doc/datatable-extend-subsets-to-joins.html}{Extending subsets to joins} vignettes.
     }
 
     \item{j}{When \code{with=TRUE} (default), \code{j} is evaluated within the frame of the data.table; i.e., it sees column names as if they are variables. This allows to not just \emph{select} columns in \code{j}, but also \code{compute} on them e.g., \code{x[, a]} and \code{x[, sum(a)]} returns \code{x$a} and \code{sum(x$a)} as a vector respectively. \code{x[, .(a, b)]} and \code{x[, .(sa=sum(a), sb=sum(b))]} returns a two column data.table each, the first simply \emph{selecting} columns \code{a, b} and the second \emph{computing} their sums. 
@@ -143,7 +143,7 @@ data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFacto
 
   \item{drop}{ Never used by \code{data.table}. Do not use. It needs to be here because \code{data.table} inherits from \code{data.frame}. See \href{../doc/datatable-faq.html}{datatable-faq}.}
 
-  \item{on}{ A named atomic vector of column names indicating which columns in \code{i} should be joined to which columns in \code{x}. When specified, this overrides the keys set on \code{x} and \code{i}. See \href{../doc/secondary-indices.html}{Secondary indices} and \href{../doc/datatable-extend-subsets-to-joins.html}{Extending subsets to joins} vignettes, and examples.}
+  \item{on}{ A named atomic vector of column names indicating which columns in \code{i} should be joined to which columns in \code{x}. When specified, this overrides the keys set on \code{x} and \code{i}. See \href{../doc/datatable-secondary-indices-and-auto-indexing.html}{Secondary indices and auto indexing} and \href{../doc/datatable-extend-subsets-to-joins.html}{Extending subsets to joins} vignettes, and examples.}
 }
 \details{
 \code{data.table} builds on base \R functionality to reduce 2 types of time:\cr