Skip to content

Commit

Permalink
Merge branch 'master' into non-equi-joins
Browse files Browse the repository at this point in the history
* master:
  Allow x's cols to be referred to using 'x.' prefix, addresses #1615
  adding some helpful info on dev branch checks
  fixes invalid vignette urls in manual
  Remove old vignette.
  More clarifications to secondary indices.
  Clarify that 'on' doesn't *create* secondary indices.
  Remove unwanted text in secondary indices vignette.
  Rename vignette.
  Fix size of header for FAQ  vignette.
  New vignette on secondary indices and auto indexing based subsets.
  More minor fix to vignette.
  Minor formatting fixes to vignettes.
  Fix Typos & rm. trailing whitespace in vignettes
  • Loading branch information
arunsrinivasan committed Apr 2, 2016
2 parents b271cbd + 620276b commit 644f3bd
Show file tree
Hide file tree
Showing 11 changed files with 633 additions and 280 deletions.
2 changes: 1 addition & 1 deletion Contributing.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ Please file an issue before creating PRs so that it can be discussed first *befo
4. Unless there's a strong reason against, please **squash all your commits together before issuing a PR**, since you would be working on one issue.
5. In your pull request's description, please state clearly as to what your PR does, i.e., what FR or bug your PR addresses, along with the issue number. For e.g, "Closes #717, rbindlist segfault on factor columns fixed and added tests."
6. All bug fixes and feature requests should also have **tests** added, to help catch any regressions while fixing another issue some time later. Tests should be added to `inst/tests/tests.Rraw` file.
7. Ensure that all tests pass by typing `test.data.table()` on your branch. It's also better to `R CMD check --as-cran`. PRs with failed tests can't be merged of course, and it is hard to debug every PR and explain why it fails and how to fix it. The lesser the feedback required, the faster it is likely to be merged.
7. Ensure that all tests pass by typing `test.data.table()` after instaling your branch. It's also better to `R CMD check --as-cran` against your branch source package archive `.tar.gz` file. You may want to add `--no-manual`, `--no-build-vignettes` or `--ignore-vignettes` (R 3.3.0+) options to reduce dependencies required to perform check. PRs with failed tests can't be merged of course, and it is hard to debug every PR and explain why it fails and how to fix it. The lesser the feedback required, the faster it is likely to be merged.
8. The `README.md` file also has to be updated while fixing or implementing an issue. It should mention the issue number (along with the link) and what the issue is being closed. And also add a "Thanks to @your_name for the PR".

**References:** If you are not sure how to issue a PR, but would like to contribute, these links should help get you started:
Expand Down
21 changes: 16 additions & 5 deletions R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -782,7 +782,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
if (missing(j)) {
# missing(by)==TRUE was already checked above before dealing with i
if (!length(x)) return(null.data.table())
if (!length(leftcols)) {
if (!length(leftcols)) {
ansvars = names(x)
jisvars = character()
xcols = xcolsAns = seq_along(x)
Expand Down Expand Up @@ -1076,7 +1076,8 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
}
}
# fix for long standing FR/bug, #495 and #484
if ( length(othervars <- setdiff(intersect(av, names(x)), c(bynames, ansvars))) ) {
allcols = c(names(x), paste("x.",names(x),sep=""), if (is.data.table(i)) c(names(i), paste("i.", names(i), sep="")))
if ( length(othervars <- setdiff(intersect(av, allcols), c(bynames, ansvars))) ) {
# we've a situation like DT[, c(sum(V1), lapply(.SD, mean)), by=., .SDcols=...] or
# DT[, lapply(.SD, function(x) x *v1), by=, .SDcols=...] etc.,
ansvars = union(ansvars, othervars)
Expand All @@ -1085,7 +1086,8 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
# .SDcols might include grouping columns if users wants that, but normally we expect user not to include them in .SDcols
} else {
if (!missing(.SDcols)) warning("This j doesn't use .SD but .SDcols has been supplied. Ignoring .SDcols. See ?data.table.")
ansvars = setdiff(intersect(av,c(names(x),names(i),paste("i.",names(i),sep=""))), bynames)
allcols = c(names(x), paste("x.",names(x),sep=""), if (is.data.table(i)) c(names(i), paste("i.", names(i), sep="")))
ansvars = setdiff(intersect(av,allcols), bynames)
if (verbose) cat("Detected that j uses these columns:",if (!length(ansvars)) "<none>" else paste(ansvars,collapse=","),"\n")
# using a few named columns will be faster
# Consider: DT[,max(diff(date)),by=list(month=month(date))]
Expand All @@ -1101,7 +1103,8 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
# get('varname') is too difficult to detect which columns are used in general
# eval(macro) column names are detected via the if jsub[[1]]==eval switch earlier above.
}
ansvars = setdiff(c(names(x), if (is.data.table(i)) c(names(i), paste("i.", names(i), sep=""))),bynames) # fix for bug #5443
allcols = c(names(x), paste("x.",names(x),sep=""), if (is.data.table(i)) c(names(i), paste("i.", names(i), sep="")))
ansvars = setdiff(allcols,bynames) # fix for bug #5443
ansvals = chmatch(ansvars, names(x))
if (verbose) cat("New:",paste(ansvars,collapse=","),"\n")
}
Expand Down Expand Up @@ -1215,7 +1218,15 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {

if (length(ansvars)) {
w = ansvals
if (length(rightcols) && missing(by)) w[ w %in% rightcols ] = NA
if (length(rightcols) && missing(by)) {
w[ w %in% rightcols ] = NA
}
# patch for #1615. Allow 'x.' syntax. Only useful during join op when x's join col needs to be used.
# Note that I specifically have not implemented x[y, aa, on=c(aa="bb")] to refer to x's join column
# as well because x[i, col] == x[i][, col] will not be TRUE anymore..
xjoincols = paste("x.",names(x),sep="")
if ( any(xjoinvals <- ansvars %in% xjoincols))
w[xjoinvals] = chmatch(ansvars[xjoinvals], xjoincols)
if (!any(wna <- is.na(w))) {
xcols = w
xcolsAns = seq_along(ansvars)
Expand Down
2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,8 @@

29. New `split` method for data.table. Faster, more flexible and consistent with data.frame method. Closes [#1389](https://github.com/Rdatatable/data.table/issues/1389).

30. x's columns can be referred to in `j` using the prefix `x.` at all times. This is particularly useful when it is necessary to x's column that is *also a join column*. This is a patch addressing [#1615](https://github.com/Rdatatable/data.table/issues/1615).

#### BUG FIXES

1. Now compiles and runs on IBM AIX gcc. Thanks to Vinh Nguyen for investigation and testing, [#1351](https://github.com/Rdatatable/data.table/issues/1351).
Expand Down
7 changes: 7 additions & 0 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -8404,6 +8404,13 @@ test(1639.137, sort.by.names(ans), sort.by.names(unlist(split(setDT(df), by=c("p
test(1639.138, ans, split(as.data.table(df), by=c("product","year")))
test(1639.139, sort.by.names(ans), sort.by.names(unlist(split(as.data.table(df), by=c("product","year"), flatten=FALSE), recursive = FALSE)))

# allow x's cols (specifically x's join cols) to be referred to using 'x.' syntax
# patch for #1615. Note that I specifically have not implemented x[y, aa, on=c(aa="bb")]
# to refer to x's join column as well because x[i, col] == x[i][, col] will not be TRUE anymore..
x <- data.table(aa = 1:3, cc = letters[1:3])
y <- data.table(bb = 3:5, dd = 3:1)
test(1640.1, x[y, x.aa, on=c(aa="bb")], INT(3,NA,NA))
test(1640.2, x[y, c(.SD, .(x.aa=x.aa)), on=c(aa="bb")], data.table(aa=3:5, cc=c("c", NA,NA), x.aa=INT(3,NA,NA)))

##########################

Expand Down
4 changes: 2 additions & 2 deletions man/data.table.Rd
Original file line number Diff line number Diff line change
Expand Up @@ -68,7 +68,7 @@ data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFacto
\emph{Advanced:} When \code{i} is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope.
See \href{../doc/datatable-keys-fast-subsets}{Keys and fast binary search based subsets}, \href{../doc/datatable-secondary-indices}{Secondary indices} and \href{datatable-extend-subsets-to-joins}{Extending subsets to joins} vignettes.
See \href{../doc/datatable-keys-fast-subset.html}{Keys and fast binary search based subset}, \href{../doc/datatable-secondary-indices-and-auto-indexing.html}{Secondary indices and auto indexing} and \href{../doc/datatable-extend-subsets-to-joins.html}{Extending subsets to joins} vignettes.
}
\item{j}{When \code{with=TRUE} (default), \code{j} is evaluated within the frame of the data.table; i.e., it sees column names as if they are variables. This allows to not just \emph{select} columns in \code{j}, but also \code{compute} on them e.g., \code{x[, a]} and \code{x[, sum(a)]} returns \code{x$a} and \code{sum(x$a)} as a vector respectively. \code{x[, .(a, b)]} and \code{x[, .(sa=sum(a), sb=sum(b))]} returns a two column data.table each, the first simply \emph{selecting} columns \code{a, b} and the second \emph{computing} their sums.
Expand Down Expand Up @@ -143,7 +143,7 @@ data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFacto

\item{drop}{ Never used by \code{data.table}. Do not use. It needs to be here because \code{data.table} inherits from \code{data.frame}. See \href{../doc/datatable-faq.html}{datatable-faq}.}

\item{on}{ A named atomic vector of column names indicating which columns in \code{i} should be joined to which columns in \code{x}. When specified, this overrides the keys set on \code{x} and \code{i}. See \href{../doc/secondary-indices.html}{Secondary indices} and \href{../doc/datatable-extend-subsets-to-joins.html}{Extending subsets to joins} vignettes, and examples.}
\item{on}{ A named atomic vector of column names indicating which columns in \code{i} should be joined to which columns in \code{x}. When specified, this overrides the keys set on \code{x} and \code{i}. See \href{../doc/datatable-secondary-indices-and-auto-indexing.html}{Secondary indices and auto indexing} and \href{../doc/datatable-extend-subsets-to-joins.html}{Extending subsets to joins} vignettes, and examples.}
}
\details{
\code{data.table} builds on base \R functionality to reduce 2 types of time:\cr
Expand Down
Loading

0 comments on commit 644f3bd

Please sign in to comment.