Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing crash when attempting to join on character(0) #4272

Merged
merged 11 commits into from
Jun 8, 2020
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -115,6 +115,8 @@ unit = "s")

16. Non-equi joins now automatically set `allow.cartesian=TRUE`, [4489](https://github.com/Rdatatable/data.table/issues/4489). Thanks to @Henrik-P for reporting.

17. `X[Y, on=character(0)]` and `merge(X, Y, by.x=character(0), by.y=character(0))` no longer crash, [#4272](https://github.com/Rdatatable/data.table/pull/4272). Thanks to @tlapak for the PR.

## NOTES

0. Retrospective license change permission was sought from and granted by 4 contributors who were missed in [PR#2456](https://github.com/Rdatatable/data.table/pull/2456), [#4140](https://github.com/Rdatatable/data.table/pull/4140). We had used [GitHub's contributor page](https://github.com/Rdatatable/data.table/graphs/contributors) which omits 3 of these due to invalid email addresses, unlike GitLab's contributor page which includes the ids. The 4th omission was a PR to a script which should not have been excluded; a script is code too. We are sorry these contributors were not properly credited before. They have now been added to the contributors list as displayed on CRAN. All the contributors of code to data.table hold its copyright jointly; your contributions belong to you. You contributed to data.table when it had a particular license at that time, and you contributed on that basis. This is why in the last license change, all contributors of code were consulted and each had a veto.
Expand Down
2 changes: 1 addition & 1 deletion R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -3037,7 +3037,7 @@ isReallyReal = function(x) {
onsub = as.call(c(quote(c), onsub))
}
on = eval(onsub, parent.frame(2L), parent.frame(2L))
if (!is.character(on))
if (length(on) == 0L || !is.character(on))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, perfect. we also shouldn't have gotten to checking by.x&by.y separately in the first place because here by.x=by.y so simply by should be used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure what you mean. At this point we're not checking separately if we come through merge. merge sets by=by.x and then later calls y[x, on=by]. If we don't check in merge we catch it here but this is the point where it gets caught when using x[y] syntax.

(I would've been really mad if you had pushed a fix yesterday.)

stop("'on' argument should be a named atomic vector of column names indicating which columns in 'i' should be joined with which columns in 'x'.")
## extract the operators and potential variable names from 'on'.
## split at backticks to take care about variable names like `col1<=`.
Expand Down
4 changes: 2 additions & 2 deletions R/merge.R
Original file line number Diff line number Diff line change
Expand Up @@ -21,8 +21,8 @@ merge.data.table = function(x, y, by = NULL, by.x = NULL, by.y = NULL, all = FAL
if (!missing(by) && !missing(by.x))
warning("Supplied both `by` and `by.x/by.y`. `by` argument will be ignored.")
if (!is.null(by.x)) {
if ( !is.character(by.x) || !is.character(by.y))
stop("A non-empty vector of column names are required for `by.x` and `by.y`.")
if (length(by.x) == 0L || !is.character(by.x) || !is.character(by.y))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aha! I was just looking at this code yesterday and something looked funny but I didn't bother stress testing it. nice catch!

stop("A non-empty vector of column names is required for `by.x` and `by.y`.")
if (!all(by.x %chin% names(x)))
stop("Elements listed in `by.x` must be valid column names in x.")
if (!all(by.y %chin% names(y)))
Expand Down
13 changes: 12 additions & 1 deletion inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -15912,7 +15912,7 @@ test(2074.31, dcast(DT, V1 ~ z, fun.aggregate=eval(quote(length)), value.var='z'
test(2074.32, fwrite(DT, logical01=TRUE, logicalAsInt=TRUE), error="logicalAsInt has been renamed")

# merge.data.table
test(2074.33, merge(DT, DT, by.x = 1i, by.y=1i), error="A non-empty vector of column names are required")
test(2074.33, merge(DT, DT, by.x = 1i, by.y=1i), error="A non-empty vector of column names is required")

# shift naming
test(2074.34, shift(list(a=1:5, b=6:10), give.names=TRUE), list(a_lag_1=c(NA, 1:4), b_lag_1=c(NA, 6:9)))
Expand Down Expand Up @@ -16682,6 +16682,9 @@ options(old_width)
DT = data.table(A="a", key="A")
test(2126.1, DT[J(NULL)], DT[0])
test(2126.2, DT[data.table()], DT[0])
# additional segfault when i is NULL and roll = 'nearest'
test(2126.3, DT[J(NULL), roll = 'nearest'], DT[0])
test(2126.4, DT[data.table(), roll = 'nearest'], DT[0])

# fcase, #3823
test_vec1 = -5L:5L < 0L
Expand Down Expand Up @@ -16947,3 +16950,11 @@ if (.Platform$OS.type=="windows") local({
# test back to English (the argument order is back to 1,c,2,1)
test(2144, rbind(DT,list(c=4L,a=7L)), error="Column 1 ['c'] of item 2 is missing in item 1")

# Attempting to join on character(0) shouldn't crash R
A = data.table(A='a')
B = data.table(B='b')
test(2145.1, A[B, on=character(0)], error = "'on' argument should be a named atomic vector")
test(2145.2, merge(A, B, by=character(0) ), error = "non-empty vector of column names for `by` is required.")
test(2145.3, merge(A, B, by.x=character(0), by.y=character(0)), error = "non-empty vector of column names is required")
# Also shouldn't crash when using internal functions
test(2145.4, bmerge(A, B, integer(), integer(), 0, c(FALSE, TRUE), NA, 'all', integer(), FALSE), error = 'icols and xcols must be non-empty')
4 changes: 3 additions & 1 deletion src/bmerge.c
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,8 @@ SEXP bmerge(SEXP iArg, SEXP xArg, SEXP icolsArg, SEXP xcolsArg, SEXP isorted, SE
i = iArg; x = xArg; // set globals so bmerge_r can see them.
if (!isInteger(icolsArg)) error(_("Internal error: icols is not integer vector")); // # nocov
if (!isInteger(xcolsArg)) error(_("Internal error: xcols is not integer vector")); // # nocov
if ((LENGTH(icolsArg) == 0 || LENGTH(xcolsArg) == 0) && LENGTH(i) > 0) // We let through LENGTH(i) == 0 for tests 2126.*
error(_("Internal error: icols and xcols must be non-empty integer vectors."));
if (LENGTH(icolsArg) > LENGTH(xcolsArg)) error(_("Internal error: length(icols) [%d] > length(xcols) [%d]"), LENGTH(icolsArg), LENGTH(xcolsArg)); // # nocov
icols = INTEGER(icolsArg);
xcols = INTEGER(xcolsArg);
Expand All @@ -68,7 +70,7 @@ SEXP bmerge(SEXP iArg, SEXP xArg, SEXP icolsArg, SEXP xcolsArg, SEXP isorted, SE
roll = 0.0; rollToNearest = FALSE;
if (isString(rollarg)) {
if (strcmp(CHAR(STRING_ELT(rollarg,0)),"nearest") != 0) error(_("roll is character but not 'nearest'"));
if (TYPEOF(VECTOR_ELT(i, icols[ncol-1]-1))==STRSXP) error(_("roll='nearest' can't be applied to a character column, yet."));
if (ncol > 0 && TYPEOF(VECTOR_ELT(i, icols[ncol-1]-1))==STRSXP) error(_("roll='nearest' can't be applied to a character column, yet."));
jangorecki marked this conversation as resolved.
Show resolved Hide resolved
roll=1.0; rollToNearest=TRUE; // the 1.0 here is just any non-0.0, so roll!=0.0 can be used later
} else {
if (!isReal(rollarg)) error(_("Internal error: roll is not character or double")); // # nocov
Expand Down