Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Both columns for rolling and non-equi joins #3093

Open
wants to merge 40 commits into
base: master
Choose a base branch
from

Conversation

mattdowle
Copy link
Member

@mattdowle mattdowle commented Oct 3, 2018

I can't push to remote branch sritchie73:non-equi-key so copied it into main from PR #2706.
Looks like commit history from Scott is retained which is good, and retained the branch name.
The extended discussion history will just be in the previous PR #2706 though. Can't think of a way round this. Scott is member of project so can create branches in main going forward.

Closes #1615
Closes #1700
Closes #2006
Closes #2569
Other related issues already closed (e.g. dups) : #1469, #1807, #2307, #2595, #2602

  • test.data.table() fails with 28 errors when datatable.bothkeycols is set to TRUE (caused by typo datatable.bothkeycol vs datatable.bothkeycols)
  • bothkeycols= renamed to old.nonequi= (explanation in comments below)
  • Place extra roll= column just after the first one, not at the end of the result.
  • Add short examples to the news item

sritchie73 and others added 30 commits March 26, 2018 13:54
New SQL-like column return now works when j is provided and there are only non-equi / roll key columns in 'on'.
…n(s)

129: added additional V1 column from J() to expected output.
130: added additional V1 column from J() to expected output.
131: added additional V1 column from J() to expected output.
148: added additional V1 column from CJ() to expected output.
149: added additional V1 column from J() to expected output.
917: updated expected output. added i.ts and updated ts (from x) to contain matched values from x$ts instead of matched values from i$ts.
1317.1: updated expected output. added i.y and updated y (from x) to contain matched values from x$y instead of matched values from i$y.
1317.2: updated expected output. added i.y and updated y (from x) to contain matched values from x$y instead of matched values from i$y.
1470.1: rolling join on J(), renamed x column from output as V1 and put in correct position, then added x as DT$x.
1470.2: rolling join on J(), renamed x column from output as V1 and put in correct position, then added x as DT$x.
1470.3: rolling join on SJ(), renamed x column from output as V1 and put in correct position, then added x as DT$x.
1614.1: updated expected output. renamed x column from output as V1 and put in correct position, then added dt$x column to expected output.
1614.2: updated expected output. renamed x column from output as V1 and put in correct position, then added dt$x column to expected output.
1614.3: updated expected output. renamed x column from output as V1 and put in correct position, then added dt$x column to expected output.
1614.4: updated expected output. renamed x column from output as V1 and put in correct position, then added dt$x column to expected output.
1660.1: updated expected return columns.
1660.2: updated expected return columns.
1682.1: Updated `by` statement in chained operator to use column names in dt2 to match columns used by `by=.EACHI`
1682.2: Updated `by` statement in chained operator to use column names in dt2 to match columns used by `by=.EACHI`
1682.3: Updated `by` statement in chained operator to use column names in dt2 to match columns used by `by=.EACHI`
1682.4: Updated `by` statement in chained operator to use column names in dt2 to match columns used by `by=.EACHI`
1682.5: Updated `by` statement in chained operator to use column names in dt2 to match columns used by `by=.EACHI`
1682.6: Updated `by` statement in chained operator to use column names in dt2 to match columns used by `by=.EACHI`
1682.7: Updated `by` statement in chained operator to use column names in dt2 to match columns used by `by=.EACHI`

New bugs (test ouput different from expected new behaviour):

1660.1: "End" column missing from output, and Location column has contents of dt1$end even though it is involved in a non-equi join.
1660.2: "Location" column missing from output.

1682.1: Column V2 is missing from non-equi join result
1682.2: Column V2 is missing from non-equi join result
1682.3: Column V2 is missing from non-equi join result
1682.4: Column V2 is missing from non-equi join result
1682.5: Column V2 is missing from non-equi join result
1682.6: Column V2 is missing from non-equi join result
1682.7: Column V2 is missing from non-equi join result

Bugs that have regressed with this PR and need to be patched again:

1469.2: after roll DT now has a key when it shouldnt

Tests that need following up, but may not be due to introduced bugs

1540.15: unexpected difference in column order (but not contents) after rolling join.
1614.4: values in dt$x do not match those expected based on values in dt$y, but this might be due to rollends=TRUE.
1846.1: columns do not match - check that handling of 0-row dt in i is correct based on linked issue.
1846.2 columns do not match - check that handling of 0-row dt in i is correct based on linked issue.
Expected output still had value from `J(4)` in `a` column instead of `TESTDT$a`.
Expected output still had values from `CJ(2,3)` in `a` column instead of `dt$a`.
missing c() inside of as.integer64
Added columns from empty d1 to expected output and and added column names prefixed with i. for the columns from d2.
Tests 1660.1, 1660.2 and 1846.1 now pass
To get the equivalent grouping of `by = .EACHI` when chaining operators you now have to specify `by = .(key.x = key.i)` to get the same answer and column names
This test checks whether a rolling join can be perform when a key has been set in the left data.table (DT1.copy), but not in the the right data.table (DT3). DT3 has to be reordered in this case so that the column to join on is the first column in DT3.

Now that the rolling join column y is returned from both DT1 and DT3, the test needed to be updated to reorder the columns in the expected output to match the ordering in the tested join.
Updated columns extracted in j to `V1` to reflect `x` is now from `DT` rather than `J(c(2,0))`. When extracting `x` instead of `V1`, `irows` is sorted so the result has `key() == "x"`
Functionality tested is captured by earlier unit tests that have been updated.
When TRUE (default) non-equi and roll columns are returned from both x and i.
options(datatable.bothkeycols=FALSE) is set at the top of the file so these tests can remain unmodified
incremented test number in tests.Rraw to solve merge conflict.

Merge branch 'master' into non-equi-key

# Conflicts:
#	inst/tests/tests.Rraw
@mattdowle
Copy link
Member Author

@sritchie73 Hence new strategy moving roll= into nomatch=. Better?

@sritchie73
Copy link
Contributor

sritchie73 commented Oct 5, 2018

The new behavior also impacts non-equi joins, and has the potential to cause nasty surprises for those who expect the old column naming behavior, e.g. see the date column in this example:

> DT = data.table(id=c("foo","foo","bar"), date=c(1L,3L,5L), val=6:8, key="id,date")
> DT2 <- data.table(id = "foo", date.V2 = 2)
> DT[DT2, .(id, date, i.date.V2, x.date), on = .(id, date < date.V2), old.nonequi = TRUE] 
    id date i.date.V2 x.date
1: foo    2         2      1
> DT[DT2, .(id, date, i.date.V2, x.date),  on = .(id, date < date.V2)]
    id date i.date.V2 x.date
1: foo    1         2      1

I also think we do want to return both join columns when performing a rolling join, e.g. as @MarkusBonsch demonstrated in this comment, although I suspect you might be able to use non-equi joins in conjunction with mult= to achieve this.

@mattdowle
Copy link
Member Author

Did you see I wrote :

We'd still have the issue of managing breaking changes in non-equi on=.

@sritchie73
Copy link
Contributor

sritchie73 commented Oct 5, 2018

Sorry, must have missed it.

In any case its useful to have the above example demonstrating the breaking change for others coming to this PR so they don't need to dig through the old one.

What are your thoughts on the following if we move roll= into nomatch=?:

I also think we do want to return both join columns when performing a rolling join, e.g. as @MarkusBonsch demonstrated in this comment, although I suspect you might be able to use non-equi joins in conjunction with mult= to achieve this.

@mattdowle
Copy link
Member Author

Again, I'm unsure why you're pointing to it in a way that makes me think you think I disagree or missed it. Did you see I wrote :

The new rolling join behaviour of this PR (i.e. returning both columns and with the different contents of the column with the same name) would be in nomatch= as per #857.

Are you reading on a phone or via email, perhaps? It's as if you are either not seeing all comments or not reading closely.

@sritchie73
Copy link
Contributor

I'm not sure I follow. My understanding is that #857 proposes to remove roll= entirely, and move the functionality to nomatch=. A rolling join with nomatch= would "fill in" the values of the (single) join column where there is no direct match. The example I link to shows are case where you want the functionality of a rolling join, but where it makes sense to preserve both columns in the result. Unless I'm missing something, it would be impossible to do this with nomatch=?

@mattdowle
Copy link
Member Author

Currently a rolling join ignores nomatch=. But a rolling join is what to do when there is no match, which is the idea of #857 to do it in nomatch=.
At the top of #857 it suggests a symbol nomatch=locf. Then later Arun suggests a function nomatch=LOCF(...) which I'm currently liking since a function can accept the limit and end control and avoids a new symbol: #857 (comment)

@MichaelChirico
Copy link
Member

I believe this Q I was struggling a bit to answer cleanly on SO yesterday is related:

https://stackoverflow.com/q/56916899/3576984

@jangorecki
Copy link
Member

@sritchie73 could you please merge to recent master and work out code coverage?

@tdhock
Copy link
Member

tdhock commented Oct 5, 2021

hi any update/plan to merge this soon? I would love to use non-equi join more often in research and teaching and this PR would make it so much easier to use/explain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants