R crashes on non-equi join #3401

Closed
raneameya opened this issue Feb 14, 2019 · 9 comments
Assignees: arunsrinivasan
Labels: bug, non-equi joins, regression

Comments

@raneameya

Hi,

I think I'm encountering a weird bug where R crashes as I try to do a non-equi join. Apologies for not being able to create a minimal reproducible example. I have attached two data.tables, both with 10,000 rows and up to 4 columns.

Here is the code that should (hopefully) reproduce the crash:

library(data.table)
DT1 <- readRDS('DT1.rds')
DT2 <- readRDS('DT2.rds')

# This does not work, R crashes on my system.
DT1[
  DT2, on = .(Month<=MonthFuture, Month>=MonthPast, FlightDetails==FlightCode)
]
# This works
set.seed(1)
n <- 1e3
DT1 <- DT1[sample(.N, n)]
DT2 <- DT2[sample(.N, n)]
DT1[
  DT2, on = .(Month<=MonthFuture, Month>=MonthPast, FlightDetails==FlightCode)
]

Output of sessionInfo() ----

R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_New Zealand.1252  LC_CTYPE=English_New Zealand.1252        LC_MONETARY=English_New Zealand.1252
[4] LC_NUMERIC=C                         LC_TIME=English_New Zealand.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.0

loaded via a namespace (and not attached):
[1] compiler_3.5.2 tools_3.5.2    yaml_2.2.0

Thank you. Let me know if there is anything I can provide to aid in debugging.

@raneameya
Author

raneameya commented Feb 14, 2019

I also tested on the latest dev version, and the same crash persists:
data.table 1.12.1 IN DEVELOPMENT built 2019-02-14 09:56:44 UTC;

@MichaelChirico
Member

This might be unrelated, but you should run setDT(DT1) after readRDS().
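A minimal sketch of that suggestion, assuming the usual reason for it: a data.table read back from an .rds file no longer carries its internal over-allocation, and setDT() restores it. The temp file and the toy table below are hypothetical and are not the attached DT1.rds.

```r
library(data.table)

# Hypothetical round trip through an .rds file (not the attached data).
path <- tempfile(fileext = ".rds")
saveRDS(data.table(a = 1:3, b = letters[1:3]), path)

DT <- readRDS(path)
setDT(DT)          # restore data.table's internal state after loading from disk
DT[, c := a * 2L]  # add a column by reference

unlink(path)
```

As the reply below confirms, this did not change the crash reported here; it is general hygiene after readRDS() rather than a fix for this issue.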

@raneameya
Author

raneameya commented Feb 14, 2019

Thanks Michael. I set DT1 and DT2 both with setDT and reran the commands, but R still crashed.

Were you able to reproduce this?

@arunsrinivasan
Member

This works up to and including 1.11.8. It segfaults from 1.12.0 onward.

@arunsrinivasan
Member

This seems to be the commit that broke it: e59ba14#diff-3f6e5ca10e702fb2c499a882aa3447e0

arunsrinivasan added a commit that referenced this issue Feb 16, 2019
arunsrinivasan self-assigned this Feb 16, 2019
arunsrinivasan added the bug and non-equi joins labels Feb 16, 2019

@raneameya
Author

Thanks @arunsrinivasan! That helps. I'm using 1.11.8 for now. Will upgrade to the latest dev once your fix is merged.

Also, I wanted to thank you all for your great work on data.table. It has been a pleasure to use, and its RAM efficiency has let us put off buying a new computer that supports more than 64GB of RAM for as long as possible.

arunsrinivasan added a commit that referenced this issue Feb 16, 2019
* Fix segfault issue, #3401

* Need to Free().
@ethanbsmith
Contributor

I just ran into this issue. Super happy to find it has already been logged and patched. I tested on my end against the dev version and can confirm it solves my issue as well. You folks rock!!!

@jangorecki
Member

And it will be landing on CRAN within days/hours

@ethanbsmith
Contributor

I was able to work around this in my scenario by adding a pre-filter on x. I haven't fully thought this through, but this might be a possible generic optimization to reduce working set. If not, just ignore ;)

My original code was something like:

d1[d2, on = .(rowid >= start.rowid, rowid < end.rowid, Cont.Low <= Top, Cont.High >= Bottom),
   allow.cartesian = TRUE, by = .EACHI,
   .(.N, res = sum(Cont.Low + Cont.High))]

By adding a filter on d1, i.e. d1[rowid >= min(d2$start.rowid) & rowid <= max(d2$end.rowid)], I was able to get this to work:

d1[rowid >= min(d2$start.rowid) & rowid <= max(d2$end.rowid)][
   d2, on = .(rowid >= start.rowid, rowid < end.rowid, Cont.Low <= Top, Cont.High >= Bottom),
   allow.cartesian = TRUE, by = .EACHI,
   .(.N, res = sum(Cont.Low + Cont.High))]

This filter on x is logically implied by the non-equi join conditions and doesn't actually affect the result, but it seems (in my scenario) to bypass the memory allocation.

Obviously there are all kinds of considerations, such as whether computing the min and max and applying the filter is worth the cost. As I said, feel free to ignore this if it's not useful.
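
To illustrate the claim that the pre-filter does not change the result, here is a small sketch on made-up data. Only the column names come from the code above; the values, the table sizes, and the helper count_matches() are hypothetical. It compares match counts only and says nothing about memory use.

```r
library(data.table)

# Hypothetical stand-ins for d1 and d2; only the column names come from the thread.
d1 <- data.table(
  rowid     = 1:200,
  Cont.Low  = as.numeric(1:200 %% 50),
  Cont.High = as.numeric(50 + 1:200 %% 50)
)
d2 <- data.table(
  start.rowid = c(20L, 60L, 120L),
  end.rowid   = c(40L, 90L, 150L),
  Top         = c(30, 40, 45),
  Bottom      = c(60, 70, 55)
)

# Hypothetical helper: count the rows of x matching each row of i.
count_matches <- function(x, i) {
  x[i,
    on = .(rowid >= start.rowid, rowid < end.rowid,
           Cont.Low <= Top, Cont.High >= Bottom),
    .N, by = .EACHI, allow.cartesian = TRUE]
}

full     <- count_matches(d1, d2)
filtered <- count_matches(
  d1[rowid >= min(d2$start.rowid) & rowid <= max(d2$end.rowid)], d2
)

# The pre-filter only drops rows of d1 that cannot match any row of d2,
# so the per-row match counts should agree.
identical(full$N, filtered$N)
```

Any row of d1 that satisfies rowid >= start.rowid and rowid < end.rowid for some row of d2 also satisfies the looser min/max bounds, which is why the filter is safe for this join.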
