Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should be able to refer to i's .SD during a join. #935

Open
arunsrinivasan opened this issue Nov 7, 2014 · 9 comments
Open

Should be able to refer to i's .SD during a join. #935

arunsrinivasan opened this issue Nov 7, 2014 · 9 comments
Labels
feature request joins top request One of our most-requested issues

Comments

@arunsrinivasan
Copy link
Member

set.seed(45L)
dt1= data.table(x=rep(1:2, each=10), y=sample(8,20,TRUE), key="x")
dt2 = data.table(x=1:2, y1=1:2, y2=3:4, y3=5:6, y4=7:8, key="x")

Now if I'd like to join and for each match get the count of values y >= y_i. Using by=.EACHI this is quite simple:

dt1[dt2, .(sum(y>=y1), sum(y>=y2), sum(y>=y3), sum(y>=y4)), by=.EACHI]

Having a i.SD would allow usage of:

dt1[dt2, lapply(i.SD, function(x) sum(y > x)), by=.EACHI]
@matthieugomez
Copy link
Contributor

That's be nice. i.SD would also be useful if one wants to merge in place dt1[dt2, :=] and keeps dt2 columns

@MichaelChirico
Copy link
Member

Adding my hat to the ring for this, as it would help solve a problem I encountered recently, basically what @matthieugomez mentioned--

Namely, updating many columns by reference during a join operation, especially when the inner table may have hard-to-predict column names.

That is, DT1[DT2, (names(DT2) := mget(paste0("i.", names(DT2)))] works (though looks clunky), but this approach doesn't work if we're doing something like:

DT1[DT2[,(summarize some variables),by=list(other_variables)],
        (what_to_use?):=(hard/inconvenient to predict/control names),on=other_variables]

At least with i.SD, we'd only have to keep track of the number of variables to be joined.

Here's an example:

trts <- LETTERS[1:20]

DT1 <- data.table(trts, location = 1:20)

set.seed(1023)
nn <- 10000L
DT2 <- data.table(id = sample(1:25, nn, T),
                  trts = sample(trts, nn, T),
                  val1 = runif(nn),
                  val2 = (runif(nn) > .2))

DT1[dcast(DT2[, .(sum(val1), mean(val2)), by = .(id, trts)],
          trts ~ id, value.var = c("V1", "V2")), on = "trts"]

It's easy to merge, but not to update DT1 by reference.

@aryoda
Copy link

aryoda commented May 4, 2016

Any news on that FR?

Since the i.col and x.col syntax is now supported in joins since 1.9.7 this may be a good code base for implementing x.SD and i.SD variables?

Allow x's cols to be referred to using 'x.' prefix

I'd love to see this feature in the data.table!

@arunsrinivasan arunsrinivasan added this to the v2.0.0 milestone Jun 12, 2016
@UweBlock
Copy link
Contributor

There is a related question on SO: How can I access all columns of i when using .EACHI in data.table

@MichaelChirico
Copy link
Member

A bit strange that get works on 'i.col' strings but eval does not (as pointed out in #1180).

This answer is very clunky; revisit when this is closed

@jangorecki
Copy link
Member

jangorecki commented Apr 8, 2020

I spotted it is already possible to refer to i's .SD, but using .iSD name rather than i.SD, and only when using by=.EACHI. But it doesn't really give any meaningful answer.

d1 = data.table(id=1:3)
d2 = data.table(id=2:4, v1=5:7, v2=6:8)

d1[d2, .iSD, on="id", by=.EACHI]
#Empty data.table (0 rows and 3 cols): id,v1,v2

d1[d2, .SD, on="id", by=.EACHI]
#Empty data.table (0 rows and 3 cols): id,v1,v2

@ColeMiller1
Copy link
Contributor

Adding to Jan's observation, we can trick jisvars and jiscols by just including quote(mget) in a {...} expression:

library(data.table)
set.seed(45L)
dt1= data.table(x=rep(1:2, each=10), y=sample(8,20,TRUE), key="x")
dt2 = data.table(x=1:2, y1=1:2, y2=3:4, y3=5:6, y4=7:8, key="x")

dt1[dt2, .(sum(y>=y1), sum(y>=y2), sum(y>=y3), sum(y>=y4)), by=.EACHI]
#>    x V1 V2 V3 V4
#> 1: 1 10  9  5  3
#> 2: 2  9  9  4  1
dt1[dt2, {quote(mget)
  lapply(.iSD, function(x) sum(y >= x))}, by=.EACHI]
#>    x  x y1 y2 y3 y4
#> 1: 1 10 10  9  5  3
#> 2: 2  9  9  9  4  1

Here, I also have the x column allocated but it's pretty close. So, it seems like this would be easy to implement if we used some of what is done with .SD.

@UweBlock
Copy link
Contributor

UweBlock commented Jul 14, 2020

For the sake of completeness, we can get the same result as in ColeMiler1's comment with

dt1[dt2, lapply(mget(names(dt2)), function(x) sum(y >= x)), by = .EACHI]

or

dt1[dt2, lapply(mget(names(.iSD)), function(x) sum(y >= x)), by = .EACHI]

   x  x y1 y2 y3 y4
1: 1 10 10  9  5  3
2: 2  9  9  9  4  1

BTW, we can get rid of the redundant second x column by

dt1[dt2, lapply(mget(setdiff(names(.iSD), names(.BY))), function(x) sum(y >= x)), by=.EACHI]
   x y1 y2 y3 y4
1: 1 10  9  5  3
2: 2  9  9  4  1

data.table version 1.12.9 used.

@ColeMiller1
Copy link
Contributor

@UweBlock that seems more logical 😀

I assume .iSD would by default remove the joining columns. I think the other question is whether .iSD is always allocated or whether it depends on NSE detecting the use of get, mget, eval, or .iSD. To be consistent with .SD memory allocation approach, I assume we would use NSE.

@MichaelChirico MichaelChirico added the top request One of our most-requested issues label Apr 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request joins top request One of our most-requested issues
Projects
None yet
Development

No branches or pull requests

8 participants