Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for on-the-fly columns in on #1639

Open
MichaelChirico opened this issue Apr 10, 2016 · 10 comments
Open

Support for on-the-fly columns in on #1639

MichaelChirico opened this issue Apr 10, 2016 · 10 comments
Assignees
Labels
enhancement joins top request One of our most-requested issues

Comments

@MichaelChirico
Copy link
Member

The ability to create columns in on on-the-fly, like we can in j or by, would be appreciated.

As a usage example, this would have simplified my answer to the following self-join / moving-average-type answer:

http://stackoverflow.com/a/36534824/3576984

Reproduced here:

require(data.table)
date <- as.Date(c('2012-01-01','2012-01-01','2012-01-01',
                  '2012-01-02','2012-01-02','2012-01-03',
                  '2012-01-04','2012-01-05','2012-01-05',
                  '2012-01-06','2012-01-06','2012-01-06'))
email <- c('[email protected]', '[email protected]','[email protected]',
           '[email protected]', '[email protected]','[email protected]',
           '[email protected]','[email protected]','[email protected]',
           '[email protected]','[email protected]','[email protected]')
dt <- data.table(date, email)

dt[dt[ , .(date3=date, date2 = date - 2, email)], 
   on = .(date >= date2, date<=date3), 
   allow.cartesian = TRUE
   ][ , .(count = uniqueN(email)), 
      by = .(date = date + 2)]

Would have been prettier / easier to parse as:

dt[dt, on = .(date >= date-2, date<=date), allow.cartesian = TRUE
   ][ , .(count = uniqueN(email)), by = date]
@jangorecki
Copy link
Member

looks like a duplicate of #625

@aryoda
Copy link

aryoda commented May 4, 2016

Big vote for this enhancement since it makes the syntax of data.table shorter, more intuitive and more SQL alike.

@arunsrinivasan
Copy link
Member

arunsrinivasan commented Jun 25, 2016

@arunsrinivasan arunsrinivasan added this to the v2.0.0 milestone Jul 5, 2016
@arunsrinivasan arunsrinivasan self-assigned this Aug 24, 2016
@Henrik-P
Copy link

Henrik-P commented Mar 2, 2018

Because Arun started a list with SO posts which may be updated, I thought it I could add some more I stumbled over:

How to merge dataframes of unequal length based on time with buffer intervals in R?

Data.Table non-equi join with arithmetic operations

subset data.frame base on a time interval + or - list of dates

@franknarf1
Copy link
Contributor

To update: https://stackoverflow.com/q/52901099

@jaapwalhout
Copy link

To update: https://stackoverflow.com/q/53559119/2204410
(or close as a duplicate with one of the links above)

@UweBlock
Copy link
Contributor

@ColeMiller1
Copy link
Contributor

What would the API be? The difficult part is that users may want to use the newly created on condition. The evaluation would be easier if we could get users to use a helper function if they want to access the new column in j.

dt[dt, 
   on = .(date >= list(past_date = date-2), 
          date <= date), 
   allow.cartesian = TRUE]

The alternative would be like SQL and have the eval happen twice, once in on and then have users repeat in j

dt[dt,
   on = .(date >= date-2,
          date<=date), 
   .(x.date, past_date = i.date - 2),
   allow.cartesian = TRUE]

@UweBlock
Copy link
Contributor

UweBlock commented Mar 8, 2020

@MichaelChirico MichaelChirico added top request One of our most-requested issues and removed High labels Jun 7, 2020
@jangorecki
Copy link
Member

AFAIR bmerge already do shallow copies of data, so adding new columns there would be probably easiest way to achieve that. We just need a proper handling for is.call of LHS and RHS of each element of on argument using internal .parse_on, and then using that in bmerge. The most tricky part can be handling column names well. Eventually we could allow computed columns to be named, the same as we allow in by argument: .(date2 = date >= date-2, date<=date).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement joins top request One of our most-requested issues
Projects
None yet
Development

No branches or pull requests

10 participants