Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upcoming versions of base R eliminate the need for DT() functionality - consider eliminating? #5621

Open
msummersgill opened this issue Mar 24, 2023 · 17 comments
Labels
DT() top request One of our most-requested issues

Comments

@msummersgill
Copy link

As an experimental feature the placeholder _ can now also be used in the ‘rhs’ of a forward pipe |> expression as the first argument in an extraction call, such as _$coef. More generally, it can be used as the head of a chain of extractions, such as _$coef[[2]].

R-devel/News, Mon, 13 Feb 2023

In upcoming versions of R, syntax such as the following will be supported:

as.data.table(mtcars) |> 
  _[am == 1] |> 
  _[, .(maxhp = max(hp)), by = .(cyl)]

#    cyl maxhp
# 1:   6   175
# 2:   4   113
# 3:   8   335

With this in mind, I think it's worth reconsidering whether DT() should remain on the release roadmap.

Some arguments in favor of scrapping DT() are as follows:

That being said, this is just my two cents as a satisfied user of data.table interested in the long term success of the package. Happy to hear counter arguments, and open to the idea that the broader community may see enough value to finish the push to support DT().

@tdhock
Copy link
Member

tdhock commented Mar 27, 2023

I agree it would be good to remove if possible, to avoid user confusion, and to save dev time.

@MichaelChirico
Copy link
Member

given the number of headaches induced by what was supposed to be a "simple" wrapper, I would also be very happy to drop DT().

that said, I would pause on this issue until that syntax lands in a released version.

@MichaelChirico
Copy link
Member

@markfairbanks care to elaborate? it will be important feedback.

@markfairbanks
Copy link

markfairbanks commented Mar 27, 2023

I would love to have a DT() that would work on older versions of R (whether to v4.0.0 with the base pipe or with v3.4.0 with the magrittr pipe). It's good that they're adding the _ placeholder for the base pipe but that won't help very much with backwards compatibility for users that don't always have access to the newest version of R.

And as far as the current issues - every issue linked in the original comment goes back to the fact that DT() as currently constructed tries to work on data.frames. It seems much more logical to have DT() only work on data.tables instead of removing it entirely. Having it expand to data.frame capability can be added in a later version if people really want it (and then there would be more time to work out the kinks). Or if it's not something that's fixable it can just work on data.tables into the future.

Something like this:

library(data.table)

DT <- function(x, ...) {
  if (!is.data.table(x)) {
    stop("`x` must be a data.table")
  }
  x[...]
}

df <- data.table(x = 1:3, y = c("a", "a", "b"))

copy(df) |>
  DT(, double_x := x * 2) |>
  DT()
#>        x      y double_x
#>    <int> <char>    <num>
#> 1:     1      a        2
#> 2:     2      a        4
#> 3:     3      b        6

There wouldn't be anything "surprising" with how it works because it would work exactly like [.data.table but work better with piping.

@msummersgill
Copy link
Author

msummersgill commented Mar 27, 2023

that said, I would pause on this issue until that syntax lands in a released version.
@MichaelChirico

Agree - but perhaps might be a reason to hold back DT() from CRAN releases in the interim? It looks like this change made it into R 4.3 (news item is included in changelog of release candidate R-latest.tar.gz), which has a planned release date of April 21st, 2023, so there shouldn't be too long of a wait.

I would love to have a DT() that would work on older versions of R (whether to v4.0.0 with the base pipe or with v3.4.0 with the magrittr pipe).
@markfairbanks

Since backporting |> isn't an option, the only portable solution is to use magrittr, which has always allowed [ operations on the . placeholder. Does that address the desire here?

## Works R > 3.4 using magrittr
as.data.table(mtcars) %>%
  .[am == 1] %>% 
  .[, .(maxhp = max(hp)), by = .(cyl)]

@markfairbanks
Copy link

I think it still makes more sense to have a functional DT() - a version that is designed to work with pipes without any extra thought or placeholders 🤷‍♂️ In general the placeholder seems like a workaround, not integrated functionality.

DT() also makes everything much more approachable for newer R users trying to use piping with data.table or dplyr users making the switch.

@eddelbuettel
Copy link
Contributor

eddelbuettel commented Mar 31, 2023

As we all got used to working with the dev version of data.table() I too had some use cases for DT() and grew reasonably fond of it -- in particular for one (only partially done) project where I translated some code 'from that other paradigm'.

So with the caveat that I am not fully up to speed on any implementation headaches, my $0.02 are for shipping DT() as it is a nice usability improvement, especially for newbs.

@Kamgang-B
Copy link
Contributor

Kamgang-B commented Apr 2, 2023

I think that it would be useful to export DT() only when it will be mature enough to properly handle data.frame objects in general (DT() tested enough and all pending issues solved). Given the number of related issues, exporting DT() should not necessarily be considered for the upcoming data.table releases as it will take an important amount of time to solve all of them.

I personally believe that the most important benefit of DT() is the fact that it can be used with data.frame objects in general and not only with data.tables. If only data.table object is considered, I think that the upcoming R version eliminates the need for DT().

But I believe that DT() has one benefit that is not eliminated by the upcoming R version: it allows the use of data.table syntax on data.frame objects in general. I think that one of the main limitations of data.table so far is related to the fact that its syntax works only with data.table objects. So, solving the issues related to DT() will remove this limitation.

While piping using the square brackets (repeatedly) was hard to follow and kept indenting the codes to the right, the upcoming R version literally solves this issue. I really like the fact that the square brackets are vertically aligned (with no additional indentation).

In conclusion, I would suggest to keep DT() only if it will also support data.frames in general (and not only data.tables); otherwise, I think that the upcoming R eliminates the need for DT().

@jangorecki
Copy link
Member

Whenever size is not a problem you can just alias DT=as.data.table and use [

@tdhock
Copy link
Member

tdhock commented May 1, 2023

another couple pieces of information that may be relevant to this discussion

@tdhock
Copy link
Member

tdhock commented May 4, 2023

if we do keep DT, can someone (@markfairbanks or?) please volunteer to maintain it and review changes to it, by adding your name to the new/proposed CODEOWNERS file? #5629 (currently DT is defined in data.table.R but maybe would be better to move it to a separate file, to make it clear who is responsible for it)

@jangorecki
Copy link
Member

Anyway DT() is not going to land on CRAN as long as pending issues related to it are not being addressed. Putting it on CRAN having those issue would be a nightmare, and most likely new issues would be detected as well.
For new coming (non-patch) release what is currently planned is to un-export DT() as described in #5472

@sluga
Copy link
Contributor

sluga commented Feb 5, 2024

FYI in the survey about 2/3 of users indicated they'd find such a convenience function somewhat or very useful. (I've submitted a summary of results for publication on the data.table blog.)

@sluga
Copy link
Contributor

sluga commented Feb 5, 2024

One thing I don't like about the name DT is that it requires holding down the Shift key.
And you'd type that a lot if you write pipe-heavy code.

An alternative I'd propose is do. Short and easy to type, no Shift needed, makes sense semantically as well.
The biggest drawback I can think of is that it's already a function in dplyr. But it was marked as superseded in v1 (and it's not like dplyr has never taken a name that was already in use :)).

@TysonStanley
Copy link
Member

I agree that something simple (and lowercase) would be useful. do makes some sense but I think the function should be clearly tied to data.table and should be something with minimal conflicts with other packages (especially dplyr).

@sluga
Copy link
Contributor

sluga commented Feb 5, 2024

Since dt is already taken, maybe td? Probably too weird though.

I don't quite get why the name should be 'clearly tied to data.table'. If the package includes e.g. set, why not do?

@TysonStanley
Copy link
Member

Yeah I know what you mean. I don't think it would have to be dt or td or anything with the exact same name. Just meant having some clear connection so it's easier for our users to find this function and know what it does. do I think is expressive in what it's doing but if there are conflicts with popular packages, we are just making it harder for adoption in this context. But maybe it's not too bad since do is being abandoned in dplyr. Can't think of another short word that would be as expressive as do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DT() top request One of our most-requested issues
Projects
None yet
Development

No branches or pull requests

9 participants