[FR] gforce mean and sum in parallel #3042
Related: #2533
@st-pasha thanks, added to the first issue.
I posted here last week, but there were some mistakes in my post and it was long and confusing, so I deleted it again. Sorry for that. This is my second attempt: I have a function which I would like to apply in grouped-by fashion. Applying this function takes much longer than the optimised built-ins. Let's say I have data like

```r
new_dt <- function(n = 3e6) {
  data.table::data.table(
    a = rep(LETTERS, each = n),
    b = rep(letters, times = n),
    c = runif(length(letters) * n, 0, 1),
    d = runif(length(letters) * n, 1, 2),
    e = runif(length(letters) * n, 1, 3),
    f = rnorm(length(letters) * n, 0, 1),
    g = rnorm(length(letters) * n, 1, 2),
    h = rnorm(length(letters) * n, 2, 1)
  )
}
tbl <- new_dt()
```

and two functions

```r
col_var <- function(x) lapply(x, var)
plus_one <- function(x) lapply(x, function(y) y + 1)
```

Now if I split the data into n_core partitions (by adding a group index) and run in parallel with

```r
dt[GroupIndex == i, c(use_cols) := fun(.SD), by = group_by, .SDcols = use_cols]
```

this works fine. For something like this to work efficiently, we need a way to place a `data.table` in shared memory.
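The split-and-run-in-parallel approach described in this comment can be sketched roughly as follows. This is my own minimal sketch, not the poster's actual code: the small `tbl`, the `GroupIndex` assignment rule, and the `rbindlist` recombination are all assumptions, and `parallel::mclapply` forks, so `mc.cores > 1` only works on Unix-like systems.

```r
library(data.table)
library(parallel)

# Toy stand-in for the large table above (my own small data)
tbl <- data.table(grp = rep(1:4, each = 25), val = runif(100))
n_cores <- if (.Platform$OS.type == "unix") 2L else 1L

# Assign each by-group to one of n_cores partitions
tbl[, GroupIndex := grp %% n_cores + 1L]

# Run the grouped computation per partition in forked workers,
# then recombine the partial results
parts <- mclapply(seq_len(n_cores), function(i) {
  tbl[GroupIndex == i, .(v = var(val)), by = grp]
}, mc.cores = n_cores)
res <- rbindlist(parts)
setorder(res, grp)
```

Note that this only works here because each by-group lives entirely in one partition; with `:=` (update by reference) the forked workers would each modify their own copy, which is exactly the shared-memory problem raised in this comment.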
Some of my experiment code is available from here.
@nbenn Also, please note that many groupby tasks do not actually require putting the data.table into shared memory and forking the R process.
Sure, this isn't great. You could make the grouping column an index, or create a vector like `grp_vec <- parallel::splitIndices(nrow(dt), n_cores)` and use it to subset the data.table.
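The `splitIndices` idea can be sketched like this. The toy data and the sum-then-recombine step are my own additions; `lapply` stands in for `mclapply`/`parLapply` to keep the sketch portable. A caveat worth stating: `splitIndices` cuts the row range into contiguous chunks without regard for group boundaries, so a group may straddle two chunks, and only statistics that can be recombined from partial results (sum, count, ...) are safe this way.

```r
library(data.table)
library(parallel)

set.seed(1)
dt <- data.table(g = sample(letters[1:4], 100, replace = TRUE), x = runif(100))
n_cores <- 2L

# Split the row range into n_cores contiguous, roughly equal chunks
chunks <- splitIndices(nrow(dt), n_cores)

# Aggregate each chunk separately, then re-aggregate the partial sums
# (swap lapply for mclapply/parLapply to actually run in parallel)
parts <- lapply(chunks, function(idx) dt[idx, .(s = sum(x)), by = g])
res <- rbindlist(parts)[, .(s = sum(s)), by = g]
```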
What do you mean by many groupby tasks? A handful of functions like `mean` and `sum`?
@nbenn Great, it helps to clear things up -- that's why we have this discussion in the first place!

When you say "what do you mean by many groupby tasks" -- I really do mean the handful of functions like `mean` and `sum`. Which is NOT to say that your use case is somehow less important -- clearly it is important to you and to many people like you. So it should be implemented too.

So, I think this discussion helped us understand the following: there are (at least) 2 use cases for "parallel grouping": (1) simple group-by reductions like `mean` and `sum`, and (2) applying an arbitrary user function to each group.
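The distinction between these two cases can be seen directly in data.table: a fixed set of recognised reductions (`mean`, `sum`, `min`, `max`, ...) is eligible for the optimised GForce path, while an arbitrary user function forces `j` to be evaluated once per group. A small sketch (my own toy data; running the first query with `verbose = TRUE` prints data.table's report of whether GForce kicked in):

```r
library(data.table)

dt <- data.table(g = rep(letters[1:3], each = 10), x = runif(30))

# Case (1): a recognised reduction -- eligible for the GForce fast path
m1 <- dt[, .(m = mean(x)), by = g]

# Case (2): an arbitrary user function -- evaluated once per group,
# since data.table cannot see inside the wrapper
f <- function(v) mean(v)
m2 <- dt[, .(m = f(x)), by = g]
```

Both return the same result; the difference is purely in which evaluation path (and hence how much per-group overhead) is used.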
Yes, @st-pasha, I agree that splitting this issue makes sense. The reason I posted here is that there are already quite a few issues surrounding this topic of "parallel group-by operations", and I did not want to create yet another one. But as the two scenarios you outline above require different approaches, it does make sense. Thank you for engaging in this discussion.
The focus of this FR at the top was the 5 grouping tests on https://h2oai.github.io/db-benchmark/.
As highlighted in the recently updated grouping benchmark https://h2oai.github.io/db-benchmark/, data.table is already lagging behind some other tools, specifically those that can perform aggregation using multiple cores. To keep up with the competition we need to parallelize grouping.
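For context, data.table's internal parallelism (where it exists) is controlled through its OpenMP thread settings. `setDTthreads` and `getDTthreads` are real data.table functions; a quick sketch of inspecting and capping them:

```r
library(data.table)

# Inspect and cap data.table's internal OpenMP thread usage
old <- getDTthreads()   # current thread count
setDTthreads(1L)        # restrict to a single thread
setDTthreads(0L)        # 0 restores the default (use available cores)
```

At the time of this issue these settings governed already-parallelised internals (such as sorting and `fwrite`); the point of this FR is to bring grouped aggregation under the same umbrella.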
Related issues:
- #2919: `+`, `sum` and many others aggregate but do not group by (parallelism applied on a different, lower-level loop)

We should try to make it for 1.12.0.