Multi-threaded group by #2533
I thought this was implemented already, but apparently not:
@mattdowle is it not strange that the "unthreaded" version is in fact going faster? (I found this pretty consistently.)
@MichaelChirico could you retry on your machine to see if you still experience the slowdown (after swapping the order of the calls as well)? The second call can sometimes be a little faster due to caching (which is why my timings above are close to equal). Grouping is not parallelized, but we still need to track the slowdown in the cases you presented. Recent improvements to gfunctions might have fixed it.
Hmm, I got different timings; I may have run that from my other machine (more RAM).
Please can we have a new policy: when posting timings or benchmarks, verbose=TRUE should be set and the output included in the issue. The verbose output includes timings of some internal operations, and it really helps to see them up front. I'm looking at the result from 9 months ago above and wondering whether uniqlist was significant in those runs; there is a timing for it in the verbose output, and uniqlist was recently improved.
According to the timings below, the issue is already resolved by Matt's rework of subsetting, already published in 1.12.0.

Code to reproduce:

```r
library(data.table)
NN = 1e9
set.seed(304093)
DT = data.table(grp = sample(8L, NN, TRUE),
                V = rpois(NN, 10), key = 'grp')
(default_threads = getDTthreads())
system.time(DT[ , mean(V), keyby = grp])
setDTthreads(1)
system.time(DT[ , mean(V), keyby = grp])
```

Results compare the current latest 1.12.1 against 1.11.6 on machine A (20 cores) and machine B (40 cores).
I wonder if it's technically possible to do a multi-threaded group-by? I tested multi-threaded group-by in Julia using a divide-and-conquer algorithm and was able to make sum-by faster. So if data.table had a multi-threaded group-by, things should speed up even more.
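A minimal sketch of that divide-and-conquer structure (in Python for illustration; the function names here are hypothetical, not data.table's API). The per-chunk partial-aggregation step is the part that would run on separate threads; the final merge combines the per-chunk results:

```python
def partial_sums(grp, v):
    """Sum v within each group over one chunk (the parallelizable step)."""
    acc = {}
    for g, x in zip(grp, v):
        acc[g] = acc.get(g, 0) + x
    return acc

def merge(parts):
    """Combine per-chunk partial sums into final per-group totals."""
    out = {}
    for p in parts:
        for g, s in p.items():
            out[g] = out.get(g, 0) + s
    return out

def sum_by(grp, v, nchunks=4):
    """Divide-and-conquer sum-by: split, partial-aggregate, merge."""
    n = len(grp)
    bounds = [n * i // nchunks for i in range(nchunks + 1)]
    parts = [partial_sums(grp[a:b], v[a:b])
             for a, b in zip(bounds, bounds[1:])]
    return merge(parts)

print(sum_by([1, 2, 1, 2, 3], [1.0, 2.0, 3.0, 4.0, 5.0]))
# → {1: 4.0, 2: 6.0, 3: 5.0}
```

Because sum is associative and commutative, the merge gives the same answer regardless of how the rows are split across chunks; the same pattern works for count and mean (carried as sum and count), which is why sum-by is a natural first candidate for this approach.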