default reduced to half of all logical CPUs #3435
Codecov Report
@@ Coverage Diff @@
## master #3435 +/- ##
==========================================
- Coverage 95.06% 95.05% -0.01%
==========================================
Files 65 65
Lines 12249 12254 +5
==========================================
+ Hits 11644 11648 +4
- Misses 605 606 +1
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #3435 +/- ##
==========================================
+ Coverage 95.06% 95.09% +0.03%
==========================================
Files 65 65
Lines 12249 12302 +53
==========================================
+ Hits 11644 11699 +55
+ Misses 605 603 -2
Continue to review full report at Codecov.
AFAIK, most users operate on small data (thousands to millions of rows, tens of columns). Using half versus the full number of logical cores does not show a significant difference in computing time when I test with data of that size on my server.
When the data is very large (e.g. 1 billion rows and hundreds of columns), using half versus full computing capacity may show a greater difference (though not as much as I expected when I tested on my 40-core server with 70M+ rows and 50+ columns). However, I believe most users operating on data of that size are working on servers like mine, or the data simply cannot fit into memory. And this is exactly the case where using all cores causes a problem, since it is unlikely that such servers are used by only one person.
I'm perfectly okay with this change.
I haven't noticed much difference between 4 and 8 threads on my laptop either (I have 4 cores). Our algos are mostly cache-bound.
I personally think the default should be single-core since, as @renkun-ken remarked, more cores currently offer only a small performance benefit on commodity hardware (with the possible exception of fread/fwrite), but can be much slower if there is other parallelism going on. In the latter case, the parallelism can't really be managed a priori; the operator would need to set the cores according to the particular apparatus being used. This is not really a problem with data.table -- I see it in lots of places, even outside R.

One thing that strikes me about the present is the high flux of packages moving from single-threaded to parallel operations. In isolation this can provide performance benefits, but their interaction is a bit tricky. One can imagine a scenario where a user is using data.table with its implicit parallelism together with another package X. Package X then moves from single-core to multi-core and there's a performance regression. This can be pretty tricky to track down, especially since the NEWS in package X will say the functions should be faster.

I think fread and fwrite are special though: the performance benefit appears to be large and they are typically less likely to be used in combination with other packages. Thanks for inviting me to review. I've approved because I think the PR represents an improvement.
I think it strongly depends on the data, the query, and how the other processes utilise resources. Single-core vs multicore makes a significant difference, at least for the queries we are benchmarking. The huge time improvement you can see below comes from parallel forder and parallel aggregations. Without those multicore improvements we would be behind Spark and pydatatable (both do aggregations using multiple cores). The machine we are using for that has (at the moment) 20 CPUs and 125 GB memory. Once this PR is merged I will run the benchmark so we can see how halving the CPUs impacts speed.
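For anyone who wants to reproduce this kind of comparison locally, a sketch along these lines works (data size and query here are illustrative, not the actual db-bench queries; `setDTthreads()` and `getDTthreads()` are data.table's thread controls):

```r
library(data.table)

# Illustrative data; the benchmark machine uses far larger inputs
N  <- 1e7
dt <- data.table(id = sample(1e5L, N, TRUE), v = rnorm(N))

# Time the same grouped aggregation at 1 thread vs the current default
for (th in c(1L, getDTthreads())) {
  setDTthreads(th)
  t <- system.time(dt[, .(s = sum(v), m = mean(v)), by = id])["elapsed"]
  cat(th, "thread(s):", t, "s\n")
}
```

On small or cache-bound workloads the two timings are often close, which is consistent with the observations above.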
Glad we're all on the same page. I'm surprised by Hugh's comment that more threads don't often help other than for fread/fwrite. I suspect what's going on is a combination of recent improvements that really show on large data (db-bench), plus some aspects that now cause significant slowdowns on small/iterated tasks which are otherwise fast anyway without needing parallelism. For example, I still need to look at uniqueN by group as raised here: #3395 (comment). So this PR will not fully close #3395; there'll be a follow-up issue for that. I was going to add the
What about the recent @HughParsonage comment?
There are some failures on Linux machines provided by GitLab; they might be single-CPU instances, not sure.
db-bench is running on 50% of threads as expected.
@jangorecki Very interesting, thanks! Maybe something like 80% is optimal on the server?
Certainly, for full disclosure, insights we learn about tuning here should be included in the benchmarks page.
Actually this is one of the toughest parts of using e.g. Spark -- picking levels of any of 30-40 hyperparameters for tuning your job without much in the way of actionable guidance -- so anything we can offer there would also be a big social benefit.
Also, one could argue that if we have done this tuning for data.table and not the other packages, we're giving ourselves an unfair leg up...
@MichaelChirico We will use all cores in the benchmark, where available, same as for Spark and the others. I doubt using less than 100% has ever resulted in better performance in that environment. There are generally no other processes running in the background.
@jangorecki Those numbers don't look significant to me. Even with 10 billion groups the differences are only fractions of a second. And this is under optimal settings for parallelism: no other processes, no nested parallelism. I get no significant differences with a 'normal' setup (i.e. some other processes running) even with the aggregations most suited to parallelization. That is, when testing execution times of the same operation, sometimes the parallel version is faster, sometimes it's slower; never more than 1% or so. My argument for a single-core default is that the small gains under favourable conditions are outweighed by the risk of enormous performance regressions under unforeseen unfavourable conditions.
Those numbers are up to 1 billion rows. The differences are up to 29% when using 50% of logical cores; not a huge difference, and for ordinary production 50% of logical cores will be good enough, but for the benchmark we should aim for the fastest possible option. I will run the benchmarks using a single core so we have a bigger picture of the performance cost of such a change.
@HughParsonage The columns that look like fractions of seconds are actually percentages, I think.
[Benchmark tables omitted: top 5 speed-ups (actually the top 5 smallest slow-downs) and top 5 slow-downs, by query.]
Closes #3298
Closes #3395
I investigated various methods for determining the number of cores. The one referred to in #3395 (`RhpcBLASctl::get_num_cores()`) seems like the only one that actually works; `parallel::detectCores()` doesn't work for me. But `RhpcBLASctl::get_num_cores()` reads system files on Linux to do it, so it could possibly be slow, and it would incur a new dependency. So I figured why not keep it very simple and default to 50% of logical CPUs. This leaves plenty of room for other processes. If the user wants to use more or fewer, they can set OMP_NUM_THREADS up to the number of logical CPUs, or call `setDTthreads()`.