default reduced to half of all logical CPUs #3435
Codecov Report
@@ Coverage Diff @@
## master #3435 +/- ##
==========================================
- Coverage 95.06% 95.05% -0.01%
==========================================
Files 65 65
Lines 12249 12254 +5
==========================================
+ Hits 11644 11648 +4
- Misses 605 606 +1
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #3435 +/- ##
==========================================
+ Coverage 95.06% 95.09% +0.03%
==========================================
Files 65 65
Lines 12249 12302 +53
==========================================
+ Hits 11644 11699 +55
+ Misses 605 603 -2
Continue to review full report at Codecov.
AFAIK, most users operate on small data (thousands to millions of rows, tens of columns). Using half versus the full number of logical cores does not show a significant difference in computing time when I test with data of that size on my server.
When the data is very large (e.g. 1 billion rows and hundreds of columns), using half versus full computing capacity may show a greater difference (though not as much as I expected when I tested on my 40-core server with 70M+ rows and 50+ columns). However, I believe most users operating on data of that size are working on servers like mine, or the data simply cannot fit into memory. And this is exactly the case where using all cores causes a problem, since it is unlikely that such servers are used by only one person.
I'm perfectly okay with this change.
I haven't noticed much difference between 4 and 8 threads on my laptop either (I have 4 cores). Our algos are mostly cache-bound.
I personally think the default should be single-core since, as @renkun-ken remarked, more cores currently offer only a small performance benefit on commodity hardware (with the possible exception of fread/fwrite), but can be much slower if there is other parallelism going on. In the latter case, the parallelism can't really be managed a priori; the operator would need to set the cores according to the particular apparatus being used. This is not really a problem with data.table -- I see it in lots of places, even outside R.

One thing that strikes me about the present is the high flux of packages moving from single-threaded to parallel operations. In isolation this can provide performance benefits, but their interaction is a bit tricky. One can imagine a scenario where a user is using data.table with its implicit parallelism together with another package X. Package X then moves from single-core to multi-core and there's a performance regression. This can be pretty tricky to track down, especially since the NEWS in package X will say the functions should be faster.

I think fread and fwrite are special though: the performance benefit appears to be large and they are typically less likely to be used in combination with other packages. Thanks for inviting me to review. I've approved because I think the PR represents an improvement.
I think it strongly depends on the data, the query, and how the other processes utilise resources. Single-core vs multicore makes a significant difference, at least for the queries we are benchmarking. The huge time improvement you can see below comes from parallel forder and parallel aggregations. Without those multicore improvements we would be behind Spark and pydatatable (both do aggregations using multiple cores). The machine we are using for that has (at the moment) 20 CPUs and 125 GB memory. Once this PR is merged I will run the benchmark so we can see how halving the CPUs impacts speed.
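For anyone who wants to reproduce this kind of comparison locally, a sketch along these lines works (data size and query here are illustrative, not the actual db-bench queries; `setDTthreads()` and `getDTthreads()` are data.table's thread controls):

```r
library(data.table)

# Illustrative data; the benchmark machine uses far larger inputs
N  <- 1e7
dt <- data.table(id = sample(1e5L, N, TRUE), v = rnorm(N))

# Time the same grouped aggregation at 1 thread vs the current default
for (th in c(1L, getDTthreads())) {
  setDTthreads(th)
  t <- system.time(dt[, .(s = sum(v), m = mean(v)), by = id])["elapsed"]
  cat(th, "thread(s):", t, "s\n")
}
```

On small or cache-bound workloads the two timings are often close, which is consistent with the observations above.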
Glad we're all on the same page. I'm surprised by Hugh's comment that more threads don't often help other than for fread/fwrite. I suspect what's going on is a combination of recent improvements that really show on large data (db-bench), plus some aspects that now cause significant slowdowns on small/iterated tasks which are otherwise fast anyway without needing parallelism. For example, I still need to look at uniqueN by group as raised here: #3395 (comment). So this PR will not fully close #3395; there'll be a follow-up issue for that. I was going to add the
What about the recent @HughParsonage comment?
There are some failures on Linux machines provided by GitLab; they might be single-CPU instances, not sure.
db-bench is running on 50% of threads as expected.
@jangorecki Very interesting, thanks! Maybe something like 80% is optimal on the server?
Certainly, for full disclosure, insights we learn about tuning here should be included in the benchmarks page.
Actually this is one of the toughest parts of using e.g. Spark -- picking levels of any of 30-40 hyperparameters for tuning your job without much in the way of actionable guidance -- so anything we can offer there would also be a big social benefit.
Also, one could argue that if we have done this tuning for data.table and not the other packages, we're giving ourselves an unfair leg up...
@MichaelChirico We will use all cores in the benchmark, where available, same as for Spark and the others. I doubt using less than 100% has ever resulted in better performance in that environment. There are generally no other processes running in the background.
@jangorecki Those numbers don't look significant to me. Even with 10 billion groups the differences are only fractions of a second. And this is under optimal settings for parallelism: no other processes, no nested parallelism. I get no significant differences with a 'normal' setup (i.e. some other processes running) even with the aggregations most suited to parallelization. That is, when testing execution times of the same operation, sometimes the parallel version is faster, sometimes it's slower; never more than 1% or so. My argument for a single-core default is that the small gains under favourable conditions are outweighed by the risk of enormous performance regressions under unforeseen unfavourable conditions.
Those numbers are up to 1 billion rows. The differences are up to 29% when using 50% of logical cores; not a huge difference, and for ordinary production 50% of logical cores will be good enough, but for the benchmark we should aim for the fastest possible option. I will run the benchmarks using a single core so we have a bigger picture of the performance cost of such a change.
@HughParsonage The columns that look like fractions of seconds are actually percentages, I think.
[Benchmark tables omitted: top 5 speed-ups (actually the top 5 smallest slow-downs) and top 5 slow-downs, by query.]
Closes #3298
Closes #3395
I investigated various methods for determining the number of cores. The one referred to in #3395 (`RhpcBLASctl::get_num_cores()`) seems like the only one that actually works; `parallel::detectCores()` doesn't work for me. But `RhpcBLASctl::get_num_cores()` reads system files on Linux to do it, so it could possibly be slow, and it would incur a new dependency. So I figured why not keep it very simple and default to 50% of logical CPUs. This leaves plenty of room for other processes. If the user wants to use more or fewer, they can set OMP_NUM_THREADS up to the number of logical CPUs, or call `setDTthreads()`.