
thread oversubscription for CPU-only TensorFlow installations (unless $OMP_NUM_THREADS is set) #2577

Closed
guimou opened this issue Sep 14, 2021 · 14 comments · Fixed by #2583

@guimou

guimou commented Sep 14, 2021

The TensorFlow module 2.6.0-foss-2021a (at least) is significantly slower than its pip-installed counterpart. Several tests were done (see [1]); here is a summary.

All tests were done on the same machine, from a JupyterLab environment running in a container with Python 3.9.7. The container was restarted before every test, so the environment was fully reinitialized. The same TF version (2.6.0) was used for both the module and pip. The full test code, results, and perf run results are available at [2].

Test 1: basic TF python script tensorflow_test.py

  • With TF module: average epoch time 4s / 2ms per step. CPUs running @100%
  • With pip-installed TF: average epoch time 1s / 625µs per step. CPUs running @~50%
  • With TF module setting OMP_NUM_THREADS=1 before the run: 1s / 618µs per step. CPUs running @~50%

Test 2: more demanding TF python script addition_rnn.py

  • With TF module: average epoch time 12s / 9ms per step. CPUs running @100%
  • With pip-installed TF: average epoch time 7s / 5ms per step. CPUs running @~50%
  • With TF module setting OMP_NUM_THREADS=1 before the run: 6s / 4ms per step. CPUs running @~50%

So clearly the TF module is really slow by default, but OMP_NUM_THREADS=1 brings it on par with, and even slightly ahead of, the pip version.
Should that be made a default setting, or should the module be compiled differently?

[1] https://easybuild.slack.com/archives/C34UA1HT7/p1631561763481600
[2] tensorflow_benchmark.tar.gz
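
For reference, the OMP_NUM_THREADS=1 runs simply set the variable before launching the script. A rough in-script equivalent, as a minimal sketch (not part of the benchmark code; the variable must be set before TensorFlow, and thus the OpenMP runtime, initializes):

import os

# Pin the OpenMP thread count before TensorFlow (and oneDNN's OpenMP runtime)
# is initialized; setting it any later has no effect.
os.environ.setdefault('OMP_NUM_THREADS', '1')

import tensorflow as tf  # imported only after the environment is adjusted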

@boegel
Member

boegel commented Sep 14, 2021

@Flamefire Anything to add here? You mentioned something about our builds using OpenMP, while the pre-built pip wheels are using a threadpool?

We can of course consider setting OMP_NUM_THREADS to 1 in the TensorFlow modules, but I'm not sure that's a good idea...

@boegel
Member

boegel commented Sep 14, 2021

oneapi-src/oneDNN#342 has some relevant information on this; we're not the only ones hitting this...

It's worth noting that with $OMP_NUM_THREADS unset, you get:

Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.

while with $OMP_NUM_THREADS set to 1, you get:

Creating new thread pool with default inter op setting: 24. Tune using inter_op_parallelism_threads for best performance.

I'm seeing even better performance when using $OMP_NUM_THREADS set to 2 which yields:

Creating new thread pool with default inter op setting: 12. Tune using inter_op_parallelism_threads for best performance.

This was on a dual-socket AMD Rome 48-core system (so 96 cores in total), in a Slurm job where 24 cores are available.

So while the problem can be mitigated via $OMP_NUM_THREADS (if you realize there is a problem...), TensorFlow is essentially doing something stupid when $OMP_NUM_THREADS is left unset: it overloads the cores by forking twice (?) as many threads as there are cores available, and performance is then killed by the increase in context switches.

At least, that's the case in the configuration we build it in; there's a different threading runtime you can build mkl-dnn/oneDNN with, see DNNL_CPU_RUNTIME at https://github.com/oneapi-src/oneDNN#linux.

@boegel
Member

boegel commented Sep 14, 2021

More useful info in tensorflow/tensorflow#29968.

There's a small bit of Python code shared there that reads /proc/<PID>/status; you can add it to your script to check whether you're massively overloading the available cores by starting too many threads, which leads to a spike in context switches.

I tweaked it a bit to only show the relevant info:

import os

# Read this process's /proc/<PID>/status and print only the thread count and
# the context-switch counters.
with open(os.path.join('/proc', str(os.getpid()), 'status')) as fp:
    lines = fp.read().split('\n')
    print('\n'.join(x for x in lines if x.startswith('Threads') or 'ctxt' in x))

If you include this both at the top and at the bottom of the script, you can tell how many threads were started and how many context switches occurred.

For me, in a 24-core Slurm job on an AMD Rome system:

  • without $OMP_NUM_THREADS set: 99 threads, ~13k voluntary_ctxt_switches, 203 nonvoluntary_ctxt_switches
  • with $OMP_NUM_THREADS set to 1 (equivalent to inter_op_parallelism_threads=24?): 52 threads, ~13k voluntary_ctxt_switches, 54 nonvoluntary_ctxt_switches
  • with $OMP_NUM_THREADS set to 2 (equivalent to inter_op_parallelism_threads=12, gives slightly better performance): 53 threads, ~13k voluntary_ctxt_switches, 58 nonvoluntary_ctxt_switches

So even with $OMP_NUM_THREADS set to 1, there are more threads than there should be, so there may be room for further improvement.

@boegel
Member

boegel commented Sep 14, 2021

To get full control over the number of threads, you can also use $TF_NUM_INTEROP_THREADS and $TF_NUM_INTRAOP_THREADS, but the $OMP_NUM_THREADS value remains important as well.

There's a complex interplay between those three, and although the TensorFlow code says "Setting inter_op conservatively to avoid thread oversubscription", that clearly isn't working out very well...

Best result I obtained in my 24-core Slurm job on an AMD Rome system was with:

TF_NUM_INTEROP_THREADS=2
TF_NUM_INTRAOP_THREADS=1
OMP_NUM_THREADS=6

That resulted in 22 threads, and gave 741-772 us/step (vs ~1ms/step with only $OMP_NUM_THREADS set to 1 (52 threads), and 2-3ms/step with $OMP_NUM_THREADS unset (99 threads)).

No doubt the best combo is heavily dependent on the script and the system configuration (which CPUs, # cores, etc.)...
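
For completeness, the same inter-op/intra-op settings can also be applied programmatically through TensorFlow's public threading API instead of the environment variables. A minimal sketch, using the values from the job above (which are certainly not universally optimal); the calls must run before TensorFlow executes its first op:

import os

# OMP_NUM_THREADS still governs the OpenMP threads used by the oneDNN kernels
# in this build, so it has to be set before TensorFlow initializes.
os.environ.setdefault('OMP_NUM_THREADS', '6')

import tensorflow as tf

# Programmatic equivalents of TF_NUM_INTEROP_THREADS / TF_NUM_INTRAOP_THREADS;
# they must be called before the first TensorFlow operation runs.
tf.config.threading.set_inter_op_parallelism_threads(2)
tf.config.threading.set_intra_op_parallelism_threads(1)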

So, to conclude: should we let the TensorFlow modules set $OMP_NUM_THREADS to 1 (unless otherwise specified), perhaps with a warning along with it when the module gets loaded?

@akesandgren
Contributor

akesandgren commented Sep 14, 2021

Something along the lines of what we (HPC2N) already do for OpenBLAS: it sets OMP_NUM_THREADS if it's not already set.
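
A sketch of how that could look in a TensorFlow easyconfig, assuming an Lmod-based module tool; the modluafooter snippet below is illustrative, not the actual HPC2N solution:

# Hypothetical easyconfig addition: only set OMP_NUM_THREADS if the user
# hasn't set it already (Lua, interpreted by Lmod when the module is loaded).
modluafooter = """
if os.getenv("OMP_NUM_THREADS") == nil then
    setenv("OMP_NUM_THREADS", "1")
end
"""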

@guimou
Author

guimou commented Sep 14, 2021

Definitely not an expert, but even without tweaking TF_NUM_INTEROP_THREADS and TF_NUM_INTRAOP_THREADS, setting OMP_NUM_THREADS to 1 from the start already gives something that is at least on par with the standard pip package. That is what an "end user" like me would at least like to achieve, even if there is further room for improvement depending on the underlying infrastructure.

@boegel
Member

boegel commented Sep 14, 2021

The proper fix is probably to switch to building TensorFlow >= 2.3.0 with --config=mkl_threadpool rather than just --config=mkl as we do now, which is basically what was done for the pip version.

Then there's no longer any need to hard-set $OMP_NUM_THREADS to 1.
Although that helps, it's an environment variable with far too broad a scope: it may lead to other tools used in the same session running single-core, which is also not what you want.

See https://software.intel.com/content/www/us/en/develop/articles/deep-learning-with-avx512-and-dl-boost.html#inpage-nav-4-4 and mainly tensorflow/tensorflow#47279 (where they switched to this by default for TensorFlow >= 2.5.0).

I'm sure @Flamefire has some input on this too. :)

@boegel boegel transferred this issue from easybuilders/easybuild Sep 14, 2021
@boegel boegel added this to the next release (4.4.3?) milestone Sep 14, 2021
@Flamefire
Contributor

Flamefire commented Sep 15, 2021

Then there's no longer any need to hard-set $OMP_NUM_THREADS to 1.
Although that helps, it's an environment variable with far too broad a scope: it may lead to other tools used in the same session running single-core, which is also not what you want.

I'd rather patch TF to set this somewhere, because setting this env var globally is likely not a good idea: many users simply load all their tools/modules and then use several of them in succession.

The message Creating new thread pool ... not showing up in pip builds is a good hint. It comes from this code: https://github.com/tensorflow/tensorflow/blob/v2.6.0/tensorflow/core/common_runtime/process_util.cc#L131-L132
I.e. it will be shown if the inter_op thread count and the corresponding env var are not set, MKL is enabled, and ENABLE_ONEDNN_OPENMP is defined.

Checking how the TF pip package is built, I found https://github.com/tensorflow/tensorflow/blob/v2.6.0/.bazelrc#L578. Following the links, one can see that TF is built with AVX (not AVX2 or similar), which also explains the "This TensorFlow binary is optimized [...] CPU instructions in performance-critical operations: AVX2 FMA. To enable them [...]" message.
Most importantly: I don't see that they enable MKL(!)

Further investigation to verify this leads to https://github.com/tensorflow/tensorflow/blob/v2.6.0/tensorflow/core/util/util.cc#L129

--> TF pip passes neither --config=mkl nor --config=mkl_threadpool.

The difference between those two is that the former defines build_with_openmp=true and the latter build_with_mkl_opensource=true, the rest being equal.
This means that using the latter leaves ENABLE_ONEDNN_OPENMP undefined and produces no references to libiomp5.so from the LLVM repo.

Maybe we should simply NOT build with --config=mkl?
This would also affect ARM (#2574), because --config=mkl_aarch64 is basically --config=mkl without the references to libiomp5.so, but still using OpenMP.
For ARM we could instead use --define build_with_mkl_aarch64=true to match the x86 behavior without --config=mkl.

The pip version (with the Python/3.9.5-GCCcore-10.3.0 module) and the module version built without the mkl config now both perform at about 900 us/step.
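
As a cross-check, one can query whether a given TensorFlow installation was built with MKL/oneDNN-with-OpenMP support at all. A minimal sketch (test_util is not a public API, so this may change between TF versions):

# Prints True for builds configured with --config=mkl (or similar), and False
# for builds without it, such as the official pip wheels discussed above.
from tensorflow.python.framework import test_util

print('MKL enabled:', test_util.IsMklEnabled())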

@akesandgren
Contributor

akesandgren commented Sep 16, 2021

If I build TF 2.6.0 with 'with_mkl_dnn': False, it fails with this on an AMD EPYC:

At least 1 cpu tests failed:
//tensorflow/core/kernels/mkl:mkl_fused_batch_norm_op_test (took 5 hours 22 mins 27 secs)

That was with EB 4.4.2's easyblock, and develop is identical at the moment.

@Flamefire
Contributor

Flamefire commented Sep 16, 2021

@akesandgren This is expected. IIRC the test checks for transformations using MKL but does not take into account that those are disabled. Double-checking. See related issue tensorflow/tensorflow#46532

That is bad. It looks like there are many actual failures, with failed comparisons and even generated NaNs.
For --config=mkl_threadpool it is even worse: many tests fail with mismatches, segfaults, floating-point exceptions...

@Flamefire
Contributor

Asked at the SIG Build group: https://groups.google.com/a/tensorflow.org/g/build/c/RZhgZst-fgQ

@akesandgren
Contributor

So what should be our way forward on this?

@Flamefire
Contributor

Drop --config=mkl for TF 2.4+, as MKL is no longer used by TF and the flag has changed its meaning. Besides that, I guess with_mkl_dnn wasn't exactly correct anyway. At least for 2.4+, adding --config=mkl is not required to honor enabling mklDNN/oneDNN; prior to that, I don't know.

@boegel boegel changed the title Tensorflow module performance issue (much slower than pip version) thread oversubscription for CPU-only TensorFlow installations (unless $OMP_NUM_THREADS is set) Oct 13, 2021