
thread oversubscription for CPU-only TensorFlow installations (unless $OMP_NUM_THREADS is set) #2577

Closed
guimou opened this issue Sep 14, 2021 · 14 comments · Fixed by #2583

@guimou

guimou commented Sep 14, 2021

The TensorFlow module 2.6.0-foss-2021a (at least) is significantly slower than its pip-installed counterpart. Several tests were done (see [1]); here is a summary.

All tests were done on the same machine, from a JupyterLab environment running in a container with Python 3.9.7. The container was restarted before every test, so the environment was fully reinitialized. The same TF version (2.6.0) was used for both the module and pip. The full test code, results, and perf run results are available at [2].

Test 1: basic TF python script tensorflow_test.py

  • With TF module: average epoch time 4s / 2ms per step. CPUs running @100%
  • With pip-installed TF: average epoch time 1s / 625µs per step. CPUs running @~50%
  • With TF module setting OMP_NUM_THREADS=1 before the run: 1s / 618µs per step. CPUs running @~50%

Test 2: more demanding TF python script addition_rnn.py

  • With TF module: average epoch time 12s / 9ms per step. CPUs running @100%
  • With pip-installed TF: average epoch time 7s / 5ms per step. CPUs running @~50%
  • With TF module setting OMP_NUM_THREADS=1 before the run: 6s / 4ms per step. CPUs running @~50%

So clearly the TF module is really slow by default, but OMP_NUM_THREADS=1 brings it on par with, and even slightly ahead of, the pip version.
Should that be made a default setting, or should the module be compiled differently?

[1] https://easybuild.slack.com/archives/C34UA1HT7/p1631561763481600
[2] tensorflow_benchmark.tar.gz
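
For reference, the OMP_NUM_THREADS=1 runs simply set the variable before launching the script. A rough in-script equivalent, as a minimal sketch (not part of the benchmark code; the variable must be set before TensorFlow, and thus the OpenMP runtime, initializes):

import os

# Pin the OpenMP thread count before TensorFlow (and oneDNN's OpenMP runtime)
# is initialized; setting it any later has no effect.
os.environ.setdefault('OMP_NUM_THREADS', '1')

import tensorflow as tf  # imported only after the environment is adjusted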

@boegel
Member

boegel commented Sep 14, 2021

@Flamefire Anything to add here? You mentioned something about our builds using OpenMP, while the pre-built pip wheels are using a threadpool?

We can of course consider setting OMP_NUM_THREADS to 1 in the TensorFlow modules, but I'm not sure that's a good idea...

@boegel
Member

boegel commented Sep 14, 2021

oneapi-src/oneDNN#342 has some relevant information on this; we're not the only ones hitting this...

It's worth noting that with $OMP_NUM_THREADS unset, you get:

Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.

while with $OMP_NUM_THREADS set to 1, you get:

Creating new thread pool with default inter op setting: 24. Tune using inter_op_parallelism_threads for best performance.

I'm seeing even better performance when using $OMP_NUM_THREADS set to 2 which yields:

Creating new thread pool with default inter op setting: 12. Tune using inter_op_parallelism_threads for best performance.

This was on a dual-socket AMD Rome 48-core system (so 96 cores in total), in a Slurm job where 24 cores are available.

So while the problem can be mitigated via $OMP_NUM_THREADS (if you realize there is a problem...), TensorFlow is essentially doing something stupid when $OMP_NUM_THREADS is left unset: it overloads the cores by forking twice (?) as many threads as there are cores available, and performance is then killed by the increase in context switches.

At least, that's the case in the configuration we build it in; there's a different threading runtime you can build mkl-dnn/oneDNN with, see DNNL_CPU_RUNTIME at https://github.com/oneapi-src/oneDNN#linux.

@boegel
Member

boegel commented Sep 14, 2021

More useful info in tensorflow/tensorflow#29968.

There's a small bit of Python code shared there that reads /proc/<PID>/status; you can add it to your script to check whether you're massively overloading the available cores by starting too many threads, which leads to a spike in context switches.

I tweaked it a bit to only show the relevant info:

import os

# Read this process's /proc/<PID>/status and print only the thread count and
# the context-switch counters.
with open(os.path.join('/proc', str(os.getpid()), 'status')) as fp:
    lines = fp.read().split('\n')
    print('\n'.join(x for x in lines if x.startswith('Threads') or 'ctxt' in x))

If you include this both at the top and at the bottom of the script, you can tell how many threads were started and how many context switches occurred.

For me, in a 24-core Slurm job on an AMD Rome system:

  • without $OMP_NUM_THREADS set: 99 threads, ~13k voluntary_ctxt_switches, 203 nonvoluntary_ctxt_switches
  • with $OMP_NUM_THREADS set to 1 (equivalent to inter_op_parallelism_threads=24?): 52 threads, ~13k voluntary_ctxt_switches, 54 nonvoluntary_ctxt_switches
  • with $OMP_NUM_THREADS set to 2 (equivalent to inter_op_parallelism_threads=12, gives slightly better performance): 53 threads, ~13k voluntary_ctxt_switches, 58 nonvoluntary_ctxt_switches

So even with $OMP_NUM_THREADS set to 1, there are more threads than there should be, so there may be room for further improvement.

@boegel
Member

boegel commented Sep 14, 2021

To get full control over the number of threads, you can also use $TF_NUM_INTEROP_THREADS and $TF_NUM_INTRAOP_THREADS, but the $OMP_NUM_THREADS value remains important as well.

There's a complex interplay between those three, and although the TensorFlow code says "Setting inter_op conservatively to avoid thread oversubscription", that clearly isn't working out very well...

Best result I obtained in my 24-core Slurm job on an AMD Rome system was with:

TF_NUM_INTEROP_THREADS=2
TF_NUM_INTRAOP_THREADS=1
OMP_NUM_THREADS=6

That resulted in 22 threads, and gave 741-772 us/step (vs ~1ms/step with only $OMP_NUM_THREADS set to 1 (52 threads), and 2-3ms/step with $OMP_NUM_THREADS unset (99 threads)).

No doubt the best combo is heavily dependent on the script and the system configuration (which CPUs, # cores, etc.)...
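
For completeness, the same inter-op/intra-op settings can also be applied programmatically through TensorFlow's public threading API instead of the environment variables. A minimal sketch, using the values from the job above (which are certainly not universally optimal); the calls must run before TensorFlow executes its first op:

import os

# OMP_NUM_THREADS still governs the OpenMP threads used by the oneDNN kernels
# in this build, so it has to be set before TensorFlow initializes.
os.environ.setdefault('OMP_NUM_THREADS', '6')

import tensorflow as tf

# Programmatic equivalents of TF_NUM_INTEROP_THREADS / TF_NUM_INTRAOP_THREADS;
# they must be called before the first TensorFlow operation runs.
tf.config.threading.set_inter_op_parallelism_threads(2)
tf.config.threading.set_intra_op_parallelism_threads(1)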

So, to conclude: should we let the TensorFlow modules set $OMP_NUM_THREADS to 1 (unless otherwise specified), perhaps with a warning along with it when the module gets loaded?

@akesandgren
Contributor

akesandgren commented Sep 14, 2021

Something along the lines of what we (HPC2N) already do for OpenBLAS: it sets OMP_NUM_THREADS if it's not already set.
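
A sketch of how that could look in a TensorFlow easyconfig, assuming an Lmod-based module tool; the modluafooter snippet below is illustrative, not the actual HPC2N solution:

# Hypothetical easyconfig addition: only set OMP_NUM_THREADS if the user
# hasn't set it already (Lua, interpreted by Lmod when the module is loaded).
modluafooter = """
if os.getenv("OMP_NUM_THREADS") == nil then
    setenv("OMP_NUM_THREADS", "1")
end
"""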

@guimou
Author

guimou commented Sep 14, 2021

Definitely not an expert, but even without tweaking TF_NUM_INTEROP_THREADS and TF_NUM_INTRAOP_THREADS, setting OMP_NUM_THREADS to 1 from the start already gives something that is at least on par with the standard pip package. That is what an "end user" like me would at least like to achieve, even if there is further room for improvement depending on the underlying infrastructure.

@boegel
Member

boegel commented Sep 14, 2021

The proper fix is probably to switch to building TensorFlow >= 2.3.0 with --config=mkl_threadpool rather than just --config=mkl as we do now, which is basically what was done for the pip version.

Then there's no longer any need to hard-set $OMP_NUM_THREADS to 1.
Although that helps, it's an environment variable with far too broad a scope: it may lead to other tools used in the same session running single-core, which is also not what you want.

See https://software.intel.com/content/www/us/en/develop/articles/deep-learning-with-avx512-and-dl-boost.html#inpage-nav-4-4 and mainly tensorflow/tensorflow#47279 (where they switched to this by default for TensorFlow >= 2.5.0).

I'm sure @Flamefire has some input on this too. :)

@boegel boegel transferred this issue from easybuilders/easybuild Sep 14, 2021
@boegel boegel added this to the next release (4.4.3?) milestone Sep 14, 2021
@Flamefire
Contributor

Flamefire commented Sep 15, 2021

Then there's no longer any need to hard-set $OMP_NUM_THREADS to 1.
Although that helps, it's an environment variable with far too broad a scope: it may lead to other tools used in the same session running single-core, which is also not what you want.

I'd rather patch TF to set this somewhere, because setting this env var globally is likely not a good idea: many users simply load all their tools/modules and then use several of them in succession.

The message Creating new thread pool ... not showing up in pip builds is a good hint. It comes from this code: https://github.com/tensorflow/tensorflow/blob/v2.6.0/tensorflow/core/common_runtime/process_util.cc#L131-L132
I.e. it will be shown if the inter_op thread count and the corresponding env var are not set, MKL is enabled, and ENABLE_ONEDNN_OPENMP is defined.

Checking how the TF pip package is built, I found https://github.com/tensorflow/tensorflow/blob/v2.6.0/.bazelrc#L578. Following the links, one can see that TF is built with AVX (not AVX2 or similar), which also explains the "This TensorFlow binary is optimized [...] CPU instructions in performance-critical operations: AVX2 FMA. To enable them [...]" message.
Most importantly: I don't see that they enable MKL(!)

Further investigation to verify this leads to https://github.com/tensorflow/tensorflow/blob/v2.6.0/tensorflow/core/util/util.cc#L129

--> TF pip passes neither --config=mkl nor --config=mkl_threadpool.

The difference between those two is that the former defines build_with_openmp=true and the latter build_with_mkl_opensource=true, the rest being equal.
This means that using the latter leaves ENABLE_ONEDNN_OPENMP undefined and produces no references to libiomp5.so from the LLVM repo.

Maybe we should simply NOT build with --config=mkl?
This would also affect ARM (#2574), because --config=mkl_aarch64 is basically --config=mkl without the references to libiomp5.so, but still using OpenMP.
For ARM we could instead use --define build_with_mkl_aarch64=true to match the x86 behavior without --config=mkl.

The pip version (with the Python/3.9.5-GCCcore-10.3.0 module) and the module version built without the mkl config now both perform at about 900 us/step.
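
As a cross-check, one can query whether a given TensorFlow installation was built with MKL/oneDNN-with-OpenMP support at all. A minimal sketch (test_util is not a public API, so this may change between TF versions):

# Prints True for builds configured with --config=mkl (or similar), and False
# for builds without it, such as the official pip wheels discussed above.
from tensorflow.python.framework import test_util

print('MKL enabled:', test_util.IsMklEnabled())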

@akesandgren
Contributor

akesandgren commented Sep 16, 2021

If I build TF 2.6.0 with 'with_mkl_dnn': False, it fails with this on an AMD EPYC:

At least 1 cpu tests failed:
//tensorflow/core/kernels/mkl:mkl_fused_batch_norm_op_test (took 5 hours 22 mins 27 secs)

That was with EB 4.4.2's easyblock, and develop is identical at the moment.

@Flamefire
Contributor

Flamefire commented Sep 16, 2021

@akesandgren This is expected. IIRC the test checks for transformations using MKL but does not take into account that those are disabled. Double-checking. See related issue tensorflow/tensorflow#46532

That is bad. It looks like there are many actual failures, with failed comparisons and even generated NaNs.
For --config=mkl_threadpool it is even worse: many tests fail with mismatches, segfaults, floating-point exceptions...

@Flamefire
Contributor

Asked at the SIG Build group: https://groups.google.com/a/tensorflow.org/g/build/c/RZhgZst-fgQ

@akesandgren
Contributor

So what should be our way forward on this?

@Flamefire
Contributor

Drop --config=mkl for TF 2.4+, as MKL is no longer used by TF and the flag has changed its meaning. Besides that, I guess with_mkl_dnn wasn't exactly correct anyway. At least for 2.4+, adding --config=mkl is not required to honor enabling mklDNN/oneDNN; prior to that, I don't know.

@boegel boegel changed the title Tensorflow module performance issue (much slower than pip version) thread oversubscription for CPU-only TensorFlow installations (unless $OMP_NUM_THREADS is set) Oct 13, 2021