thread oversubscription for CPU-only TensorFlow installations (unless $OMP_NUM_THREADS is set) #2577
@Flamefire Anything to add here? You mentioned something about our builds using OpenMP, while the pre-built pip wheels don't. We can of course consider setting $OMP_NUM_THREADS in the TensorFlow modules...
oneapi-src/oneDNN#342 has some relevant information on this; we're not the only ones hitting this...

It's worth noting that there's a significant performance difference between leaving $OMP_NUM_THREADS unset and setting it to 1, and I'm seeing even better performance when also tuning the thread counts explicitly (see below).

This was on a dual-socket AMD Rome 48-core system (so 96 cores in total), in a Slurm job where 24 cores are available. So while the problem can be mitigated via $OMP_NUM_THREADS, that only holds for the configuration we build it in; there's a different threading model you can build TensorFlow with.
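For context, inside a Slurm job the cores actually usable by the process can be read from the CPU affinity mask, which is a saner default for $OMP_NUM_THREADS than the full node core count. A minimal sketch of that idea (my illustration, not something TensorFlow or EasyBuild does today):

```python
import os

# Number of cores this process may actually use; Slurm sets the CPU
# affinity mask of the job step, so on the system above this is 24,
# not the 96 hardware cores of the full node.
ncores = len(os.sched_getaffinity(0))

# Respect an explicit user setting; this must happen before TensorFlow
# (and thus the OpenMP runtime used by oneDNN) is initialized.
os.environ.setdefault('OMP_NUM_THREADS', str(ncores))

import tensorflow as tf  # import only after the env var is in place
```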
More useful info in tensorflow/tensorflow#29968. There's a small bit of Python code shared there to read out the thread count and context-switch counters from /proc/<pid>/status. I tweaked it a bit to only show the relevant info:

```python
import os

# Read this process' status from /proc, and print the thread count
# plus the voluntary/involuntary context-switch counters.
with open(os.path.join('/proc', str(os.getpid()), 'status')) as fp:
    lines = fp.read().split('\n')
print('\n'.join(x for x in lines if x.startswith('Threads') or 'ctxt' in x))
```

If you include this both at the top of the script and at the bottom, you can tell how many threads were started, and determine the number of context switches done. For me, in a 24-core Slurm job on an AMD Rome system:
So even with $OMP_NUM_THREADS=1 set, more threads are being started than there are cores available in the job...
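Wrapped as a small helper (my naming), the same check can be done at the start and end of a run to compare the counters:

```python
import os

def report_threads(label):
    """Print thread count and context-switch counters for this process."""
    with open('/proc/%d/status' % os.getpid()) as fp:
        for line in fp:
            if line.startswith('Threads') or 'ctxt' in line:
                print(label, line.strip())

report_threads('before:')
# ... run the actual TensorFlow workload here ...
report_threads('after:')
```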
To get full control over the number of threads, you can also use $TF_NUM_INTEROP_THREADS and $TF_NUM_INTRAOP_THREADS. There's a complex interplay going on between those 3 settings. The best result I obtained in my 24-core Slurm job on an AMD Rome system was with a specific combination of all three; that resulted in 22 threads, and gave the best timings of everything I tried. No doubt the best combo is heavily dependent on the script and the system configuration (which CPUs, # cores, etc.)...

So, to conclude: should we let the TensorFlow modules set $OMP_NUM_THREADS (if it isn't set already)?
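For completeness, the same knobs can also be set programmatically, before the first op runs; a sketch with example values, not the exact combination used above:

```python
import os

# Must be set before TensorFlow initializes its thread pools.
os.environ.setdefault('OMP_NUM_THREADS', '1')

import tensorflow as tf

# Programmatic equivalents of $TF_NUM_INTEROP_THREADS / $TF_NUM_INTRAOP_THREADS.
tf.config.threading.set_inter_op_parallelism_threads(2)
tf.config.threading.set_intra_op_parallelism_threads(11)

print('inter-op:', tf.config.threading.get_inter_op_parallelism_threads())
print('intra-op:', tf.config.threading.get_intra_op_parallelism_threads())
```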
Something along the lines of what we (HPC2N) already do for OpenBLAS: the module sets $OMP_NUM_THREADS if it is not already set.
Definitely not an expert, but even without tweaking $TF_NUM_INTEROP_THREADS and $TF_NUM_INTRAOP_THREADS, setting $OMP_NUM_THREADS=1 from the start already gives something that is at least on par with the standard pip package. Which is something an "end user" like me would at least like to achieve, even if there is further room for improvement depending on the underlying infrastructure.
The proper fix is probably to switch to building TensorFlow >= 2.3.0 with oneDNN using TensorFlow's own threadpool instead of OpenMP. Then there's no need anymore to hard-set $OMP_NUM_THREADS.

See https://software.intel.com/content/www/us/en/develop/articles/deep-learning-with-avx512-and-dl-boost.html#inpage-nav-4-4 and mainly tensorflow/tensorflow#47279 (where they switched to this by default for TensorFlow >= 2.5.0).

I'm sure @Flamefire has some input on this too. :)
I'd rather patch TF to set this somewhere, because setting this env var globally is likely not a good idea: many users simply load all their tools/modules and then use multiple ones successively.

Checking how the TF pip package is built, I found https://github.com/tensorflow/tensorflow/blob/v2.6.0/.bazelrc#L578. Following the links one can see that TF is built with AVX (not AVX2 or similar), which also explains the "Your CPU supports instructions that this TensorFlow binary was not compiled to use" message the pip package prints at startup. Further investigation to verify this leads to https://github.com/tensorflow/tensorflow/blob/v2.6.0/tensorflow/core/util/util.cc#L129
--> TF pip does not pass --config=mkl, while our builds do. The difference between those 2 is that the former build defines the MKL-related build flags, which enable the oneDNN code paths checked in util.cc. Maybe we should simply NOT build with the MKL config?

The pip version with the Python/3.9.5-GCCcore-10.3.0 module and the module version without the mkl config now both perform at about 900us/step.
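To check from Python whether a given TensorFlow binary was actually built with MKL/oneDNN support, there is an internal helper; note that this is not public API and its module path has moved between versions, so treat it as a debugging aid only (this is what works for TF 2.6 as far as I can tell):

```python
# Internal TensorFlow API; module path differs across TF versions.
from tensorflow.python.util import _pywrap_util_port

print('MKL/oneDNN enabled:', _pywrap_util_port.IsMklEnabled())
```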
If I build TF 2.6.0 with 'with_mkl_dnn': False, it fails with test failures like this on an AMD EPYC:
That was with EB 4.4.2's easyblock; the develop branch is identical at the moment.
That is bad. It looks like many actual failures, with failed comparisons and even generated NaNs.
Asked at the SIG Build group: https://groups.google.com/a/tensorflow.org/g/build/c/RZhgZst-fgQ

So what should be our way forward on this?

Drop the mkl config option?
The TensorFlow module @ 2.6.0-foss-2021a (at least) is significantly slower than its pip-installed counterpart. Several tests were done (see [1]); here is a summary.

All tests were done on the same machine, from a JupyterLab environment running in a container with Python 3.9.7. The container was restarted before every test, so the environment was fully reinitialized. The same version of TF (2.6.0) is used for module and pip. Full test code, results, as well as perf run results are available [2].
Test 1: basic TF python script (tensorflow_test.py)
Test 2: more demanding TF python script (addition_rnn.py)
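The full scripts and results are in [2]; for a rough idea, the "basic" test is along these lines (a sketch from my side, not the exact tensorflow_test.py):

```python
import time
import tensorflow as tf

# Small dense model on random data: per-step overhead dominates here,
# which is exactly where the threading configuration shows up.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

x = tf.random.normal((1024, 32))
y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)

start = time.time()
model.fit(x, y, epochs=5, batch_size=32, verbose=2)  # one line per epoch, with timing
print('total: %.2fs' % (time.time() - start))
```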
So clearly the TF module is really slow by default, but OMP_NUM_THREADS=1 brings it on par with, and even slightly ahead of, the pip version. Should that be considered as a default setting, or should the module be compiled differently?
[1] https://easybuild.slack.com/archives/C34UA1HT7/p1631561763481600
[2] tensorflow_benchmark.tar.gz