disable use of -ftree-vectorize for OpenFOAM v2112 with foss/2021b #15495
Conversation
Test report by @Micket
Test report by @jfgrimm
I'm Håkan Nilsson, referred to by Mikael (@Micket). I am preparing the different tests and my post-docs are running them. Today I also compiled OpenFOAM-v2112 the standard way, and with the flags used by EasyBuild, at the cluster Tetralith at nsc.liu.se. I have asked my colleagues to run the test there as well, just to confirm that the problem does not only appear at our cluster Vera. It is the weekend, so I expect them to do it on Monday. I would also like to test on Ubuntu, since OpenFOAM is commonly used on that platform.

As seen in our tests, there was no problem with the EasyBuild installations for two previous versions of OpenFOAM. Perhaps there is some detail in v2112 that triggers this particular problem. I guess the tests by the developers at ESI are not done with the EasyBuild configuration, so it was not seen. Also, the simulations do not fail but just give incorrect results, and I am not sure they would catch this error. My intention is to open an issue in the OpenFOAM bug reporting system, but first I would like to do a few more tests.

We are also working on a new TestHarness for OpenFOAM, see https://sourceforge.net/p/turbowg/TestHarness/ci/master/tree/. Our experiences from this problem will feed into that design, so that we can hopefully identify such problems before releases in the future. We will set up an open case to test this particular issue. The one we use right now is an assignment in a course, so we don't want to release it (since we would then have to construct a new assignment). However, not all combinations of compiler flags can be tested in a test harness. The EasyBuild combination should probably be part of a test loop, though, since I assume there is a reason why the EasyBuild flags were chosen to differ from the ones used in the original release(?)
There are so many combinations of compilers (GCC, Clang, Intel, Cray), each with default optimization flags that change from version to version, combined with just as many CPU architectures, that there really is no alternative to shipping a test suite that can be run by the person building the software. What upstream can do is make sure it builds cleanly without warnings, and then possibly complement that with static analysis or putting the test suite through something like ASan (-fsanitize=address) for extra insurance.
Well, to be blunt, the default flags OF specifies aren't good. The difference between -O2 and -O3 is more often than not negligible, so EB is actually making the conservative choice.
Indeed, but speculating anyway, it's possible that […]. In any case, from GCC 12, -O2 enables vectorization by default (with the very-cheap cost model).
We have now done the tests at Tetralith, at nsc.liu.se (https://www.nsc.liu.se/systems/tetralith/). My own standard OpenFOAM installation with -O3 (using buildtool-easybuild/4.5.3-nsce8837e7 foss/2020b): […]

I have asked my post-doc to do similar tests on Ubuntu as well. It probably needs to fail on Ubuntu for the developers to look at it, since they need to be able to reproduce the problem; they can't look at it if they need access to a particular cluster. Any input on why this may happen is gratefully accepted.
The test has now also been done in Ubuntu 18.04.6 LTS.
Although I am now out of my comfort zone, I am attaching a comment from a software engineer collaboration partner of mine, which you may also comment on:

"-ftree-vectorize" is in fact an alias that enables both -ftree-loop-vectorize and -ftree-slp-vectorize. When requesting the -O2 optimization level, those switches toggle the "cost model used for vectorization" from "very-cheap" to "cheap" (see the gcc file ./gcc/opts.cc). So basically, it looks like the -ftree-loop-vectorize and -ftree-slp-vectorize options only make a difference when using -O2. There is a man page entry that briefly discusses those models.

In the source code for gcc there is not much about VECT_COST_MODEL_DYNAMIC when searching for that string. The compiler only distinguishes whether the model's enum value is "smaller" or "greater" than VECT_COST_MODEL_CHEAP (those are enums in the source code); that's about it.
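To make those flag interactions concrete, here is a minimal sketch (the kernel and flag selection are illustrative, gcc is assumed to be on PATH, and the exact report text varies by GCC version) that compiles a small loop under the flag combinations discussed above and prints GCC's vectorizer report:

```python
import os
import subprocess
import tempfile
import textwrap

# A trivial, easily vectorizable loop for GCC to chew on.
KERNEL = textwrap.dedent("""\
    void saxpy(float *restrict y, const float *restrict x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }
    """)

# Flag sets discussed in this thread (illustrative selection).
FLAG_SETS = [
    ["-O2"],                      # no vectorization before GCC 12; very-cheap model from GCC 12
    ["-O2", "-ftree-vectorize"],  # the EasyBuild-style combination
    ["-O3"],                      # the upstream OpenFOAM default
]

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "kernel.c")
    with open(src, "w") as f:
        f.write(KERNEL)
    for flags in FLAG_SETS:
        # -fopt-info-vec makes GCC report (on stderr) which loops it vectorized.
        cmd = ["gcc", *flags, "-fopt-info-vec", "-c", src, "-o", os.devnull]
        result = subprocess.run(cmd, capture_output=True, text=True)
        report = result.stderr.strip() or "(no vectorization reported)"
        print(" ".join(flags), "->", report)
```

On a GCC 11 toolchain such as the one in foss/2021b, plain -O2 typically reports nothing while the other two report vectorized loops, which matches the cost-model description above.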
Following the notes in my previous post, I have now verified that all of the following PASS at Vera:
@drhakannilsson Does it produce correct results when built with -O0, -O1, and -O2 without extra flags? All three with and without -march=native. Those are fairly important checkpoints.
That would take some time to test, since that is 6 full compilations and my disk is full. If I could submit compilation of all of those at once it would be doable. @Micket - can I have access to more disk, so I can keep more installations at the same time? Right now I have to delete or tar installations if I want to test a new one. I am installing in C3SE 2021/3-1.

We have now started to set up the same tests using a standard tutorial, which can be shared. The aim is to include it in our TestHarness. If you like, I can share it as it is at the moment (although I would first like to add wallShearStress as well). I attach the figures it produces at the moment.

At Vera (C3SE), Reference is default flags on Ubuntu, of is default flags, eb is EasyBuild flags, and C3SE is EasyBuild flags but with -novectorize. At Tetralith (NSC), Reference is default flags on Ubuntu, of is default flags, eb is EasyBuild flags, and NSC is an Intel compilation. It is particularly interesting to zoom in on the V-profiles just after the backward-facing step: all results differ at Vera, while only the eb result differs at Tetralith.

We have also encountered another effect. This particular tutorial puts the sampling points exactly at the boundary of the computational domain. That is fine with the default flags, but not with the EasyBuild flags, and not even with the present EasyBuild installation at Vera with -novectorize. It is the same at Tetralith, where there is a problem with the EasyBuild flags but not with the original flags. In fact, the Intel installation also fails due to this, so this is a more severe problem that should definitely be reported upstream. What happens is that the code searches for the cell labels of the sampling points and gets stuck in an infinite loop. It is just strange that it depends on compiler flags. It can be worked around by making sure that the sampling points are not placed exactly at the boundary.
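For bookkeeping, the six builds requested above form a small matrix; a throwaway helper along these lines (purely illustrative, not part of the attached test procedure) enumerates the combinations so none are skipped:

```python
from itertools import product

# The six builds requested above: -O0/-O1/-O2, each with and without -march=native.
opt_levels = ["-O0", "-O1", "-O2"]
arch_flags = [None, "-march=native"]

for opt, arch in product(opt_levels, arch_flags):
    flags = " ".join(f for f in (opt, arch) if f)
    # One line per build; feed these to whatever build script or queue
    # submission is used, so all six can be submitted at once.
    print(flags)
```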
@boegelbot please test @ generoso
I'm fine with being a bit cautious here, and disabling -ftree-vectorize. Is there any way that we could run one of the OpenFOAM tutorial examples on a small number of cores (single-node) to catch this?
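One possible shape for such a smoke test is sketched below. This is only an illustration: it assumes a sourced OpenFOAM environment, uses the pitzDaily tutorial (a small single-core backward-facing-step case shipped with OpenFOAM), and the reference value and tolerance are hypothetical placeholders that would have to be recorded from a known-good build.

```python
import os
import re
import shutil
import subprocess
import tempfile

# Assumes an OpenFOAM environment is sourced, so $FOAM_TUTORIALS is set
# and blockMesh/simpleFoam are on PATH.
TUTORIAL = os.path.join(os.environ["FOAM_TUTORIALS"],
                        "incompressible", "simpleFoam", "pitzDaily")

def run_case():
    with tempfile.TemporaryDirectory() as tmp:
        case = os.path.join(tmp, "pitzDaily")
        shutil.copytree(TUTORIAL, case)
        subprocess.run(["blockMesh"], cwd=case, check=True, capture_output=True)
        log = subprocess.run(["simpleFoam"], cwd=case, check=True,
                             capture_output=True, text=True).stdout
        # Use the last reported cumulative continuity error as a cheap
        # stand-in for a real field comparison against reference data.
        errors = re.findall(r"cumulative = ([-0-9.eE+]+)", log)
        return float(errors[-1])

REFERENCE = -1.0e-4  # hypothetical, to be recorded from a known-good build
TOLERANCE = 1.0e-5   # hypothetical, would need tuning per case and quantity

value = run_case()
print("PASS" if abs(value - REFERENCE) < TOLERANCE else "FAIL", value)
```

A real check would compare sampled profiles rather than a scalar from the log, but even a scalar gate like this would have flagged the wrong-results builds discussed above.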
@boegel: Request for testing this PR well received on login1. PR test command: '…'

Test results coming soon (I hope)...

(notification for comment with ID 1135509542 processed - message to humans: this is just bookkeeping information for me)
Test report by @boegelbot
@boegel I did make an -O3 build that @drhakannilsson confirmed did not exhibit these errors, so we can also go for that. I don't know how much this matters performance-wise for OpenFOAM. I'm fine with either option.

@drhakannilsson I haven't followed up on your email yet; I'll just respond here. As this works on Tetralith (which is the same type of hardware as Vera), and the problem only seems to occur with a very particular combination of flags (and even then, not on all setups), I suspect we have reached the limit of what we can investigate by just playing around with compiler options. And even if the cause of this issue is found, as @boegel indicated, we welcome tests (with automatic comparison to correct values), as that is the only way to gain any confidence when building software. Heck, things like this could be compiler bugs!
@Micket I have run a number of simulations to test the computational speed. I report the settings and numbers below.

EasyBuild installation with -O3 -march=native -fno-math-errno -std=c++11 -fuse-ld=bfd:
ExecutionTime = 116.01 s ClockTime = 118 s

EasyBuild installation with -O2 -fno-tree-vectorize -march=native -fno-math-errno -std=c++11 -fuse-ld=bfd:
ExecutionTime = 134.11 s ClockTime = 136 s

We can see that -O3 seems to be slightly faster, and also some variation in speed. This simulation runs on a single core, submitted using -n 1 -c 2. Can the variation be due to other jobs running on the same node? It has been shown that OpenFOAM is sensitive to the local memory bus bandwidth. A further related question follows later.

I also present times from my own installations, since I want you to comment on why they are faster:

Standard OpenFOAM installation with -O3:
ExecutionTime = 92.73 s ClockTime = 95 s

Standard OpenFOAM installation with -O2 -ftree-vectorize -march=native -fno-math-errno -std=c++11 -fuse-ld=bfd:
ExecutionTime = 137.28 s ClockTime = 182 s

In all tests you can see quite a large variation, which may be due to interference from other users' jobs running on the same node, sharing the same memory bus (or something else?). I have also marked some tests with "(tail -f)", where I watched the log file with that command during the entire simulation. It is quite clear that this influences the computational speed A LOT. I did not expect that. Should I have?
@boegel, @Micket I attach the test procedure I am using, in case you are interested in having a look at it and perhaps using it (or improving it). There is a README file that describes how to use it. It basically picks up a standard OpenFOAM tutorial and makes a small modification to avoid a problem with probes exactly at the boundary (which will be reported upstream). It also adds an additional post-processing step so we can plot cf and cp. It is a single-core simulation that takes about 95-140 s (on Vera). Submit scripts for Vera and Tetralith are supplied, although not useful for everyone. Plotting is done with python3/Matplotlib. The Python script also reports some details in the terminal window, which is work in progress towards automatically reporting PASS/FAIL. If you know a good way to compare two curves and calculate a useful error in Python, please let me know; different scalings seem to be needed for the different curves/errors.

As mentioned before, we are working on more tests, which will be distributed through https://sourceforge.net/p/turbowg/TestHarness/ci/master/tree/, with reports at https://my.cdash.org/index.php?project=TurboWG
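On comparing two curves in Python: one common approach (a sketch only, not part of the attached procedure; the sample data here is synthetic) is to interpolate the test curve onto the reference abscissa and compute a relative L2 error, with a per-quantity tolerance:

```python
import numpy as np

def relative_l2_error(x_ref, y_ref, x_test, y_test):
    """Interpolate the test curve onto the reference abscissa (over the
    overlapping range) and return ||y_test - y_ref||_2 / ||y_ref||_2."""
    lo = max(x_ref.min(), x_test.min())
    hi = min(x_ref.max(), x_test.max())
    mask = (x_ref >= lo) & (x_ref <= hi)
    # np.interp requires x_test to be increasing.
    y_interp = np.interp(x_ref[mask], x_test, y_test)
    return np.linalg.norm(y_interp - y_ref[mask]) / np.linalg.norm(y_ref[mask])

# Hypothetical usage: two profiles sampled at different points.
x_ref = np.linspace(0.0, 1.0, 200)
y_ref = np.sin(2 * np.pi * x_ref)
x_test = np.linspace(0.0, 1.0, 150)
y_test = np.sin(2 * np.pi * x_test) + 1e-3 * np.random.default_rng(0).normal(size=150)

err = relative_l2_error(x_ref, y_ref, x_test, y_test)
print("PASS" if err < 1e-2 else "FAIL", f"(relative L2 error = {err:.3e})")
```

For quantities that hover around zero (such as V-profiles near the step), normalizing by the range of y_ref instead of its norm avoids inflated errors, which may be the per-curve scaling issue mentioned above.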
If the code is memory-bandwidth sensitive, then you must run it with a full NUMA node allocated, i.e. one CPU socket in this case.
This is a different topic from the original one of this thread, but since computational times were asked about... Our production simulations use hundreds of cores, and we always request full nodes so as not to interfere with others. A problem is that we interfere with ourselves, i.e. the processes of the same simulation fight for the same memory bus. This reduces the speed a lot. I have discussed this at SNIC Application Expert meetings and in other forums, but I have not seen any solution. For the particular speed tests done here, which submission flags would you propose? It is a sequential simulation. I read that Vera is set up for hyper-threading, but I am not sure whether that influences sequential jobs. I take it that --exclusive gives me an entire node, and not only an entire CPU socket?
Test report by @boegel
I'll go ahead and merge this, since it fixes an important problem... Whether it's really worth moving from -O2 to -O3 can be considered in a follow-up.
Going in, thanks @Micket!
If your problem is memory bound (and I wouldn't actually expect it to be for typical CFD? I see lots of simulations that manage to keep cores quite busy), then you simply need to spread the tasks across more CPUs and tell the submit script to place them appropriately (i.e. set the number of tasks per socket). Otherwise, it comes down to writing better cache-aware algorithms (perhaps a block-sparse matrix), or using smaller floats (and thus less memory to copy).
Correct, --exclusive makes the job exclusive on the entire node, and that is fine. What Åke means is to ensure we aren't sharing the NUMA node with anyone else. We can let the other socket go to waste while we are benchmarking, just to be sure.
(created using eb --new-pr)

No test suite, but some trusted users indicated that there were serious errors in the results when building this for our cluster as-is.

I think it's best to keep vectorization turned off. This might apply to older versions as well, but I don't know how to verify that.