
Proposal for bit-reproducible numerical operations #12

Open · certik opened this issue Dec 17, 2019 · 16 comments

Labels: topic: algorithms (searching and sorting, merging, ...), topic: mathematics (linear algebra, sparse matrices, special functions, FFT, random numbers, statistics, ...)

Comments

@certik (Member) commented Dec 17, 2019

@marshallward wrote in #1 (comment):

I would like to see greater support for bit-reproducible numerical operations. This is a very high priority for us since our models are used in weather and climate forecasting, and much of our time is devoted to bit reproducibility of our numerical calculations.

A common problem is intrinsic Fortran reduction operations like sum(), where the order is ambiguous (deliberately, one might say), and therefore not reproducible. A more serious problem for us is transcendental functions, like exp() or cos(), which will give different results under different optimizations, and we typically cannot say which implementation was invoked (libm? vendor library? etc.).

A standard library may be a place to provide bit-reproducible implementations.
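
For concreteness, the sum() problem comes from floating-point addition not being associative: different traversal orders give different bits. A minimal sketch (an editorial illustration, not code from the thread):

    program sum_order
      use, intrinsic :: iso_fortran_env, only: dp => real64
      implicit none
      real(dp) :: a, b, c
      a = 1.0e16_dp
      b = -1.0e16_dp
      c = 1.0_dp
      ! Two groupings of the same three numbers:
      print *, (a + b) + c   ! prints 1.0: a and b cancel first
      print *, a + (b + c)   ! prints 0.0: c is absorbed into b
    end program sum_order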

@certik (Member, Author) commented Dec 17, 2019

@marshallward, can you elaborate on exactly what you mean by bit-reproducible numerical operations and why you need it?

Typically, even the most accurate libraries, such as https://openlibm.org/, do not actually guarantee correct rounding of, say, sin(x) for all x. When I tested it, it was typically rounded correctly to all bits for 75% of all x, but for the other 25% it gave the second-nearest floating-point number. So it is incredibly accurate, but it is not correctly rounded in 100% of cases.

I assume you do not need things to be correctly rounded, but rather want to get the exact same bits on all platforms? I think openlibm might be able to do that.

For my codes, I do not require bit reproducibility, as I don't think it's even possible: with MPI, BLAS, and LAPACK, the bits are not guaranteed, only the overall accuracy. Also, I like to use -ffast-math, which again does not guarantee bit reproducibility. Instead, I want performance and correct results (as verified by tests).

@marshallward commented:

That's right, we are primarily focused on reproducing the bits over a range of configurations, and at least for now we're not too concerned with the overall accuracy of the lowest bits. As an example, if we increase our optimization level, say from -O2 to -O3 in gcc, then intrinsics like exp() seem to be dispatched to different code and give slightly different results.

Ideally, we would like an implementation that we can control, which won't be implicitly chosen by the compiler.

You're right about the issues with other libraries. It's a bit off-topic and I won't go into detail, but for our MPI reduction operations we convert our floating-point values to an extended fixed-precision format to ensure that the result is invariant to ordering. We also aggressively apply parentheses in all of our expressions, and only use compilers that honor the parenthesis ordering.

Bit reproducibility is extremely challenging, but we have managed to achieve it in our climate models for many years. At the very least, it has vastly improved the quality of our test suites.

The issues around transcendentals appear to be a "final frontier" for us, but so far we have been reluctant to abandon the Fortran intrinsics.

I am not familiar with OpenLibm, but I will look into it more. I'm also not sure if something like this belongs in a standard library, so I am hoping for more feedback on this issue.

@certik (Member, Author) commented Dec 17, 2019

I am just trying to understand your use case better, so that we can later discuss where this belongs.

For MPI, do you take a double precision number, convert it to quadruple precision, accumulate in parallel in arbitrary order, and convert back to double?

I wasn't aware you could even get bit reproducibility for a complicated parallel code. How do you handle cases where the hardware itself gives different results for the same binary? That has happened to me once (some nodes on the HPC cluster gave slightly different results than other nodes with the same binary and operating system). My solution was to ensure the code is robust against such issues.

@marshallward commented Dec 17, 2019

The MPI summation uses an extended fixed-precision method, which basically converts the floats to tuples of integers, each value denoting a fixed precision over a selected range. MPI operations are then applied to the integer data, which is reproducible. It is described in this paper and this source code, but I would guess that it's very similar to other extended-precision implementations.
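
A drastically simplified sketch of that idea (an editorial illustration, not the implementation from the paper; names like reproducible_sum and the scale factor are made up for this sketch):

    subroutine reproducible_sum(local, global, comm)
      ! Toy version of an order-invariant parallel sum: convert each
      ! double to a 64-bit integer at one fixed scale, reduce the
      ! integers (integer addition is associative, so the reduction
      ! order cannot change the result), then convert back.  The real
      ! method spreads each value over several integer bins to cover
      ! the full exponent range and guard against overflow; this toy
      ! assumes the values and their total fit the chosen scale.
      use, intrinsic :: iso_fortran_env, only: dp => real64, i8 => int64
      use mpi
      implicit none
      real(dp), intent(in)  :: local
      real(dp), intent(out) :: global
      integer, intent(in)   :: comm
      integer(i8), parameter :: iscale = 2_i8**40   ! hypothetical fixed scale
      integer(i8) :: ilocal, iglobal
      integer :: ierr
      ilocal = nint(local * real(iscale, dp), i8)
      call MPI_Allreduce(ilocal, iglobal, 1, MPI_INTEGER8, MPI_SUM, comm, ierr)
      global = real(iglobal, dp) / real(iscale, dp)
    end subroutine reproducible_sum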

We have the luxury of solving hydrostatic fluid dynamics equations - hyperbolic PDEs, with no elliptic pressure solvers - so summations and other collectives are very rare and are primarily used for diagnostics like the total mass and energy of the system, which are computed infrequently, so we can overlook issues like performance of collectives. Others may not be so lucky.

It's hard to summarize everything required to achieve bit reproducibility, but careful ordering of arithmetic using parentheses is a big part. There are other rules, such as restricting the number of divisions in an expression (usually to at most one). We also do very aggressive testing of initialization and memory accesses, so we never see stray values in our memory (or so we hope). We also never commit any changes to our codebase that modify answers, unless it is done intentionally (e.g. a bug fix).
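
To illustrate those two rules (a sketch with made-up variable names): the Fortran standard obliges the processor to respect the integrity of parentheses, so explicit grouping pins down one evaluation tree, and a reciprocal can be computed once and reused to keep an expression to a single division.

    program paren_rules
      use, intrinsic :: iso_fortran_env, only: dp => real64
      implicit none
      real(dp) :: a, b, c, d, s, rdz, u, flux_x, flux_y, dz
      a = 0.1_dp; b = 0.2_dp; c = 0.3_dp; d = 0.4_dp
      flux_x = 1.0_dp; flux_y = 2.0_dp; dz = 3.0_dp
      ! Without parentheses the compiler may associate a+b+c+d in any
      ! order; with them it must evaluate left to right as written
      ! (absent flags like -ffast-math).
      s = ((a + b) + c) + d
      ! One division, computed once and reused, instead of two:
      rdz = 1.0_dp / dz
      u = (flux_x * rdz) + (flux_y * rdz)
      print *, s, u
    end program paren_rules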

The ability to reproduce such runs is also part of the tender process, so that may prevent some of the issues you have seen. But there have also been more exotic problems. In one case - before my time with this group - there was a tridiagonal solver which failed to reproduce; it was tracked down to anomalous voltage noise in the CPU, which the vendor acknowledged and had to fix.

I guess the short answer to your question is that due to a combination of software practices and the hardware tender process, we do not see variability across nodes.

@certik (Member, Author) commented Dec 17, 2019

Thanks for the details. I had no idea this was possible. I suspect there may be performance hits in some cases, so your method might not be suitable for every code, even if it works for you.

Regarding bit-reproducible functions, let me know if openlibm does what you want. Then we can consider whether Fortran stdlib should have some functionality for this.

@milancurcic (Member) commented:

I think there are a few different problems conflated here:

  1. Being bit-reproducible between consecutive executions of the same binary;
  2. Being bit-reproducible between optimization levels with the same compiler and platform;
  3. Being bit-reproducible between consecutive executions of the same binary that has a parallel reduction operation, on the same system.

I agree that 1) is crucial and stdlib should have it as a requirement.

I think 2) is out the door. Higher optimization flags mean "generate slightly less accurate results faster". The compiler generates different code. Am I misunderstanding this?

3) is common and known, and I think Hallberg & Adcroft's solution is elegant. I have this challenge in my work as well, so I am interested in it.

So back to the original post, it seems to me it asks for transcendental function implementations that are guaranteed to be cross-platform bit reproducible (item 1). Am I correct or still not understanding? If yes, I agree this would be useful (though specialized), but a serious technical challenge -- I guess it would need multi-platform assembly implementations. If so, then I'd say it's better suited to a specialized library than to a stdlib.

It's also quite possible I'm not understanding this well so please correct me :).

@marshallward commented Dec 18, 2019

The underlying issue is that if we invoke a transcendental intrinsic, then we really just don't know where it is coming from. For example, we once had our system math library change without our knowledge (possibly libm, but we cannot say for sure) after which we could no longer reproduce the original input fields for selected runs. Currently, the Fortran intrinsics are not a safe option.

The optimization issue is secondary, but also of concern for us. If I compile a Fortran application with -O2 and -O3, then my exp() calls will give different answers, even though I would have thought these were compiled elsewhere and not subject to the optimization of my application code. I find it troubling that intrinsics can change in response to user code optimization.

I am less concerned about a particular implementation changing answers if the math library itself is compiled with different settings, which is what I think you are describing in 2).

While we could just force ourselves to use certain math libraries, such as OpenLibm, I would really like for this to be more controllable within the language itself. I'm still not sure if this is the place to be having this discussion, but it would be nice if there was a safe and "standard" way to address it.

BTW I wanted to say that it was #10 which prompted this discussion for me. Mostly I was just wondering if there was any aspiration to provide bit reproducibility for users in certain controlled situations. Transcendental functions are a particular issue for us, but I think that bit reproducibility is a broader question that is worth considering.

@certik (Member, Author) commented Dec 18, 2019

@marshallward the reason exp and other functions change with optimization level is that if you are willing to give up a tiny bit of accuracy (I am, but you are not), then you can get a much faster implementation.

As such, it seems that bit-reproducible versions of every function would need to be separate functions. Say linalg provides an eig that just uses LAPACK and BLAS and thus is not bit-reproducible, and then we could have linalg_bit with the bit-reproducible versions.
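
A sketch of what that split could look like (module and procedure names here are illustrative only, not an agreed stdlib API):

    module linalg_bit
      ! Hypothetical companion to linalg: the same public interface,
      ! but implemented with a fixed operation order instead of
      ! deferring to BLAS/LAPACK, trading speed for identical bits.
      implicit none
      private
      public :: eig   ! same name and signature as linalg's eig
    contains
      subroutine eig(a, lambda)
        real, intent(in)  :: a(:, :)
        real, intent(out) :: lambda(:)
        ! A real version would implement a deterministic eigensolver
        ! here; this stub only marks where it would live.
        lambda = 0.0
      end subroutine eig
    end module linalg_bit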

@marshallward commented:

@certik True, though it's not necessarily accuracy that matters in our case. I just want the same numbers!

@milancurcic (Member) commented:

The underlying issue is that if we invoke a transcendental intrinsic, then we really just don't know where it is coming from. For example, we once had our system math library change without our knowledge (possibly libm, but we cannot say for sure) after which we could no longer reproduce the original input fields for selected runs. Currently, the Fortran intrinsics are not a safe option.

Okay, thanks, I understand better now. I've had the same issue and I agree it's unsettling. Specific to this issue, I think this is the scenario: I have a Fortran program executable that is dynamically linked to the libraries on the system.

ldd my_fortran_executable
	linux-vdso.so.1 (0x00007ffc95b68000)
	libgfortran.so.5 => /lib64/libgfortran.so.5 (0x00007f6ebb0a6000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f6ebad12000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f6ebaafa000)
	libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00007f6eba8ba000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f6eba4fc000)
	libz.so.1 => /lib64/libz.so.1 (0x00007f6eba2e5000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f6ebb520000)

This executable works well and produces the same output given the same input. However, there may come a system update that upgrades libm.so.6 to a new version that is still ABI-compatible. The executable still works but now produces slightly (or, in the case of simulating chaotic flows, not so slightly) different results.

The proposal here is to add a mirrored set of transcendental functions that would be expressed in stdlib using only intrinsic arithmetic operators and parentheses to enforce order of operations. Then, you always know what function you're calling.
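
A toy sketch of what one such mirrored function might look like (illustration only: the name exp_small is made up, and a real implementation would need argument reduction and a properly fitted minimax polynomial, not this truncated Taylor series):

    pure function exp_small(x) result(y)
      ! Degree-6 Taylor polynomial of exp(x) in Horner form, valid
      ! only for small |x|.  Each statement is one fully parenthesized
      ! step, so the evaluation order (and hence the bits) is fixed on
      ! any compiler that honors parentheses.
      use, intrinsic :: iso_fortran_env, only: dp => real64
      implicit none
      real(dp), intent(in) :: x
      real(dp) :: y
      real(dp), parameter :: c2 = 1.0_dp/2.0_dp,  c3 = 1.0_dp/6.0_dp,   &
                             c4 = 1.0_dp/24.0_dp, c5 = 1.0_dp/120.0_dp, &
                             c6 = 1.0_dp/720.0_dp
      y = c5 + (x * c6)
      y = c4 + (x * y)
      y = c3 + (x * y)
      y = c2 + (x * y)
      y = 1.0_dp + (x * y)
      y = 1.0_dp + (x * y)
    end function exp_small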

A few questions come up for me now:

  1. Even with this approach, can we be sure that the compiler will still produce the same code between -O0 and -O3? I think the compiler has full rein to generate any binary code it wants. Again, when you do -O3 you're kinda telling it "give me different results, but faster". It still seems like a strange requirement or expectation that different optimization levels be bit-reproducible. I don't think they're supposed to be.
  2. Let's say we figure out item 1; this still seems like a rather advanced and specialized application. Wouldn't the MOM team have some specialized library in place already? If not, then this is likely a hard research problem, which I think sets a high bar for a stdlib implementation.

As an aside, a well-known problem in numerical weather prediction is that if you recompile the same WRF code (or another weather model) with a new compiler version (say, ifort-18 -> ifort-19), you can get the simulated hurricane to make landfall in a completely different US state.

@certik (Member, Author) commented Dec 18, 2019

It's an old debate regarding rearrangement. Theoretically, unless you use -ffast-math (which I do), the compiler should honor parentheses (I think even with -O3). My understanding is that in Fortran, unlike C, the compiler is free to rearrange the order of operations unless you put parentheses in; with parentheses it must preserve the order, except under -ffast-math, where it is again free to rearrange completely.

That's in theory. In practice, there are all kinds of issues, but apparently @marshallward was able to resolve most of them, which is amazing.

@marshallward do you agree that if you want bit reproducibility, you might not always get the best performance, compared to when you are willing to sacrifice it?

@marshallward commented Dec 18, 2019

@milancurcic You've captured the main issue, which is an implicit link to some libm.so.* binary (or a custom library, I suspect, in other cases). We have had situations where there was an ABI-compatible update to libm (or equivalent) which changed answers.

FWIW I am less worried about -O flags changing answers in our own code. I do more or less accept this tradeoff, or at least the responsibility. But it would be nice if the external functions were not affected by those flags.

The MOM team does not yet have custom transcendentals, but we have started to prototype them in Python preprocessing scripts, and may find ourselves using them at some point. I suppose that I am starting to send out feelers to see if something like stdlib is a place to hold such functions.

@certik I don't want to take credit for this work, it is really due to my colleagues at GFDL - Robert Hallberg in particular - who have refined these methods. As a relatively new member of the group, I am more of a student than a practitioner.

The methods do rely on rigid enforcement of parentheses, as you correctly point out. I haven't yet perused the language standard to see what it says about the matter.

As for your question: no, we would not expect such functions or methods to be highly performant. I think everyone understands that there is some computational cost associated with reproducibility.

@certik (Member, Author) commented Dec 18, 2019

As for your question: no, we would not expect such functions or methods to be highly performant. I think everyone understands that there is some computational cost associated with reproducibility.

I see, thanks. In that case, it seems that bit-reproducible capabilities should be implemented in addition to high-performance (but not strictly bit-reproducible) capabilities, not instead of them.

@warrickball commented Dec 15, 2020

I thought I'd emerge from the shadows to voice my support for this proposal. Our specific use case is running the same code on different architectures with different compilers and still obtaining identical results. This requires a variety of compiler flags and code-style rules, but we get correctly rounded (towards zero, IIRC) mathematical functions from crlibm, which is mirrored on GitHub here. We now routinely test that our outputs are identical when the code is built with gfortran on (Intel) Macs, gfortran on Linux (Intel and AMD), ifort on Linux (Intel) and, added recently, gfortran on a Raspberry Pi 4.
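
For reference, crlibm's C entry points can be reached from Fortran with a small interface block. A sketch (it assumes a crlibm build exposing exp_rz and linking with -lcrlibm; check crlibm.h for the exact set of functions and rounding suffixes, and note that crlibm_init() may need to be called first on some platforms):

    module crlibm_iface
      use, intrinsic :: iso_c_binding, only: c_double
      implicit none
      interface
        ! Assumed C prototype: double exp_rz(double);
        ! exp() correctly rounded toward zero.
        function exp_rz(x) result(y) bind(c, name="exp_rz")
          import :: c_double
          real(c_double), value :: x
          real(c_double) :: y
        end function exp_rz
      end interface
    end module crlibm_iface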

@marshallward commented:

@warrickball That is encouraging news. I also supported the idea of testing with a prescribed libm. However, there is the deeper problem that the compiler is under no obligation to use libm and is free to use its own code. For example, I think Intel strongly defers to libimf.a, although I believe this can be overridden. Your own experience seems to confirm that it is possible.

We are moving towards implementing our own transcendentals, with a similar goal of platform-independent results, but we still have some ways to go. (The transcendentals are working, but there are some newly discovered problems...)

@arjenmarkus (Member) commented Dec 15, 2020 via email
