Stop disabling AVX-512 in `march=native` compiler flag setup #843

tersec · 2020-03-31T10:40:55Z

Not until at least May or June, and maybe later if some places (especially importantly, CI services/images) keep using gcc longer, so that more conservative distributions and other gcc distributors can catch up, but https://github.com/status-im/nim-beacon-chain/blob/daabb1b5b29065f502ae24c91b9ca231aad58446/config.nims#L30-L33
should be removed since the bug was marked RESOLVED FIXED on 2020-02-14 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782#c11 states:

Fixed on the trunk and all release branches.

Presumably meaning gcc 9.x as well as 10.x.

There's no great gain in automatic compiler AVX512 per se for Nimbus, just simpler build configuration with fewer things to explain/understand, so the extent this breaks anything or represents more than negligible risk, it should not be changed.

The text was updated successfully, but these errors were encountered:

arnetheduck · 2020-03-31T11:22:50Z

do we have any data saying that -march=native is measurably useful in our app?

tersec · 2020-03-31T12:03:11Z

do we have any data saying that -march=native is measurably useful in our app?

On my setup just now, a quick test on an i7-8550U running 64-bit Debian, the total time from running five mainnet all_fixtures_require_ssz in a row:

$ cat foo.sh 
for i in $(seq 5); do build/all_fixtures_require_ssz; done

Where that was built with:

./env.sh nim c --out:./build/all_fixtures_require_ssz -d:chronicles_log_level=INFO -d:const_preset=mainnet --verbosity:0 --hints:off --warnings:off -d:usePcreHeader --passL:"-lpcre" -d:release --import:libbacktrace tests/official/all_fixtures_require_ssz.nim

Both with and without, from config.nims:

# This helps especially for 32-bit x86, which sans SSE2 and newer instructions
# requires quite roundabout code generation for cryptography, and other 64-bit      
# and larger arithmetic use cases, along with register starvation issues. When
# engineering a more portable binary release, this should be tweaked but still
# use at least -msse2 or -msse3.                                              
if defined(disableMarchNative):                                                                                                 
  switch("passC", "-msse3")                                                   
else:                                                                           
  switch("passC", "-march=native")
  if defined(windows):                                                                                                   
    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
    # ("-fno-asynchronous-unwind-tables" breaks Nim's exception raising, sometimes)
    switch("passC", "-mno-avx512vl")

Ran in:

nim-beacon-chain$ time bash foo.sh > /dev/null 

real	2m23.122s
user	2m22.909s
sys	0m0.196s

With march=native and

$ time bash foo.sh > /dev/null 

real	2m41.632s
user	2m41.405s
sys	0m0.197s

Without it.

It's say an 11% performance increase (or, alternatively, a 13% performance decrease were it removed) is worth a compile flag and a few lines in config.nims.

Furthermore, this is on x86-64, whic is already reasonably friendly to 64-bit math in its baseline configuration. 32-bit x86 definitely is not, so I'd expect a bigger performance delta there.

mratsim · 2020-04-02T10:16:28Z

I'm surprised by the benchmark:

-march=native is helpful for the following things:

vectorized floating-point math which we don't use
vectorized integer math which we don't use. Technically since at a low-level Milagro uses a carry-less addition implementation it can use vectorization but BLS12-381 is 7 limbs not sure if this is worth it to vectorize (i.e. it's not an array of 1000 elements).
Things like mulx, adcx, adox, the last two being unsupported by GCC/LLVM and the first one not really useful with carryless implementation
LZCNT, TZCNT and other bit operations from BMI / BMI2 instructions

i.e. I don't see how it helps our codebase which is scalar integer everywhere with very few bit manipulation.

Case in point, in my own pure Nim crypto benchmark -march=native did absolutely nothing but using LLVM instead of GCC improved speed by over 80%: mratsim/constantine#17

mratsim · 2020-09-09T14:19:32Z

With the inclusion of BLST -march=native will lead to 20% faster crypto on CPU from 2015 onwards via mulx/adox/adcx instructions: status-im/nim-blst#1

tersec · 2020-09-09T15:36:26Z

Useful information to summarize (in a less technical way) for system requirements, the distinction between platforms which need Milagro, those which use BLST without mulx/adox/adcx, and those which use BLST with mulx/adox/adcx.

It looks like https://en.wikipedia.org/wiki/Intel_ADX (adcx, adox) implies https://en.wikipedia.org/wiki/BMI2 (mulx), so Broadwell or newer on Intel and Ryzen or newer for AMD. The former should be pretty widespread, but the latter is still being deployed, so it's notable enough to point out that for AMD CPUs, pre-Zen is notably slower.

Is BMI2-only (mulx but not adcx/adox) used by BLST? That'd only push Intel requirements back on generation to Haswell, but AMD years earlier.

mratsim · 2020-09-10T20:18:01Z

BLST requires both or SSE3 (for SHA256)

mratsim · 2020-09-16T08:13:31Z

@sinkingsugar had issues with https://github.com/eth2-clients/multinet as the machine that cross-compiled the docker pre-build of NBC was AVX512 aware and so it didn't run on his machine (if I understood correctly).

Also BLST will not require SSE3 soon: status-im/nim-blscurve#84

tersec · 2020-09-16T08:56:02Z

#1664 implements one approach to this. It means that the default march=native builds still work across a broad range of AMD and Intel CPUs from the last several years. When we start providing end-user binary builds, we'll have to revisit this, but it's a reasonable middle ground not requiring @sinkingsugar to disable march=native entirely.

Longer term, I think this issue still specifies the most correct action: march=native means march=native, when the initially triggering gcc bug was closed 7 months ago.

More immediately, I'm not convinced AVX-512 is useful enough in terms of ((auto-vectorization utility)(fraction non-vectorized code)+(utility)(vectorized code) when available)*(availability) to require multinet to use its own compile settings.

mratsim · 2020-09-16T09:03:12Z

I'm not convinced AVX-512 is useful enough in terms of ((auto-vectorization utility)(fraction non-vectorized code)+(utility)(vectorized code) when available)*(availability) to require multinet to use its own compile settings.

You need to have loop with 8x uint64 or 16x uint32 (or floats but we don't use float) for it to be interesting. And it needs sustained operations as in current CPU using AVX512 will downclock CPU frequency by 30%. Furthermore, AVX512 support is a mess flame/blis#352 (comment) because some CPUs have only 1 AVX512 unit per core (Xeon Bronze and some Silver) and some have 2 (Skylake X, some Xeon Silver and Xeon Gold). When you only have 1 unit, it's almost always better to use plain AVX without downclocking.

For our use-case this all those preconditions are only met for SHA256, and we happen to have no AVX512 enabled SHA256 code.

stefantalpalaru · 2020-09-16T12:21:09Z

@sinkingsugar had issues with https://github.com/eth2-clients/multinet as the machine that cross-compiled the docker pre-build of NBC was AVX512 aware and so it didn't run on his machine (if I understood correctly).

Not necessarily AVX512. You can't compile with -march=native on one CPU family and expect it to run on another, due to any of a long list of CPU features present in the former and missing in the latter, but that's already known and worked around since we started pushing binaries in a Docker image: https://github.com/status-im/nim-beacon-chain/blob/52548f079b9ed3d6ad19e9b47a829e9df28a793d/docker/Makefile#L8-L9

sinkingsugar · 2020-09-16T13:00:54Z

@stefantalpalaru is right.
To solve this issue in my c++ projects I generally pick a favor that is good and broadly supported. For example march=sandybridge.

So I suggest to make a pick on a flavor that fits our targets ( when we release binaries )

https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html

mratsim · 2020-10-06T11:47:04Z

I think we're fine for now and for Docker/cross-compile builds we have the #766

mratsim closed this as completed Oct 6, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stop disabling AVX-512 in `march=native` compiler flag setup #843

Stop disabling AVX-512 in `march=native` compiler flag setup #843

tersec commented Mar 31, 2020

arnetheduck commented Mar 31, 2020

tersec commented Mar 31, 2020

mratsim commented Apr 2, 2020 •

edited

Loading

mratsim commented Sep 9, 2020

tersec commented Sep 9, 2020 •

edited

Loading

mratsim commented Sep 10, 2020

mratsim commented Sep 16, 2020

tersec commented Sep 16, 2020 •

edited

Loading

mratsim commented Sep 16, 2020 •

edited

Loading

stefantalpalaru commented Sep 16, 2020

sinkingsugar commented Sep 16, 2020 •

edited

Loading

mratsim commented Oct 6, 2020

Stop disabling AVX-512 in march=native compiler flag setup #843

Stop disabling AVX-512 in march=native compiler flag setup #843

Comments

tersec commented Mar 31, 2020

arnetheduck commented Mar 31, 2020

tersec commented Mar 31, 2020

mratsim commented Apr 2, 2020 • edited Loading

mratsim commented Sep 9, 2020

tersec commented Sep 9, 2020 • edited Loading

mratsim commented Sep 10, 2020

mratsim commented Sep 16, 2020

tersec commented Sep 16, 2020 • edited Loading

mratsim commented Sep 16, 2020 • edited Loading

stefantalpalaru commented Sep 16, 2020

sinkingsugar commented Sep 16, 2020 • edited Loading

mratsim commented Oct 6, 2020

Stop disabling AVX-512 in `march=native` compiler flag setup #843

Stop disabling AVX-512 in `march=native` compiler flag setup #843

mratsim commented Apr 2, 2020 •

edited

Loading

tersec commented Sep 9, 2020 •

edited

Loading

tersec commented Sep 16, 2020 •

edited

Loading

mratsim commented Sep 16, 2020 •

edited

Loading

sinkingsugar commented Sep 16, 2020 •

edited

Loading