Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop disabling AVX-512 in march=native compiler flag setup #843

Closed
tersec opened this issue Mar 31, 2020 · 12 comments
Closed

Stop disabling AVX-512 in march=native compiler flag setup #843

tersec opened this issue Mar 31, 2020 · 12 comments

Comments

@tersec
Copy link
Contributor

tersec commented Mar 31, 2020

Not until at least May or June, and maybe later if some places (especially importantly, CI services/images) keep using gcc longer, so that more conservative distributions and other gcc distributors can catch up, but https://github.com/status-im/nim-beacon-chain/blob/daabb1b5b29065f502ae24c91b9ca231aad58446/config.nims#L30-L33
should be removed since the bug was marked RESOLVED FIXED on 2020-02-14 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782#c11 states:

Fixed on the trunk and all release branches.

Presumably meaning gcc 9.x as well as 10.x.

There's no great gain in automatic compiler AVX512 per se for Nimbus, just simpler build configuration with fewer things to explain/understand, so the extent this breaks anything or represents more than negligible risk, it should not be changed.

@arnetheduck
Copy link
Member

do we have any data saying that -march=native is measurably useful in our app?

@tersec
Copy link
Contributor Author

tersec commented Mar 31, 2020

do we have any data saying that -march=native is measurably useful in our app?

On my setup just now, a quick test on an i7-8550U running 64-bit Debian, the total time from running five mainnet all_fixtures_require_ssz in a row:

$ cat foo.sh 
for i in $(seq 5); do build/all_fixtures_require_ssz; done

Where that was built with:

./env.sh nim c --out:./build/all_fixtures_require_ssz -d:chronicles_log_level=INFO -d:const_preset=mainnet --verbosity:0 --hints:off --warnings:off -d:usePcreHeader --passL:"-lpcre" -d:release --import:libbacktrace tests/official/all_fixtures_require_ssz.nim

Both with and without, from config.nims:

# This helps especially for 32-bit x86, which sans SSE2 and newer instructions
# requires quite roundabout code generation for cryptography, and other 64-bit      
# and larger arithmetic use cases, along with register starvation issues. When
# engineering a more portable binary release, this should be tweaked but still
# use at least -msse2 or -msse3.                                              
if defined(disableMarchNative):                                                                                                 
  switch("passC", "-msse3")                                                   
else:                                                                           
  switch("passC", "-march=native")
  if defined(windows):                                                                                                   
    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
    # ("-fno-asynchronous-unwind-tables" breaks Nim's exception raising, sometimes)
    switch("passC", "-mno-avx512vl")

Ran in:

nim-beacon-chain$ time bash foo.sh > /dev/null 

real	2m23.122s
user	2m22.909s
sys	0m0.196s

With march=native and

$ time bash foo.sh > /dev/null 

real	2m41.632s
user	2m41.405s
sys	0m0.197s

Without it.

It's say an 11% performance increase (or, alternatively, a 13% performance decrease were it removed) is worth a compile flag and a few lines in config.nims.

Furthermore, this is on x86-64, whic is already reasonably friendly to 64-bit math in its baseline configuration. 32-bit x86 definitely is not, so I'd expect a bigger performance delta there.

@mratsim
Copy link
Contributor

mratsim commented Apr 2, 2020

I'm surprised by the benchmark:

-march=native is helpful for the following things:

  • vectorized floating-point math which we don't use
  • vectorized integer math which we don't use. Technically since at a low-level Milagro uses a carry-less addition implementation it can use vectorization but BLS12-381 is 7 limbs not sure if this is worth it to vectorize (i.e. it's not an array of 1000 elements).
  • Things like mulx, adcx, adox, the last two being unsupported by GCC/LLVM and the first one not really useful with carryless implementation
  • LZCNT, TZCNT and other bit operations from BMI / BMI2 instructions

i.e. I don't see how it helps our codebase which is scalar integer everywhere with very few bit manipulation.

Case in point, in my own pure Nim crypto benchmark -march=native did absolutely nothing but using LLVM instead of GCC improved speed by over 80%: mratsim/constantine#17

@mratsim
Copy link
Contributor

mratsim commented Sep 9, 2020

With the inclusion of BLST -march=native will lead to 20% faster crypto on CPU from 2015 onwards via mulx/adox/adcx instructions: status-im/nim-blst#1

@tersec
Copy link
Contributor Author

tersec commented Sep 9, 2020

Useful information to summarize (in a less technical way) for system requirements, the distinction between platforms which need Milagro, those which use BLST without mulx/adox/adcx, and those which use BLST with mulx/adox/adcx.

It looks like https://en.wikipedia.org/wiki/Intel_ADX (adcx, adox) implies https://en.wikipedia.org/wiki/BMI2 (mulx), so Broadwell or newer on Intel and Ryzen or newer for AMD. The former should be pretty widespread, but the latter is still being deployed, so it's notable enough to point out that for AMD CPUs, pre-Zen is notably slower.

Is BMI2-only (mulx but not adcx/adox) used by BLST? That'd only push Intel requirements back on generation to Haswell, but AMD years earlier.

@mratsim
Copy link
Contributor

mratsim commented Sep 10, 2020

BLST requires both or SSE3 (for SHA256)

@mratsim
Copy link
Contributor

mratsim commented Sep 16, 2020

@sinkingsugar had issues with https://github.com/eth2-clients/multinet as the machine that cross-compiled the docker pre-build of NBC was AVX512 aware and so it didn't run on his machine (if I understood correctly).

Also BLST will not require SSE3 soon: status-im/nim-blscurve#84

@tersec
Copy link
Contributor Author

tersec commented Sep 16, 2020

#1664 implements one approach to this. It means that the default march=native builds still work across a broad range of AMD and Intel CPUs from the last several years. When we start providing end-user binary builds, we'll have to revisit this, but it's a reasonable middle ground not requiring @sinkingsugar to disable march=native entirely.

Longer term, I think this issue still specifies the most correct action: march=native means march=native, when the initially triggering gcc bug was closed 7 months ago.

More immediately, I'm not convinced AVX-512 is useful enough in terms of ((auto-vectorization utility)(fraction non-vectorized code)+(utility)(vectorized code) when available)*(availability) to require multinet to use its own compile settings.

@mratsim
Copy link
Contributor

mratsim commented Sep 16, 2020

I'm not convinced AVX-512 is useful enough in terms of ((auto-vectorization utility)(fraction non-vectorized code)+(utility)(vectorized code) when available)*(availability) to require multinet to use its own compile settings.

You need to have loop with 8x uint64 or 16x uint32 (or floats but we don't use float) for it to be interesting. And it needs sustained operations as in current CPU using AVX512 will downclock CPU frequency by 30%. Furthermore, AVX512 support is a mess flame/blis#352 (comment) because some CPUs have only 1 AVX512 unit per core (Xeon Bronze and some Silver) and some have 2 (Skylake X, some Xeon Silver and Xeon Gold). When you only have 1 unit, it's almost always better to use plain AVX without downclocking.

For our use-case this all those preconditions are only met for SHA256, and we happen to have no AVX512 enabled SHA256 code.

@stefantalpalaru
Copy link
Contributor

@sinkingsugar had issues with https://github.com/eth2-clients/multinet as the machine that cross-compiled the docker pre-build of NBC was AVX512 aware and so it didn't run on his machine (if I understood correctly).

Not necessarily AVX512. You can't compile with -march=native on one CPU family and expect it to run on another, due to any of a long list of CPU features present in the former and missing in the latter, but that's already known and worked around since we started pushing binaries in a Docker image: https://github.com/status-im/nim-beacon-chain/blob/52548f079b9ed3d6ad19e9b47a829e9df28a793d/docker/Makefile#L8-L9

@sinkingsugar
Copy link
Contributor

sinkingsugar commented Sep 16, 2020

@stefantalpalaru is right.
To solve this issue in my c++ projects I generally pick a favor that is good and broadly supported. For example march=sandybridge.

So I suggest to make a pick on a flavor that fits our targets ( when we release binaries )

https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html

@mratsim
Copy link
Contributor

mratsim commented Oct 6, 2020

I think we're fine for now and for Docker/cross-compile builds we have the #766

@mratsim mratsim closed this as completed Oct 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants