Switch default compiler to Clang #3341

gcp · 2021-02-08T19:52:26Z

I remember the Makefile or some documentation saying that Stockfish prefers gcc with the specific flags in the Makefile. In any case, currently the default is gcc: https://github.com/official-stockfish/Stockfish/blob/master/src/Makefile#L303

However, at least on my main workstation (a Haswell), Clang has been consistently faster for Stockfish now. This might've changed around the NNUE stuff, or due to recent Clang updates, not sure, but Clang is about 7-8% faster.

I wonder if:
a) Other people on other hardware can compare gcc9/10 vs clang11/12 and see what produces faster binaries to confirm I'm not imagining this or my hardware is the exception.
b) If it's faster for most people, whether we could consider switching the default? I think LLVM is used by a lot of stuff now so asking for Clang on Linux distros is not disruptive.
c) If it's faster on the TCEC/CCC hardware, maybe those should get different binaries?
d) If we should do some work to get a Clang compile going on Windows. This seems tricky with mingw, though, so maybe a non-mingw version is a first step.

gcp · 2021-02-08T19:52:42Z

As for (d), I only found #1358.

Torom · 2021-02-09T10:08:08Z

On my system a clang build is 2~3 % slower.

./stockfish-git compiler 
Stockfish 090221 by the Stockfish developers (see AUTHORS file)

Compiled by g++ (GNUC) 10.2.0 on Linux
Compilation settings include:  64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
__VERSION__ macro expands to: 10.2.0

./stockfish-git_clang compiler
Stockfish 090221 by the Stockfish developers (see AUTHORS file)

Compiled by clang++ 11.0.1 on Linux
Compilation settings include:  64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
__VERSION__ macro expands to: Clang 11.0.1

./pyshbench ../stockfish-git ../stockfish-git_clang 100
==================
base (...tockfish-git) =    2184064  +/- 6591
test (...sh-git_clang) =    2130868  +/- 3807
diff                   =     -53196  +/- 8002

speedup        = -0.0244
P(speedup > 0) =  0.0000

CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on

gcp · 2021-02-09T10:58:44Z

Result of 100 runs
==================
base (...ockfish_gcc9) =    1845945  +/- 14086
test (...fish_clang12) =    1867970  +/- 13434
diff                   =     +22024  +/- 4256

speedup        = +0.0119
P(speedup > 0) =  1.0000

CPU: 4 x Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz
Hyperthreading: on

I seem to have a local patch that changes the -flto=thin override for Clang in the Makefile back to to -flto. Thin-LTO was supposed to be an improvement, but if you look at the benchmarks http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html notice that it is actually slower for chess and go engines.

Does your result flip around with regular LTO?

gcp · 2021-02-09T11:34:20Z

On Zen 1 I see no difference:

Result of 100 runs
==================
base (...ockfish_gcc9) =    1511804  +/- 3785
test (...ang12fulllto) =    1511122  +/- 3790
diff                   =       -682  +/- 4658

speedup        = -0.0005
P(speedup > 0) =  0.3872

CPU: 8 x AMD Ryzen 7 1700 Eight-Core Processor
Hyperthreading: on

But also here regular LTO is faster:

Result of 100 runs
==================
base (...ang12thinlto) =    1515230  +/- 2664
test (...ang12fulllto) =    1526217  +/- 2881
diff                   =     +10987  +/- 3202

speedup        = +0.0073
P(speedup > 0) =  1.0000

CPU: 8 x AMD Ryzen 7 1700 Eight-Core Processor
Hyperthreading: on

Torom · 2021-02-09T11:41:05Z

-flto is still slower than gcc:

./pyshbench ../stockfish-git ../stockfish-git_clang_flto 100
==================
base (...tockfish-git) =    2186091  +/- 5677
test (...t_clang_flto) =    2162191  +/- 3980
diff                   =     -23901  +/- 7884

speedup        = -0.0109
P(speedup > 0) =  0.0000

CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on

but definitely faster than -flto=thin:

./pyshbench ../stockfish-git_clang ../stockfish-git_clang_flto 100
==================
base (...sh-git_clang) =    2121055  +/- 5148
test (...t_clang_flto) =    2161406  +/- 4471
diff                   =     +40351  +/- 7771

speedup        = +0.0190
P(speedup > 0) =  1.0000

CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on

gcp · 2021-02-09T12:45:02Z

Ivy Bridge, gcc 10 vs clang-11, clearly favors gcc

Result of 100 runs
==================
base (...ckfish_gcc10) =    1769288  +/- 464
test (...fish_clang11) =    1701570  +/- 400
diff                   =     -67718  +/- 583

speedup        = -0.0383
P(speedup > 0) =  0.0000

CPU: 4 x Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
Hyperthreading: on

gcp · 2021-02-09T13:02:20Z

Okay, I guess my main desktop is actually the exception here, and for most machines gcc is as good or may be significantly better.

I'll prepare a pull request to fix the LTO default, I think we've confirmed that's a universal gain.

The default Clang configuration uses ThinLTO instead of the standard LTO. When published, benchmarks already showed this to be inferior for chess engines: http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html Benchmarking with current Clang 12 shows that the above result still holds and ThinLTO is a pessimization, see issue official-stockfish#3341. The commit that added ThinLTO did not contain any benchmarking. While ThinLTO may compile faster, that's not a worthwhile tradeoff for a chess engine. With ThinLTO disabled, Clang 12 will outperform GCC 9 and 10 for some architectures, e.g. Intel Haswell.

Torom · 2021-02-09T13:52:19Z

One more -flto vs. -flto=thin clang10:

./pyshbench ../stockfish_clang_thin ../stockfish_clang_lto 100
==================
base (...h_clang_thin) =    2052135  +/- 501
test (...sh_clang_lto) =    2073530  +/- 629
diff                   =     +21395  +/- 704

speedup        = +0.0104
P(speedup > 0) =  1.0000

CPU: 24 x Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Hyperthreading: on

gcc 9.3.0 vs. clang 10:

./pyshbench ../stockfish_gcc ../stockfish_clang_lto 100
==================
base (...tockfish_gcc) =    2137430  +/- 521
test (...sh_clang_lto) =    2074978  +/- 408
diff                   =     -62453  +/- 696

speedup        = -0.0292
P(speedup > 0) =  0.0000

CPU: 24 x Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Hyperthreading: on

Benchmarking with current Clang 12 shows that and ThinLTO is a pessimization, see issue official-stockfish#3341. closes official-stockfish#3345 No functional change.

Torom · 2021-02-10T20:10:01Z

After your two clang paches, clang is almost as strong as gcc on my system. Very impressive.

gcc 10.2.0 vs. clang 11.0.1:

./pyshbench ../stockfish-git ../stockfish-git_clang 100
==================
base (...tockfish-git) =    2186535  +/- 5842
test (...sh-git_clang) =    2175076  +/- 4119
diff                   =     -11460  +/- 8120

speedup        = -0.0052
P(speedup > 0) =  0.0029

CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on

Benchmarking with current Clang 12 shows that and ThinLTO is a pessimization, see issue official-stockfish#3341. closes official-stockfish#3345 No functional change.

Passed cutechess STC: Score of RF 14 vs RF 3: 1124 - 536 - 1130 [0.605] 2790 ... RF 14 playing White: 620 - 246 - 530 [0.634] 1396 ... RF 14 playing Black: 504 - 290 - 600 [0.577] 1394 ... White vs Black: 910 - 750 - 1130 [0.529] 2790 Elo difference: 74.3 +/- 10.0, LOS: 100.0 %, DrawRatio: 40.5 % SPRT: llr 2.95 (100.2%), lbound -2.94, ubound 2.94 - H1 was accepted Passed cutechess LTC: Score of RF 14 vs RF 3: 875 - 355 - 1062 [0.613] 2292 ... RF 14 playing White: 519 - 153 - 475 [0.660] 1147 ... RF 14 playing Black: 356 - 202 - 587 [0.567] 1145 ... White vs Black: 721 - 509 - 1062 [0.546] 2292 Elo difference: 80.2 +/- 10.4, LOS: 100.0 %, DrawRatio: 46.3 % SPRT: llr 2.95 (100.0%), lbound -2.94, ubound 2.94 - H1 was accepted Bench: 4918790 (+24 squashed commit) Squashed commit: [5118c15] Use Bitboard over Square in movegen It uses pos.checkers() on target when movegen is the type of EVASION. It simplify the code. And it's also expected a slightly speed up, because Bitboard is more direct when doing bitwise. Passed STC: LLR: 2.93 (-2.94,2.94) {-1.25,0.25} Total: 28176 W: 2506 L: 2437 D: 23233 Ptnml(0-2): 80, 1904, 10063, 1949, 92 https://tests.stockfishchess.org/tests/view/60421d18ddcba5f0627bb6a9 Passed LTC: LLR: 2.93 (-2.94,2.94) {-0.75,0.25} Total: 9704 W: 402 L: 341 D: 8961 Ptnml(0-2): 3, 279, 4230, 334, 6 https://tests.stockfishchess.org/tests/view/60422823ddcba5f0627bb6ae closes official-stockfish#3383 No functional change [42b44ee] Deal with commented lines in UCI input commands starting with '#' as the first character will be ignored closes official-stockfish#3378 No functional change [550d3d8] Add Stockfish namespace. fixes official-stockfish#3350 and is a small cleanup that might make it easier to use SF in separate projects, like a NNUE trainer or similar. closes official-stockfish#3370 No functional change. [6ccec01] Clean functions returning by const values The codebase contains multiple functions returning by const-value. This patch is a small cleanup making those function returns by value instead, removing the const specifier. closes official-stockfish#3328 No functional change [5801707] Import author list and copyright years Easier to do it this way than track all the cherry picks one by one. [fd5fc27] Use correct chess terms + fix spelling. - "discovered check" (instead of "discovery check") - "en passant" (instead of "en-passant") - "pseudo-legal" before a noun (instead of "pseudo legal") - "3-fold" (instead of "3fold") closes official-stockfish#3294 No functional change. [3d8a301] Better code for hash table generation This patch removes some magic numbers in TT bit management and introduce proper constants in the code, to improve documentation and ease further modifications. No function change [f314344] Allow TT entries with key16==0 to be fetched Fix the issue where a TT entry with key16==0 would always be reported as a miss. Instead, we'll use depth8 to detect whether the TT entry is occupied. In order to do that, we'll change DEPTH_OFFSET to -7 (depth8==0) to distinguish between an unoccupied entry and the otherwise lowest possible depth, i.e., DEPTH_NONE (depth8==1). To prevent a performance regression, we'll reorder the TT entry fields by the access order of TranspositionTable::probe(). Memory in general works fastest when accessed in sequential order. We'll also match the store order in TTEntry::save() with the entry field order, and re-order the 'if-or' expressions in TTEntry::save() from the cheapest to the most expensive. Finally, as we now have a proper TT entry occupancy test, we'll fix a minor corner case with hashfull reporting. To reproduce: - Use a big hash - Either: a. Start 31 very quick searches (this wraparounds generation to 0); or b. Force generation of the first search to 0. - go depth infinite Before the fix, hashfull would incorrectly report nearly full hash immediately after the search start, since TranspositionTable::hashfull() used to consider only the entry generation and not whether the entry was actually occupied. STC: LLR: 2.95 (-2.94,2.94) {-0.25,1.25} Total: 36848 W: 4091 L: 3898 D: 28859 Ptnml(0-2): 158, 2996, 11972, 3091, 207 https://tests.stockfishchess.org/tests/view/5f3f98d5dc02a01a0c2881f7 LTC: LLR: 2.95 (-2.94,2.94) {0.25,1.25} Total: 32280 W: 1828 L: 1653 D: 28799 Ptnml(0-2): 34, 1428, 13051, 1583, 44 https://tests.stockfishchess.org/tests/view/5f3fe77a87a5c3c63d8f5332 closes official-stockfish#3048 Bench: 3742162 [5b421ab] Enable New Pass Manager for Clang. It's about 1% speedup for Stockfish. Result of 100 runs ================== base (...fish_clang12) = 1946851 +/- 3717 test (./stockfish ) = 1967276 +/- 3408 diff = +20425 +/- 2438 speedup = +0.0105 P(speedup > 0) = 1.0000 Thanks to David Major for making me aware of this part of LLVM development. closes official-stockfish#3346 No functional change [f7f7e38] Disable ThinLTO when using Clang. Benchmarking with current Clang 12 shows that and ThinLTO is a pessimization, see issue official-stockfish#3341. closes official-stockfish#3345 No functional change. [6373f89] Simplify Chess 960 castling a little cleanup, and small speedup (about 0.3%) for Chess 960. Verified with perft on a large set of chess960 positions. Closes official-stockfish#3317 No functional change [3f4d84c] Speed Up Perft Search It speeds up generate<LEGAL>, and thus perft, roughly by 2-3%. closes official-stockfish#3312 No functional change [88974f5] Clean Up Castling in gives_check There is no need to add rto or kto on the Bitboard which represents the pieces. STC: LLR: 2.93 (-2.94,2.94) {-1.25,0.25} Total: 57064 W: 5096 L: 5067 D: 46901 Ptnml(0-2): 202, 3862, 20355, 3931, 182 https://tests.stockfishchess.org/tests/view/6005ea2c6019e097de3efa55 LTC: LLR: 2.92 (-2.94,2.94) {-0.75,0.25} Total: 30088 W: 1094 L: 1052 D: 27942 Ptnml(0-2): 10, 882, 13217, 926, 9 https://tests.stockfishchess.org/tests/view/6006115a6019e097de3efa6e closes official-stockfish#3311 No functional change. [2129ec9] Avoid more expensive legality check speedup of the code, enough to pass STC, failed LTC. Passed STC: LLR: 2.93 (-2.94,2.94) {-0.25,1.25} Total: 68928 W: 6334 L: 6122 D: 56472 Ptnml(0-2): 233, 4701, 24369, 4943, 218 https://tests.stockfishchess.org/tests/view/6002747f6019e097de3ef8dc Failed LTC: LLR: -2.96 (-2.94,2.94) {0.25,1.25} Total: 44560 W: 1702 L: 1675 D: 41183 Ptnml(0-2): 25, 1383, 19438, 1408, 26 https://tests.stockfishchess.org/tests/view/6002a4836019e097de3ef8e3 About 1% speedup: Result of 50 runs ================== base (...kfish.master) = 2237500 +/- 7428 test (...ckfish.patch) = 2267003 +/- 7017 diff = +29503 +/- 4774 speedup = +0.0132 P(speedup > 0) = 1.0000 closes official-stockfish#3304 No functional change. [289fbcc] Use stable sort to make sure bench with TB yields same results everywhere. std::sort() is not stable so different implementations can produce different results: use the stable version instead. Observed for '8/6k1/5r2/8/8/8/1K6/Q7 w - - 0 1' yielding different bench results for gcc and MSVC and 3-4-5 syzygy TB prior to this patch. closes official-stockfish#3083 No functional change. [5a344f1] Fix parallel LTO issues on Windows This adds -save-temps to the linker flags when parallel LTO is used on MinGW/MSYS. fixes official-stockfish#2977 closes official-stockfish#2978 No functional change. [5da0f55] Parallelize Link Time Optimization for GCC, CLANG and MINGW This patch tries to run multiple LTO threads in parallel, speeding up the build process of optimized builds if the -j make parameter is used. This mitigates the longer linking times of optimized builds since the integration of the NNUE code. Roughly 2x build speedup. I've tried a similar patch some two years ago but it ran into trouble with old compiler versions then. Since we're on the C++17 standard now these old compilers should be obsolete. closes official-stockfish#2943 No functional change. [630995a] Remove unnecessay legality check Possible after the recent reording pos.legal(move) check official-stockfish#2941 No functional change. [6f512d9] Do move legality check before pruning. This alllows to simplify the code because the move counter haven't to be decremented later if a move isn't legal. As a side effect now illegal pruned moves doesn't included anymore in move counter. So slightly less pruning and reductions are done. STC: LLR: 2.94 (-2.94,2.94) {-1.50,0.50} Total: 111016 W: 21106 L: 21077 D: 68833 Ptnml(0-2): 1830, 13083, 25736, 12946, 1913 https://tests.stockfishchess.org/tests/view/5f28816fa5abc164f05e4c26 LTC: LLR: 2.94 (-2.94,2.94) {-1.50,0.50} Total: 39264 W: 4909 L: 4843 D: 29512 Ptnml(0-2): 263, 3601, 11854, 3635, 279 https://tests.stockfishchess.org/tests/view/5f297902a5abc164f05e4c8e closes official-stockfish#2906 Bench: 3795876 [7f6f1b8] Remove pawn tables as well, they're unused Bench: 3865928 [c7fa058] We don't need specialized endgame eval where we're going [2e3cd4d] Remove piece lists This patch removes the incrementally updated piece lists from the Position object. This has been tried before but always failed. My reasons for trying again are: * 32-bit systems (including phones) are now much less important than they were some years ago (and are absent from fishtest); * NNUE may have made SF less finely tuned to the order in which moves were generated. STC: LLR: 2.94 (-2.94,2.94) {-1.25,0.25} Total: 55272 W: 5260 L: 5216 D: 44796 Ptnml(0-2): 208, 4147, 18898, 4159, 224 https://tests.stockfishchess.org/tests/view/5fc2986a42a050a89f02c926 LTC: LLR: 2.96 (-2.94,2.94) {-0.75,0.25} Total: 16600 W: 673 L: 608 D: 15319 Ptnml(0-2): 14, 533, 7138, 604, 11 https://tests.stockfishchess.org/tests/view/5fc2f98342a050a89f02c95c closes official-stockfish#3247 Bench: 3940967 [693c459] We don't need PSQTs where we're going [c6c710d] Initial import of simple eval

gcp closed this as completed Feb 9, 2021

gcp mentioned this issue Feb 9, 2021

Disable ThinLTO when using Clang. #3345

Closed

gcp mentioned this issue Feb 10, 2021

Enable New Pass Manager for Clang. #3346

Closed

vondele pushed a commit to vondele/Stockfish that referenced this issue Feb 10, 2021

Disable ThinLTO when using Clang.

b15e3b3

Benchmarking with current Clang 12 shows that and ThinLTO is a pessimization, see issue official-stockfish#3341. closes official-stockfish#3345 No functional change.

Fanael pushed a commit to Fanael/Stockfish that referenced this issue Mar 7, 2021

Disable ThinLTO when using Clang.

f7f7e38

Benchmarking with current Clang 12 shows that and ThinLTO is a pessimization, see issue official-stockfish#3341. closes official-stockfish#3345 No functional change.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch default compiler to Clang #3341

Switch default compiler to Clang #3341

gcp commented Feb 8, 2021 •

edited

Loading

gcp commented Feb 8, 2021

Torom commented Feb 9, 2021

gcp commented Feb 9, 2021 •

edited

Loading

gcp commented Feb 9, 2021

Torom commented Feb 9, 2021

gcp commented Feb 9, 2021

gcp commented Feb 9, 2021

Torom commented Feb 9, 2021

Torom commented Feb 10, 2021

Switch default compiler to Clang #3341

Switch default compiler to Clang #3341

Comments

gcp commented Feb 8, 2021 • edited Loading

gcp commented Feb 8, 2021

Torom commented Feb 9, 2021

gcp commented Feb 9, 2021 • edited Loading

gcp commented Feb 9, 2021

Torom commented Feb 9, 2021

gcp commented Feb 9, 2021

gcp commented Feb 9, 2021

Torom commented Feb 9, 2021

Torom commented Feb 10, 2021

gcp commented Feb 8, 2021 •

edited

Loading

gcp commented Feb 9, 2021 •

edited

Loading