Single bench covering classic and NNUE eval #2902
Simply switch between evals. Needs the default net to be available, including for CI; this needs some work for AppVeyor (PowerShell).
For fishtest, bench is used to evaluate the speed at which the processor can run SF. A mixed bench won't be very effective at this, as the speed ratio between the regular eval and the NNUE eval can differ considerably from machine to machine. When running a test using the regular eval, we care about the regular eval speed; when running an NNUE test, we care about the NNUE eval speed. The way fishtest computes the TC in general might need more adjustments, as using the same nps target without checking whether NNUE mode is used would significantly increase the effective TC. To fully identify a version (using the bench as a signature), having both a HC part and a NNUE part makes sense.

This implementation isn't ideal from another perspective. Stockfish is a taxing CPU workload that has some popularity, so many websites reviewing hardware include a Stockfish chess benchmark. This usually means running Stockfish's own bench with custom threads/depth/hash settings to make the test run longer and use all the available threads. The raw nps is typically used as the result, though it's not a direct fit for strength due to multi-threading losses.

The version of Stockfish used in such benchmarks doesn't change too often (the Phoronix 3990X review had an SF9 and an asmFish test, although SF11 had recently come out); however, it's inevitable that future Stockfish versions will end up being used as well. The introduction of SF-NNUE, which is very taxing on AVX2, could be a further motivation to update the version used in such benchmarks. As this data can end up being used by people (in particular chess enthusiasts) to make purchase decisions, it would be nice to make bench performance representative of actual CPU performance.

One possible idea is an NNUE subtest including AVX and a traditional HC eval subtest; another is a command-line parameter (just like depth, threads...) for the eval to use in bench. Arguably some changes to the position set could be useful too.
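To make the time-control concern concrete, here is a minimal sketch of a proportional scaling rule. It is not fishtest's actual code: the function name `scaled_time`, the target nps, and both measured nps values are hypothetical placeholders, and the rule "time scales with target nps divided by measured nps" is an assumption made for illustration.

```python
def scaled_time(base_time_s: float, target_nps: float, measured_nps: float) -> float:
    """Scale a base time so that a slower machine gets proportionally more time.

    Assumed rule (illustrative only): time = base * target_nps / measured_nps.
    """
    if measured_nps <= 0:
        raise ValueError("measured_nps must be positive")
    return base_time_s * target_nps / measured_nps


if __name__ == "__main__":
    TARGET_NPS = 1_000_000  # hypothetical shared nps target
    # Hypothetical bench results for one machine under each eval.
    measured = {"classical": 1_000_000, "NNUE": 500_000}
    for label, nps in measured.items():
        print(f"{label}: {scaled_time(10.0, TARGET_NPS, nps):.1f} s")
```

Under these made-up numbers, benching the same machine with NNUE against an unchanged target doubles the time it is given, which is the kind of effective TC increase described above.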
More ambitiously, the effective strength of N threads vs 1 thread having N times the time could be measured up to, say, 32 threads, a curve could be fit, and when running the bench with multiple threads, an additional info output line would give an "equivalent 1-thread nps" result through some formula. The raw nps would be representative of performance when running many 1th searches in parallel, while the "equivalent 1th nps" would be representative of peak performance. But this last part is unrelated to the NNUE/HC eval usage in bench.
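One way the "fit a curve, then apply a formula" idea could look is sketched below. Everything here is an assumption: the thread does not specify a curve, so the sketch uses a power law (effective speedup of N threads is N^alpha), it treats raw nps as scaling linearly with thread count, and the sample speedups are synthetic values generated from alpha = 0.8 rather than measurements. The function names are hypothetical.

```python
import math


def fit_scaling_exponent(speedups: dict[int, float]) -> float:
    """Least-squares fit of alpha in speedup(N) = N**alpha, done in log-log space.

    `speedups` maps thread count N to the measured effective speedup over 1 thread.
    """
    num = sum(math.log(n) * math.log(s) for n, s in speedups.items())
    den = sum(math.log(n) ** 2 for n in speedups)
    if den == 0:
        raise ValueError("need at least one data point with more than 1 thread")
    return num / den


def equivalent_single_thread_nps(raw_nps: float, threads: int, alpha: float) -> float:
    """Per-thread nps (raw_nps / threads) times the fitted effective speedup."""
    return raw_nps / threads * threads ** alpha


if __name__ == "__main__":
    # Synthetic placeholder data generated from alpha = 0.8, not real results.
    samples = {n: n ** 0.8 for n in (1, 2, 4, 8, 16, 32)}
    alpha = fit_scaling_exponent(samples)
    print(f"fitted alpha: {alpha:.3f}")
    print(f"equivalent 1th nps: {equivalent_single_thread_nps(16_000_000, 16, alpha):.0f}")
```

With alpha below 1, the equivalent figure is lower than the raw nps, reflecting the multi-threading losses; a real fit would need measured strength data and possibly a different functional form.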
```
./stockfish bench 16 1 13 default depth NNUE
./stockfish bench 16 1 13 default depth mixed
./stockfish bench 16 1 13 default depth classical
```
@Alayan-stk-2 the attached patch would leave some choice. I think we will need to think a bit about what is best for our purpose. For example, if we start using NNUE as the default, we would benefit from having the benchmarks based on that.
Note: this PR is not complete for non-default FEN files.
I'm closing this one in favor of #2931