To verify that the "solution" works, one would also need the results of running 16 processes @ 2 threads (ideally also 8@4, 4@8, and so on) and compare the nps. It would be great if you could collect some data for that.
On the other hand, we should also consider whether there is a way to reduce SF's memory bandwidth needs.
The script from https://github.com/official-monty/montytools/blob/main/BenchNormalization/benchNormToolSF.py was run with SF16.1 on two Ryzen 9 7950X (Eco mode off) systems. Both systems had DDR5-6000 RAM, but one had only 1x16GB (Ciekce) whereas the other had 2x16GB (Zuppa). The results (Final Average Benchmark NPS over a minute) are below:
Ciekce (1x16GB):
1 Process: 1869666
32 Processes: 332322
Zuppa (2x16GB):
1 Process: 1814631
32 Processes: 511087
The system with double the bandwidth had 54% higher nps when running 32 processes.
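As a sanity check, the 54% figure follows directly from the 32-process numbers above (values copied from the runs reported here):

```python
# SF16.1 average bench nps at 32 processes, from the two systems above.
ciekce_32 = 332322  # 1x16GB, single-channel
zuppa_32 = 511087   # 2x16GB, dual-channel

# Relative speedup from doubling RAM bandwidth.
speedup = zuppa_32 / ciekce_32 - 1
print(f"{speedup:.0%}")  # -> 54%
```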
The script from https://github.com/official-monty/montytools/blob/main/BenchNormalization/benchNormToolMonty.py was run with the Monty chess engine on the same systems, providing a reference point that is not RAM-bandwidth limited:
Ciekce (1x16GB):
1 Process: 685668
32 Processes: 365731
Zuppa (2x16GB):
1 Process: 688477
32 Processes: 370616
Applying the same 1-to-32-process scaling ratio to SF16.1 would predict roughly 1M nps with 32 processes, about double what was measured. It is therefore likely that the Zuppa system running SF is still limited by RAM bandwidth, even though it has the highest bandwidth possible on consumer motherboards (dual channel).
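The ~1M figure comes from scaling Zuppa's single-process SF nps by Monty's 1-to-32-process ratio, since Monty's scaling reflects CPU/SMT contention without the RAM bandwidth bottleneck:

```python
# Zuppa (2x16GB) numbers from the runs above.
sf_1p = 1814631     # SF16.1, 1 process
sf_32p = 511087     # SF16.1, 32 processes
monty_1p = 688477   # Monty, 1 process
monty_32p = 370616  # Monty, 32 processes

# Expected SF nps at 32 processes if SF scaled like Monty,
# i.e. if it were not limited by RAM bandwidth.
expected = sf_1p * (monty_32p / monty_1p)
print(f"expected ~{expected / 1e6:.2f}M nps, measured {sf_32p / 1e6:.2f}M nps")
```

The expected value comes out near 1M nps, while the measured value is only about 0.5M nps.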
Furthermore, the net in dev is 20% larger, and this issue is expected to become even more severe in the future as CPU speeds advance faster than RAM bandwidth.
The only solution I see is to raise the default thread count for each test from 1 to 2. This may require the fastchess migration to prevent time losses as TC is scaled down. Reducing the number of processes to the number of physical cores is still RAM-bandwidth limited today, and reducing processes further results in poor CPU utilization.
Additionally, the method by which we measure the nps of a worker is invalid. Currently we run one process with a bench and one process searching with n-1 threads. This doesn't account for the RAM bandwidth limitations discussed above, so the measured nps is far higher than the real nps.