"Casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C."
Let's look a bit closer at some popular passive benchmarks...
DMIPS are 'Dhrystone MIPS', the single-threaded score of the Dhrystone benchmark developed 1984 (four decades ago!) as an improvement over the older MIPS metric ('million instructions per second' which became kinda pointless back then with the 'RISC vs. CISC' battle).
Written in the programming languages popular back then (FORTRAN, PL/1, SAL, ALGOL 68, and Pascal) and suited for the single-core CPUs of that time (almost every modern MCU is outperforming now) Dhrystone results do not represent anything that runs on today's computers. DMIPS since decades are misleading for the following reasons (quoted from Wikipedia and an ARM White Paper):
- Dhrystone features unusual code that is not usually representative of modern real-life programs.
- Dhrystone is susceptible to compiler optimizations. For example, it does a lot of string copying in an attempt to measure string copying performance. However, the strings in Dhrystone are of known constant length and their starts are aligned on natural boundaries, two characteristics usually absent from real programs. Therefore, an optimizer can replace a string copy with a sequence of word moves without any loops, which will be much faster. This optimization consequently overstates system performance, sometimes by more than 30%.
- Dhrystone's small code size may fit in the instruction cache of a modern CPU, so that instruction fetch performance is not rigorously tested. Similarly, Dhrystone may also fit completely in the data cache, thus not exercising data cache miss performance. To counter fits-in-the-cache problem, the SPECint benchmark was created in 1988 to include a suite of (initially 8) much larger programs (including a compiler) which could not fit into L1 or L2 caches of that era.
- Dhrystone numbers actually reflect the performance of the C compiler and libraries, probably more so than the performance of the processor itself
- Dhrystone’s execution is largely spent in standard C library functions, such as strcmp(),strcpy(), and memcpy(). Compiler vendors generally provide these libraries that are typically optimized and hand-written in assembly language. While you may think you are benchmarking a processor, you are really benchmarking are the compiler writer’s optimizations of the C library functions for a particular platform
Maybe even more concerning is the completely flawed way those scores are generated in the wild. And of course the results you find somewhere on the net usually lack all the important info (like which OS, which libs and which compiler with which flags has been used).
Using the dhrystonePi64
binary (Dhrystone Benchmark, Version 2.1, Language: C or C++)' from http://www.roylongbottom.org.uk/dhrystone%20results.htm on a RK3588 device with four A76 CPU cores combined with four A55 while switching the memory clockspeed between 2112 (performance
DMC governor) and 528 MHz (powersave
DMC governor) we get these results:
Dhrystone 2.1 result | A76 / 2112 MHz | A76 / 528 MHz | A55 / 2112 MHz | A55 / 528 MHz |
---|---|---|---|---|
Nanoseconds one Dhrystone run | 30.43 | 30.85 | 90.96 | 90.84 |
Dhrystones per Second | 32860262 | 32415115 | 10994401 | 11008762 |
VAX MIPS rating | 18702.48 | 18449.13 | 6257.49 | 6265.66 |
As can be seen memory clock doesn't matter at all since Dhrystone was already critized decades ago for its small working set fitting completely into CPU caches of that era.
When limiting the A76 CPU cores to the same 1.8 GHz the A55 are clocked with we get this result and as such a DMIPS/MHz comparison ratio:
Nanoseconds one Dhrystone run: 39.15
Dhrystones per Second: 25542413
VAX MIPS rating = 14537.51
The VAX MIPS ratings generated with same dhrystone
binary suggest the A76 being 2.32 faster than an A55 at same clockspeed (14540 / 6260 = 2.32). Interesting since places like Wikipedia tell us A76 would be 3.5 – 4.1 times faster than the A55 of this popular DynamIQ pairing (see table below). What went wrong at Wikipedia? Maybe ignoring Dhrystone being more a compiler than a hardware benchmark in the 'fire and forget' mode it's always used?
One of the few examples of using Dhrystone in a non flawed way (same Dhrystone binary as such same compiler version and same compiler flags and on the same OS image as such same libraries) it looks like this with few different ARMv8 Cortex cores:
- A35 – 1.7 DMIPS/MHz
- A53 – 2.2 DMIPS/MHz
- A57 – 4.1 DMIPS/MHz
- A72 – 4.5 DMIPS/MHz
- A73 – 4.8 DMIPS/MHz
- A75 – 6.1 DMIPS/MHz
- A77 – 7.3 DMIPS/MHz
But of course you also find totally different numbers all over the web, for example at Wikipedia, Baselabs and even two differing DMIPS/MHz listings at bluelucky.
ARM Core | Measured | Wikipedia | Baselabs | bluelucky 1 | bluelucky 2 |
---|---|---|---|---|---|
A5 | 1.57 | ||||
A7 | 1.9 | 1.9 | 1.9 | ||
A8 | 2.0 | 2.0 | 2.0 | ||
A9 | 2.5 | 2.0 | 2.5 | ||
A15 | 3.5 | 4.0 | 3.4 | ||
A17 | 2.8 | 4.0 | 3.2 | ||
A32 | 2.3 | 2.3 | |||
A35 | 1.7 | 1.78 | 2.5 | 2.5 | |
A53 | 2.2 | 2.3 | 2.3 | 2.3 | 2.3 |
A55 | 3 | 3 | 2.3 | 2.7 | |
A57 | 4.1 | 4.1 – 4.8 | 4.6 | 4.1 | |
A72 | 4.5 | 6.3 – 7.3 | 7.4 | 5.4 | 4.7 |
A73 | 4.8 | 7.4 – 8.5 | 7.0 | 4.8 | |
A75 | 6.1 | 8.2 – 9.5 | 7.0 | 5.2 | |
A76 | 10.7 – 12.4 | 12 | |||
A77 | 7.3 | 13 – 16 |
The correctly measured Dhrystone MIPS/MHz score suggests Cortex-A72 (an out-of-order big core) being more than twice as fast as the corresponding Cortex-A53 (an in-order little core meant to be combined with A72/A73 for big.LITTLE hybrid CPU designs). But when trusting into Wikipedia A72 is almost 3 times faster. And with the Cortex-A77 for example it gets even more weird since Wikipedia numbers and correctly determined differ even more.
Blender is a popular open source render engine/tool that got an own benchmark mode/tool few years ago. Since I was interested in Apple's raytracing functionality introduced with their M3 SoCs (this little patch does the magic from version 4.0.0 on) I compared 4.0.0 with 3.6.0 scores:
GPU | 4.0.0 score | 3.6.0 score | difference |
---|---|---|---|
Apple M3 Max (GPU - 40 cores) | 3417.29 | 3014.83 | 113.3% |
Apple M3 Pro (GPU - 18 cores) | 1510.37 | 1314.46 | 114.9% |
So 'hardware raytracing' makes up for a less than 15% performance improvement? Let's have a closer look whether benchmark scores done with different versions can be compared in the first place...
Grabbing data from https://opendata.blender.org/ on 22th Nov 2023 and filtering out all devices with less than 4 scores (52 GPU models remaining) we see a 'drop in performance' with 46 of them compared to the older 3.6.0 version. Especially Nvidia GPUs are affected (RTX 4060 Ti being the 'worst') and Apple's SoCs as such we can assume that the benefit of having HW accelerated raytracing on the M3 SoCs accounts for a performance improvement in Blender more close to 20%.
Comparing 4.0.0 with 3.6.0 in detail
grep "^\"" Blender-4.0.0.csv | while read ; do
Device="$(awk -F'"' '{print $2}' <<<"${REPLY}")"
Score4="$(awk -F'"' '{print $4}' <<<"${REPLY}")"
Score3="$(grep "\"${Device}\"" Blender-3.6.0.csv | awk -F'"' '{print $4}')"
Diff="$(awk '{printf ("%0.1f",100*$1/$2); }' <<<"${Score4} ${Score3}")"
echo -e "| ${Device} | ${Score4} | ${Score3} | ${Diff}% |"
done | sort -t '|' -k 5 -n
GPU | 4.0.0 score | 3.6.0 score | difference |
---|---|---|---|
NVIDIA GeForce RTX 4060 Ti | 3451.59 | 4306.28 | 80.2% |
NVIDIA GeForce RTX 2060 | 1541.86 | 1851.51 | 83.3% |
NVIDIA GeForce RTX 2070 | 2074.64 | 2441.76 | 85.0% |
NVIDIA GeForce RTX 4090 | 11337.02 | 13093.11 | 86.6% |
NVIDIA GeForce RTX 3080 Ti | 5253.9 | 6055.71 | 86.8% |
NVIDIA GeForce RTX 3070 Ti | 3557.07 | 4092.95 | 86.9% |
NVIDIA GeForce RTX 4060 | 3056.69 | 3482.13 | 87.8% |
NVIDIA GeForce RTX 3080 | 4605.96 | 5227.13 | 88.1% |
NVIDIA GeForce RTX 3070 | 3268.63 | 3704.15 | 88.2% |
NVIDIA GeForce GTX 1660 Ti | 753.36 | 851.8 | 88.4% |
NVIDIA GeForce RTX 2060 SUPER | 2167.41 | 2449.2 | 88.5% |
NVIDIA GeForce RTX 3060 Ti | 2835.63 | 3195.27 | 88.7% |
NVIDIA GeForce RTX 4080 Laptop GPU | 5650.66 | 6371.23 | 88.7% |
NVIDIA GeForce RTX 3060 | 2246.81 | 2531.17 | 88.8% |
NVIDIA GeForce RTX 4070 Ti | 6514.48 | 7290.21 | 89.4% |
NVIDIA GeForce RTX 4080 | 8558.09 | 9575.48 | 89.4% |
NVIDIA GeForce RTX 3090 | 5651.84 | 6289.07 | 89.9% |
NVIDIA GeForce RTX 2080 SUPER | 2357.52 | 2617.67 | 90.1% |
NVIDIA GeForce RTX 4090 Laptop GPU | 7388.08 | 8203.46 | 90.1% |
NVIDIA GeForce GTX 1660 SUPER | 749.46 | 830.35 | 90.3% |
NVIDIA GeForce RTX 4050 Laptop GPU | 2610.53 | 2889.12 | 90.4% |
NVIDIA GeForce RTX 3050 Laptop GPU | 1212.3 | 1340.07 | 90.5% |
NVIDIA GeForce RTX 3060 Laptop GPU | 2390.45 | 2617.27 | 91.3% |
NVIDIA GeForce RTX 4060 Laptop GPU | 3351.88 | 3645.67 | 91.9% |
NVIDIA GeForce RTX 4070 Laptop GPU | 3674.3 | 3999.65 | 91.9% |
Apple M2 Max (GPU - 38 cores) | 1765.03 | 1914.88 | 92.2% |
NVIDIA GeForce GTX 1070 | 528.56 | 573.36 | 92.2% |
NVIDIA GeForce RTX 2070 SUPER | 2398.9 | 2602.16 | 92.2% |
NVIDIA GeForce RTX 2080 Ti | 3075.79 | 3333.86 | 92.3% |
NVIDIA GeForce RTX 4070 | 5581.39 | 6028.95 | 92.6% |
Apple M1 Max (GPU - 32 cores) | 933.21 | 1006.63 | 92.7% |
NVIDIA GeForce GTX 1080 Ti | 829.75 | 894.47 | 92.8% |
AMD Radeon RX 6800 | 1793.94 | 1929.72 | 93.0% |
NVIDIA GeForce RTX 3070 Ti Laptop GPU | 3071.26 | 3287.45 | 93.4% |
AMD Radeon RX 7800 XT | 2270 | 2427.85 | 93.5% |
Apple M2 Max (GPU - 30 cores) | 1451.12 | 1550.73 | 93.6% |
Apple M1 (GPU - 8 cores) | 249.92 | 265.98 | 94.0% |
Apple M2 Ultra (GPU - 76 cores) | 3214.87 | 3420.98 | 94.0% |
Intel Arc A770 Graphics | 1980.98 | 2106.39 | 94.0% |
AMD Radeon RX 6700 XT | 1490.09 | 1566.79 | 95.1% |
Apple M1 Pro (GPU - 16 cores) | 469.32 | 487.02 | 96.4% |
Apple M1 Max (GPU - 24 cores) | 774.97 | 796.77 | 97.3% |
AMD Radeon RX 7900 XTX | 3958.38 | 3980.96 | 99.4% |
AMD Radeon RX 6900 XT | 2597.21 | 2611.39 | 99.5% |
NVIDIA RTX A4000 | 3397.06 | 3408.64 | 99.7% |
AMD Radeon RX 6800 XT | 2432.05 | 2437.77 | 99.8% |
Intel Arc A750 Graphics | 2058.68 | 2054.02 | 100.2% |
NVIDIA GeForce RTX 3070 Laptop GPU | 3171.62 | 3161.06 | 100.3% |
AMD Radeon RX 6950 XT | 2776.02 | 2751.71 | 100.9% |
AMD Radeon RX 6700 | 1404.47 | 1347.65 | 104.2% |
Apple M3 Max (GPU - 40 cores) | 3417.29 | 3014.83 | 113.3% |
Apple M3 Pro (GPU - 18 cores) | 1510.37 | 1314.46 | 114.9% |
Does this only affect 3.6.0 vs. 4.0.0 so that we at least can rely on Blender 3.x scores to be comparable? Nope, there it's even worse. 3.6.0 vs. 3.0.1 ends up with some GPUs becoming 'three to four times faster'.
Comparing 3.6.0 with 3.0.1 in detail
grep "^\"" Blender-3.6.0.csv | while read ; do
Device="$(awk -F'"' '{print $2}' <<<"${REPLY}")"
Score4="$(awk -F'"' '{print $4}' <<<"${REPLY}")"
Score3="$(grep "\"${Device}\"" Blender-3.0.1.csv | awk -F'"' '{print $4}')"
Diff="$(awk '{printf ("%0.1f",100*$1/$2); }' <<<"${Score4} ${Score3}")"
echo -e "| ${Device} | ${Score4} | ${Score3} | ${Diff}% |"
done | sort -t '|' -k 5 -n
GPU | 3.6.0 score | 3.0.1 score | difference |
---|---|---|---|
NVIDIA GeForce GTX 660 | 124.11 | 150.92 | 82.2% |
NVIDIA GeForce GTX 1060 6GB | 390.68 | 443.65 | 88.1% |
NVIDIA GeForce GTX 1050 | 185.12 | 208.68 | 88.7% |
NVIDIA GeForce GTX 1070 | 573.36 | 624.6 | 91.8% |
NVIDIA GeForce RTX 3070 Ti Laptop GPU | 3287.45 | 3495.73 | 94.0% |
NVIDIA Quadro RTX 4000 | 2342.57 | 2485.81 | 94.2% |
NVIDIA GeForce GTX 1650 | 480.56 | 505.67 | 95.0% |
NVIDIA GeForce GTX 1050 Ti | 231.88 | 242.73 | 95.5% |
NVIDIA GeForce GTX 1080 Ti | 894.47 | 935.9 | 95.6% |
NVIDIA Quadro RTX 6000 | 3370.77 | 3521.78 | 95.7% |
NVIDIA GeForce GTX 1080 | 621.83 | 643.01 | 96.7% |
NVIDIA GeForce GTX 1660 Ti | 851.8 | 879.17 | 96.9% |
NVIDIA GeForce GTX 1650 Ti | 518.39 | 533.94 | 97.1% |
NVIDIA GeForce GTX 970 | 323.6 | 333.36 | 97.1% |
NVIDIA GeForce GTX 1660 | 777.73 | 799.05 | 97.3% |
NVIDIA GeForce RTX 3080 Laptop GPU | 3300.05 | 3378.88 | 97.7% |
NVIDIA GeForce GTX 1660 SUPER | 830.35 | 849.14 | 97.8% |
NVIDIA GeForce RTX 2060 SUPER | 2449.2 | 2487.79 | 98.4% |
NVIDIA GeForce RTX 2070 with Max-Q Design | 2026.81 | 2055.26 | 98.6% |
NVIDIA GeForce GTX 1060 | 380.67 | 385.74 | 98.7% |
NVIDIA GeForce RTX 2080 Ti | 3333.86 | 3373.16 | 98.8% |
NVIDIA GeForce RTX 3060 | 2531.17 | 2513.31 | 100.7% |
NVIDIA GeForce RTX 3050 | 1659.88 | 1629.28 | 101.9% |
NVIDIA GeForce RTX 2060 | 1851.51 | 1809.77 | 102.3% |
NVIDIA GeForce RTX 2080 | 2549.7 | 2490.22 | 102.4% |
NVIDIA GeForce RTX 3060 Ti | 3195.27 | 3120.36 | 102.4% |
AMD Radeon RX 5700 XT | 955.06 | 932.15 | 102.5% |
NVIDIA GeForce RTX 2080 SUPER | 2617.67 | 2535.25 | 103.3% |
NVIDIA GeForce RTX 2070 SUPER | 2602.16 | 2505.17 | 103.9% |
NVIDIA GeForce RTX 3080 | 5227.13 | 5029.25 | 103.9% |
NVIDIA GeForce RTX 3070 Laptop GPU | 3161.06 | 3023.2 | 104.6% |
NVIDIA GeForce RTX 3070 | 3704.15 | 3506.28 | 105.6% |
NVIDIA RTX A6000 | 5785.7 | 5472.1 | 105.7% |
NVIDIA GeForce RTX 3080 Ti | 6055.71 | 5711.42 | 106.0% |
NVIDIA GeForce RTX 3070 Ti | 4092.95 | 3849.24 | 106.3% |
NVIDIA GeForce RTX 2070 | 2441.76 | 2252.86 | 108.4% |
NVIDIA GeForce RTX 3090 | 6289.07 | 5764.34 | 109.1% |
NVIDIA GeForce RTX 3060 Laptop GPU | 2617.27 | 2372.01 | 110.3% |
NVIDIA GeForce RTX 3050 Laptop GPU | 1340.07 | 1207.17 | 111.0% |
AMD Radeon RX 6700 XT | 1566.79 | 1359.52 | 115.2% |
AMD Radeon RX 6900 XT | 2611.39 | 2262.86 | 115.4% |
AMD Radeon RX 6700S | 918.46 | 789.23 | 116.4% |
NVIDIA GeForce RTX 3080 Ti Laptop GPU | 3978.32 | 3385.31 | 117.5% |
AMD Radeon RX 5500 XT | 506.08 | 428.89 | 118.0% |
AMD Radeon RX 6800 XT | 2437.77 | 2061.51 | 118.3% |
AMD Radeon PRO W6800 | 1880.56 | 1584.62 | 118.7% |
AMD Radeon RX 6600 XT | 1103.97 | 928.68 | 118.9% |
AMD Radeon RX 6600 | 1011.36 | 850.45 | 118.9% |
NVIDIA GeForce RTX 3050 Ti Laptop GPU | 1514.43 | 1253.98 | 120.8% |
NVIDIA Tesla T4 | 1727.73 | 445.36 | 387.9% |
NVIDIA RTX A2000 8GB Laptop GPU | 1473.89 | 375.63 | 392.4% |
As usual: scores generated with different software versions can't be compared!