gcc: xtensa: make trying to replace 'l32r' with 'movi' + 'slli' regardless of optimizing for size or not, because 'l32r' is much slower than the latter on ESP8266 #33

jjsuwa-sys3175 · 2021-12-19T15:23:54Z

** constant loading benchmark test **

** adjacent 3 loading, 100000 times **
MOVI instruction  : 400180 cycles (4.00 cycles/loop)
constant synthesis: 700000 cycles (7.00 cycles/loop)
L32R instruction  : 2000000 cycles (20.00 cycles/loop)

** adjacent 4 loading, 100000 times **
MOVI instruction  : 500179 cycles (5.00 cycles/loop)
constant synthesis: 900181 cycles (9.00 cycles/loop)
L32R instruction  : 2700180 cycles (27.00 cycles/loop)

** adjacent 5 loading, 100000 times **
MOVI instruction  : 600181 cycles (6.00 cycles/loop)
constant synthesis: 1100180 cycles (11.00 cycles/loop)
L32R instruction  : 3300000 cycles (33.00 cycles/loop)

** adjacent 6 loading, 100000 times **
MOVI instruction  : 700000 cycles (7.00 cycles/loop)
constant synthesis: 1300179 cycles (13.00 cycles/loop)
L32R instruction  : 4100180 cycles (41.00 cycles/loop)

(Arduino sketch is here)

it concludes:

MOVI instruction : 1 cycle/load
constant synthesis: 2 cycles/load
L32R instruction : 6 ~ 8 cycles/load

on ESP8266.
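
The per-load figures can be recovered from the per-loop numbers by subtracting a fixed loop overhead. A minimal sketch, assuming one cycle of overhead per iteration; that value is inferred from the MOVI rows (N loads take N + 1 cycles) and is not stated in the benchmark output. The names `cyclesPerLoad` and `kAssumedLoopOverhead` are hypothetical.

```cpp
#include <cassert>

// Assumed per-iteration loop overhead in cycles. This is an inference from the
// MOVI rows above (3 loads -> 4 cycles, 6 loads -> 7 cycles), not a measured value.
const double kAssumedLoopOverhead = 1.0;

// Hypothetical helper: estimate cycles per load from a cycles-per-loop figure
// and the number of adjacent loads in the loop body.
inline double cyclesPerLoad(double cyclesPerLoop, int loads) {
    assert(loads > 0);
    return (cyclesPerLoop - kAssumedLoopOverhead) / loads;
}
```

Under that assumption, the L32R rows (20, 27, 33 and 41 cycles/loop for 3 to 6 loads) work out to roughly 6.3 to 6.7 cycles per load, which is consistent with the range quoted above.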

the refman says this behavior is implementation-specific:

This functionality (IRAM/IROM as data) is provided for initialization and test purposes, for which performance is not critical, so these operations may be significantly slower on some Xtensa implementations.

Xtensa(R) Instruction Set Reference Manual, "4.5.8 General RAM/ROM Option Features"

earlephilhower · 2021-12-19T19:25:10Z

Can you compare the generated binary sizes, please, for a non-trivial example? Maybe one of the webserver ones?

I'm worried it may grow somewhat by replacing a single instruction and constant (which might be shared now, saving more space) with multiple instructions.


jjsuwa-sys3175 · 2021-12-19T22:04:12Z

> I'm worried it may grow somewhat by replacing a single instruction and constant (which might be shared now, saving more space) with multiple instructions.

until now, the replacement occurred only when optimizing for size (-Os, the default setting for the Arduino core), on the assumption that the reciprocal throughput of L32R may reach 1 cycle
(see #20 (comment));
however, that assumption does not hold for ESP8266.

Again, -Os was already specified in platform.txt, so replacing L32R (+ 4-byte literal) with MOVI.n + SLLI was always done unless the option was changed to -O2.
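
To make the replacement concrete, here is a rough sketch of which constants fit a MOVI + SLLI pair and what the pair costs in code bytes. It is not GCC's implementation: `Synth`, `trySynthesize` and `synthBytes` are hypothetical names, and the sketch assumes the 12-bit signed MOVI immediate (-2048..2047), the narrow MOVI.N range (-32..95, Code Density option), and 3-byte L32R/MOVI/SLLI encodings.

```cpp
#include <cassert>
#include <cstdint>

// Result of trying to express a constant as (imm << shift).
struct Synth {
    bool ok;      // true if the constant needs, and fits, MOVI + SLLI
    int32_t imm;  // immediate for MOVI / MOVI.N
    int shift;    // shift amount for SLLI (1..31)
};

// Hypothetical check: strip trailing zero bits and test whether what remains
// fits MOVI's 12-bit signed immediate. Constants that already fit a single
// MOVI are reported as ok == false because no shift is needed.
inline Synth trySynthesize(int32_t value) {
    if (value >= -2048 && value <= 2047) return {false, value, 0};
    int32_t v = value;  // nonzero here, so the loop below terminates
    int shift = 0;
    while (v % 2 == 0) {  // exact division keeps the sign well defined
        v /= 2;
        ++shift;
    }
    if (v >= -2048 && v <= 2047) return {true, v, shift};
    return {false, 0, 0};  // does not fit; would still need L32R
}

// Code bytes for the pair: MOVI.N is 2 bytes, MOVI and SLLI are 3 bytes each.
inline int synthBytes(const Synth& s) {
    return ((s.imm >= -32 && s.imm <= 95) ? 2 : 3) + 3;
}

// L32R (3 bytes) plus its 4-byte literal, assuming the literal is not shared.
const int kL32rBytesUnshared = 3 + 4;
```

In this model, 0x10000 becomes MOVI.N 1 followed by SLLI by 16, five bytes against seven for an L32R with an unshared literal, while 0x12345678 does not fit the pattern. A literal shared by several L32R sites changes the size comparison, which is the case raised in the quoted concern.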

earlephilhower

Looks reasonable, thanks. This might help BearSSL performance, since it's built with -O2, unlike the standard core, which uses -Os.

gcc: xtensa: make trying to replace 'l32r' with 'movi' + 'slli' regar…

985faa6

…dless of optimizing for size or not because 'l32r' is much slower than the latter on ESP8266.

jjsuwa-sys3175 force-pushed the L32R_is_slow branch from e777469 to 985faa6 on December 19, 2021 21:59

earlephilhower approved these changes Dec 20, 2021


earlephilhower merged commit 5cf578c into earlephilhower:master Dec 20, 2021

jjsuwa-sys3175 deleted the L32R_is_slow branch December 20, 2021 02:20

jjsuwa-sys3175 restored the L32R_is_slow branch June 18, 2022 19:17

jjsuwa-sys3175 deleted the L32R_is_slow branch June 18, 2022 19:18

earlephilhower mentioned this pull request Jun 21, 2022

GCC 10.3 broken via backports #36

Closed
