gcc: xtensa: make trying to replace 'l32r' with 'movi' + 'slli' regardless of optimizing for size or not, because 'l32r' is much slower than the latter on ESP8266 #33

Merged (1 commit) on Dec 20, 2021


jjsuwa-sys3175

** constant loading benchmark test **

** adjacent 3 loading, 100000 times **
MOVI instruction  : 400180 cycles (4.00 cycles/loop)
constant synthesis: 700000 cycles (7.00 cycles/loop)
L32R instruction  : 2000000 cycles (20.00 cycles/loop)

** adjacent 4 loading, 100000 times **
MOVI instruction  : 500179 cycles (5.00 cycles/loop)
constant synthesis: 900181 cycles (9.00 cycles/loop)
L32R instruction  : 2700180 cycles (27.00 cycles/loop)

** adjacent 5 loading, 100000 times **
MOVI instruction  : 600181 cycles (6.00 cycles/loop)
constant synthesis: 1100180 cycles (11.00 cycles/loop)
L32R instruction  : 3300000 cycles (33.00 cycles/loop)

** adjacent 6 loading, 100000 times **
MOVI instruction  : 700000 cycles (7.00 cycles/loop)
constant synthesis: 1300179 cycles (13.00 cycles/loop)
L32R instruction  : 4100180 cycles (41.00 cycles/loop)

(Arduino sketch is here)

From these results, the cost per load on ESP8266 is:

  • MOVI instruction : 1 cycle/load
  • constant synthesis: 2 cycles/load
  • L32R instruction : 6 to 8 cycles/load
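The per-load figures can be read off the benchmark as the increase in cycles per loop each time one more adjacent load is added. The following is a minimal sketch of that arithmetic, not part of the benchmark sketch itself; the array names and the helper `cycles_per_extra_load` are hypothetical, and the values are the rounded cycles/loop numbers from the output above.

```c
#include <assert.h>
#include <stddef.h>

/* Cycles per loop iteration for 3, 4, 5 and 6 adjacent loads,
   copied from the benchmark output (rounded cycles/loop column). */
const int movi_cycles[]  = { 4, 5, 6, 7 };
const int synth_cycles[] = { 7, 9, 11, 13 };
const int l32r_cycles[]  = { 20, 27, 33, 41 };

/* Smallest and largest cost of one additional load. */
struct step_range {
    int min;
    int max;
};

/* Hypothetical helper: difference between consecutive measurements,
   i.e. how many cycles one extra load adds to the loop body. */
struct step_range cycles_per_extra_load(const int *cycles, size_t n)
{
    assert(n >= 2);
    struct step_range r = { cycles[1] - cycles[0], cycles[1] - cycles[0] };
    for (size_t i = 2; i < n; i++) {
        int step = cycles[i] - cycles[i - 1];
        if (step < r.min)
            r.min = step;
        if (step > r.max)
            r.max = step;
    }
    return r;
}
```

For the numbers above, the steps are 1 for MOVI, 2 for constant synthesis, and 7, 6, 8 for L32R, which matches the 1, 2 and 6 to 8 cycles/load listed.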

The Xtensa reference manual says this behavior is implementation-specific:

This functionality (IRAM/IROM as data) is provided for initialization and test purposes, for which performance is not critical, so these operations may be significantly slower on some Xtensa implementations.

Xtensa(R) Instruction Set Reference Manual, "4.5.8 General RAM/ROM Option Features"

@earlephilhower
Owner

Can you compare the generated binary sizes, please, for a non-trivial example? Maybe one of the webserver ones?

I'm worried it may grow somewhat by replacing a single instruction and constant (which might be shared now, saving more space) with multiple instructions.

…dless of optimizing for size or not

because 'l32r' is much slower than the latter on ESP8266.
@jjsuwa-sys3175 changed the title on Dec 19, 2021
  from: gcc: xtensa: make always trying to replace 'l32r' with 'movi' + 'slli', because 'l32r' is much slower than the latter on ESP8266
  to:   gcc: xtensa: make trying to replace 'l32r' with 'movi' + 'slli' regardless of optimizing for size or not, because 'l32r' is much slower than the latter on ESP8266
@jjsuwa-sys3175
Author

I'm worried it may grow somewhat by replacing a single instruction and constant (which might be shared now, saving more space) with multiple instructions.

Until now, the replacement occurred only when optimizing for size (-Os, the default setting for the Arduino core), on the assumption that the reciprocal throughput of L32R may reach 1 cycle (see #20 (comment)).
For ESP8266, however, that assumption does not hold.

Again, -Os is already specified in platform.txt, so L32R (+ 4-byte literal) was always being replaced with MOVI.n + SLLI unless the option was changed to -O2.
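To illustrate which constants the replacement applies to and why it was already attractive under -Os, here is a rough sketch. It is not the condition used in the GCC patch: `synth_movi_slli` is a hypothetical helper, and it assumes the MOVI.n immediate range of -32..95 and an SLLI shift amount of 1..31 from the Xtensa ISA. The size constants assume a 3-byte L32R plus its 4-byte literal, against a 2-byte MOVI.n plus a 3-byte SLLI.

```c
#include <stdbool.h>
#include <stdint.h>

/* Code size in bytes of each form (assumed instruction encodings). */
enum {
    L32R_FORM_BYTES  = 3 + 4, /* L32R + 4-byte literal pool entry */
    SYNTH_FORM_BYTES = 2 + 3  /* MOVI.n + SLLI */
};

/* Hypothetical helper: returns true if `value` equals (imm << shift)
   for some imm in the assumed MOVI.n range and some shift in the
   assumed SLLI range, and reports the first pair found. */
bool synth_movi_slli(int32_t value, int *imm, int *shift)
{
    for (int s = 1; s <= 31; s++) {
        for (int m = -32; m <= 95; m++) {
            /* Shift as unsigned to avoid undefined behavior on
               negative or overflowing signed shifts. */
            if (((uint32_t)m << s) == (uint32_t)value) {
                *imm = m;
                *shift = s;
                return true;
            }
        }
    }
    return false;
}
```

Under these assumptions a constant such as 0x10000 can be synthesized while 0x12345678 cannot and would still need L32R. The size comparison ignores the case raised above where several L32R instructions share one literal.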

Owner

@earlephilhower left a comment


Looks reasonable, thanks. This might help BearSSL performance, since it's built with -O2, unlike the standard core, which is built with -Os.

@earlephilhower merged commit 5cf578c into earlephilhower:master Dec 20, 2021
@jjsuwa-sys3175 deleted the L32R_is_slow branch December 20, 2021 02:20
@jjsuwa-sys3175 restored the L32R_is_slow branch June 18, 2022 19:17
@jjsuwa-sys3175 deleted the L32R_is_slow branch June 18, 2022 19:18