
[CPU] Enable 'iree-llvmcpu-reassociate-fp-reductions' by default #13685

Merged
merged 2 commits into iree-org:main on May 19, 2023

Conversation

dcaballe
Contributor

This PR is mostly to start a discussion about enabling fp reduction reassociation by default. When this flag is disabled, we basically don't vectorize the reduction dimension at all, which results in extra unrolling of scalar instructions. It's hard for an external user to really understand the implications of this flag, or that it has to be enabled to get reasonable performance on fp reductions. In my case, even though I'm aware of it, I always forget to use it and waste time wondering "why is this random sequence of instructions scalar?" when looking at profiles. Also, in terms of accuracy, we are not paying the same attention to other operations that we approximate by default (e.g., math ops), so I think enabling this by default and providing a flag to disable it is reasonable. In the future, we may think about introducing a compilation mode that optimizes for accuracy more generally. WDYT?
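To illustrate the mechanics, here is a minimal standalone C++ sketch (not IREE code; the function names are made up). A sequential fp reduction is one serial chain of adds, so each step depends on the previous one and the loop cannot be vectorized without changing results. The reassociated form keeps one partial sum per vector lane and combines them at the end, which is exactly the rewrite that is only legal if fp addition is treated as associative:

```cpp
#include <cstddef>

// Fixed left-to-right chain: each add depends on the previous one.
// This is the order an unvectorized (flag-off) reduction computes.
float reduce_sequential(const float* v, size_t n) {
  float acc = 0.0f;
  for (size_t i = 0; i < n; ++i) acc += v[i];
  return acc;
}

// Reassociated shape: four independent partial sums, as a 4-lane
// vectorized reduction would keep, combined once at the end.
// Only legal if fp addition may be treated as associative.
float reduce_reassociated(const float* v, size_t n) {
  float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    for (int k = 0; k < 4; ++k) acc[k] += v[i + k];  // lane-parallel adds
  }
  for (; i < n; ++i) acc[0] += v[i];  // scalar tail
  return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```

The two functions can return different values whenever rounding matters, which is the accuracy trade-off discussed below.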

dcaballe added the benchmarks:x86_64, benchmarks:comp-stats, and benchmarks:android-cpu labels on May 18, 2023
@MaheshRavishankar
Contributor

I am fine with enabling this by default. For the other math ops that have similar precision issues, we should guard them with the same flag...

@MaheshRavishankar
Contributor

The test failures seem reasonable... I am OK with enabling this flag, but just beware: we might get a lot of precision-related bugs, because we are now changing the precision.
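To make the precision concern concrete, here is a small self-contained example (mine, not from the PR) where the two summation orders disagree in float:

```cpp
#include <cstdio>

int main() {
  // float carries ~24 bits of mantissa, so 1.0e8f + 1.0f rounds back
  // to 1.0e8f: fp addition is not associative.
  const float v[4] = {1.0e8f, 1.0f, -1.0e8f, 1.0f};

  // Left-to-right, as an unvectorized reduction computes it:
  // ((1e8 + 1) + -1e8) + 1 == (1e8 + -1e8) + 1 == 1.0f
  float seq = ((v[0] + v[1]) + v[2]) + v[3];

  // Two-accumulator order, as a 2-lane vectorized reduction computes it:
  // (1e8 + -1e8) + (1 + 1) == 2.0f
  float reassoc = (v[0] + v[2]) + (v[1] + v[3]);

  printf("sequential=%g reassociated=%g\n", seq, reassoc);
  return 0;
}
```

Neither answer is "wrong"; they are different roundings of the same mathematical sum, which is why tests with tight fp tolerances can start failing once the default changes.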

@github-actions

github-actions bot commented May 18, 2023

Abbreviated Benchmark Summary

@ commit e1b9c7a6c9f56dda4e1aeb5cdf383c749bf04da1 (vs. base 4795765b831d56d326fdc6de07e73e74dc7e405c)

Regressed Latencies 🚩

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileNetV3Small_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags] local_task(embedded_elf)[4-thread,full-inference,system-scheduling] with zeros @ pixel-6-pro[little-core] | 25.574 (vs. 22.033, 16.07%↑) | 25.598 | 0.277 |
| MobileSSD_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags] local_task(embedded_elf)[4-thread,full-inference,system-scheduling] with zeros @ pixel-6-pro[little-core] | 152.524 (vs. 138.154, 10.40%↑) | 152.373 | 0.665 |

Improved Latencies 🎉

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileNetV3Small_fp32(tflite) [vmvx-generic-vmvx-vmvx][experimental-flags] local_task(vmvx_module)[4-thread,full-inference,system-scheduling] with zeros @ pixel-6-pro[big-core] | 957.449 (vs. 1209.381, 20.83%↓) | 957.428 | 16.119 |
| MobileNetV2_fp32(tflite) [vmvx-generic-vmvx-vmvx][experimental-flags] local_task(vmvx_module)[4-thread,full-inference,system-scheduling] with zeros @ pixel-6-pro[big-core] | 4752.208 (vs. 5843.818, 18.68%↓) | 4805.548 | 115.906 |
| MobileNetV2_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][experimental-flags,mmt4d] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with zeros @ pixel-6-pro[big-core] | 20.715 (vs. 22.722, 8.83%↓) | 20.874 | 0.306 |

[Top 3 out of 17 results shown]

No improved or regressed compilation metrics 🏖️

For more information:

Source Workflow Run

dcaballe enabled auto-merge (squash) on May 19, 2023 06:29
hanhanW (Contributor) left a comment


I'm fine with this, but please keep in mind that there could be numerical issues. (It would be awesome if we could control these kinds of options under a single flag.)

)

iree_check_single_backend_test_suite(
    name = "check_regression_llvm-cpu",
hanhanW
Contributor


Do we exclude the test in this test suite?

dcaballe merged commit 55a2780 into iree-org:main on May 19, 2023
@hanhanW
Contributor

hanhanW commented May 19, 2023

I did not notice that auto-merge was enabled... Sorry about that if you intended to land it later.

@dcaballe
Contributor Author

Sorry, my fault. I sometimes enable it when I think the review is mostly done or the change is trivial, to minimize the turnaround if the PR is approved. Let me exclude the test.

@hanhanW
Contributor

hanhanW commented May 19, 2023

I'm surprised that the test passes. When I read the file name, I thought it would fail.

EDIT: I'm good with the current state, I'm just surprised.

@dcaballe
Contributor Author

The test would fail, but it's not part of the srcs or BACKEND_TESTS lists, so it doesn't run in the suite where the reassociation flag is now enabled by default.

dcaballe added a commit that referenced this pull request May 20, 2023
dcaballe added a commit that referenced this pull request May 20, 2023
hanhanW pushed a commit to hanhanW/iree that referenced this pull request May 26, 2023
hanhanW pushed a commit to hanhanW/iree that referenced this pull request May 26, 2023
NatashaKnk pushed a commit to NatashaKnk/iree that referenced this pull request Jul 6, 2023
NatashaKnk pushed a commit to NatashaKnk/iree that referenced this pull request Jul 6, 2023
hanhanW pushed a commit to hanhanW/iree that referenced this pull request Jul 18, 2023
Labels

benchmarks:android-cpu (Run default Android CPU benchmarks)
benchmarks:comp-stats (Run default compilation statistics benchmarks)
benchmarks:x86_64 (Run default x86_64 benchmarks)